Llama3Tokenizer

class torchtune.models.llama3.Llama3Tokenizer(path: str, special_tokens: Optional[Dict[str, int]] = None)[source]

tiktoken tokenizer configured with Llama3 Instruct’s special tokens, as described in https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3

Parameters:

path (str) – Path to pretrained tiktoken tokenizer file.
special_tokens (Optional[Dict[str, int]]) – mapping containing special text tokens and their registered token IDs. If left as None, this will be set to the canonical Llama3 special tokens.

Examples

>>> tokenizer = Llama3Tokenizer("/path/to/tt_model")
>>> tokenized_text = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> print(tokenized_text)
[1, 31587, 29644, 102, 2]

decode(token_ids: List[int], truncate_at_eos: bool = True) → str[source]

Decode a list of token ids into a string.

Parameters:

token_ids (List[int]) – The list of token ids.
truncate_at_eos (bool) – Whether to truncate the string at the end of sequence token. Default is True.

Returns:

The decoded string.

Return type:

str

tokenize_message(message: Message, tokenize_header: bool = False) → List[int][source]

Tokenize a message into a list of token ids.

Parameters:

message (Message) – The message to tokenize.
tokenize_header (bool) – Whether to prepend a tokenized header to each message.

Returns:

The list of token ids.

Return type:

List[int]

tokenize_messages(messages: List[Message], max_seq_len: Optional[int] = None, tokenize_header: bool = True, add_eos: bool = True) → Tuple[List[int], List[bool]][source]

Tokenize a list of messages into a list of token ids and masks.

Parameters:

messages (List[Message]) – The list of messages to tokenize.
max_seq_len (Optional[int]) – The maximum sequence length.
tokenize_header (bool) – Whether to prepend a tokenized header to each message.

Returns:

The list of token ids and the list of masks.

Return type:

Tuple[List[int], List[bool]]

Llama3Tokenizer

Docs

Tutorials

Resources