Phi3MiniTokenizer

class torchtune.models.phi3.Phi3MiniTokenizer(path: str, special_tokens: Optional[Dict[str, int]] = None)[source]

SentencePiece tokenizer configured with Phi3 Mini’s special tokens.

Parameters:

path (str) – Path to pretrained tokenizer file.
special_tokens (Optional[Dict[str, int]]) – Mapping of special text tokens to their registered token IDs. If left as None, this will be set to the canonical Phi3 special tokens.

Examples

>>> tokenizer = Phi3MiniTokenizer("/path/to/spm_model")
>>> tokenized_text = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> print(tokenized_text)
[1, 31587, 29644, 102, 2]

decode(ids: List[int]) → str[source]

Decode token IDs to a string.

Parameters: ids (List[int]) – The input token IDs to be decoded.
Returns: The decoded text.
Return type: str

tokenize_messages(messages: List[Message], max_seq_len: Optional[int] = None, *, add_eos: bool = False, ignore_system_prompts: bool = True) → Tuple[List[int], List[bool]][source]

Tokenize a list of messages one at a time then concatenate them, returning a list of tokens and a list of masks.

Example

>>> tokenizer = Phi3MiniTokenizer(tokenizer_path)
>>> messages = [
...     Message(role="system", content="system message\n", masked=True),
...     Message(role="user", content="user prompt\n", masked=True),
...     Message(role="assistant", content="assistant response\n"),
... ]

>>> # tokenize_messages encodes messages separately and concatenates them
>>> tokenizer.tokenize_messages(messages, max_seq_len)[0]
[1, 1788, 2643, 13, 1792, 9508, 13, 465, 22137, 2933, 2]

>>> # Same result as encoding the full string in one go
>>> tokenizer.encode(''.join([message.content for message in messages]))
[1, 1788, 2643, 13, 1792, 9508, 13, 465, 22137, 2933, 2]

Parameters:

messages (List[Message]) – A list of messages, each containing role, content, and masked attributes.
max_seq_len (Optional[int]) – A max sequence length to truncate tokens to. Default: None
add_eos (bool) – Whether to append EOS after the assistant message. Default: False
ignore_system_prompts (bool) – Whether to ignore system prompts; ignoring them matches the HF implementation. Default: True

Raises:

ValueError – If the role is not “user”, “assistant”, or “system”.

Returns:

The tokenized messages, as a list of token IDs and a list of masks.

Return type:

Tuple[List[int], List[bool]]
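
The relationship between the two returned lists can be illustrated with a small self-contained sketch. This is not the torchtune implementation: ToyMessage, toy_tokenize_messages, the whitespace "vocabulary", the BOS/EOS IDs, and the choice to mask BOS are all assumptions made for illustration, and the role-specific special tokens that a real Phi3 tokenizer would insert are omitted. It only shows the general idea of encoding each message separately, concatenating the results, and giving every token the masked flag of the message it came from.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

BOS_ID, EOS_ID = 1, 2  # toy IDs, chosen for this sketch only


@dataclass
class ToyMessage:
    """Hypothetical stand-in for Message with the three documented attributes."""
    role: str
    content: str
    masked: bool = False


def toy_tokenize_messages(
    messages: List[ToyMessage],
    max_seq_len: Optional[int] = None,
    *,
    add_eos: bool = False,
    ignore_system_prompts: bool = True,
) -> Tuple[List[int], List[bool]]:
    vocab: Dict[str, int] = {}  # toy vocabulary: one ID per distinct word
    tokens: List[int] = [BOS_ID]
    mask: List[bool] = [True]  # assumption: BOS is masked in this sketch
    for message in messages:
        if message.role not in ("user", "assistant", "system"):
            raise ValueError(f"Unknown role: {message.role}")
        if message.role == "system" and ignore_system_prompts:
            continue
        # Encode this message on its own; new words get the next free ID (3, 4, ...).
        ids = [vocab.setdefault(word, len(vocab) + 3) for word in message.content.split()]
        if add_eos and message.role == "assistant":
            ids.append(EOS_ID)
        # Concatenate; every token inherits the masked flag of its message.
        tokens.extend(ids)
        mask.extend([message.masked] * len(ids))
    if max_seq_len is not None:
        # Truncate both lists together so they stay aligned.
        tokens, mask = tokens[:max_seq_len], mask[:max_seq_len]
    return tokens, mask


messages = [
    ToyMessage(role="system", content="system message", masked=True),
    ToyMessage(role="user", content="user prompt", masked=True),
    ToyMessage(role="assistant", content="assistant response"),
]
tokens, mask = toy_tokenize_messages(messages, ignore_system_prompts=False)
print(tokens)
print(mask)
```

The two lists always have the same length, so the mask can be zipped with the tokens, for example to exclude masked positions from a training loss.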
