phi3_mini_tokenizer¶
- torchtune.models.phi3.phi3_mini_tokenizer(path: str, special_tokens_path: Optional[str] = None, max_seq_len: Optional[int] = None, prompt_template: Optional[Union[str, Dict[Literal['system', 'user', 'assistant', 'ipython'], Tuple[str, str]]]] = None) → Phi3MiniTokenizer [source]¶
Phi-3 Mini tokenizer. Ref: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/main/tokenizer_config.json
- Parameters:
path (str) – Path to the SPM tokenizer model.
special_tokens_path (Optional[str]) – Path to tokenizer.json from Hugging Face model files that contains all registered special tokens, or a local json file structured similarly. Default is None to use the canonical Phi3 special tokens.
max_seq_len (Optional[int]) – maximum sequence length for tokenizing a single list of messages, after which the input will be truncated. Default is None.
prompt_template (Optional[_TemplateType]) – optional specified prompt template. If a string, it is assumed to be the dotpath of a PromptTemplateInterface class. If a dictionary, it is assumed to be a custom prompt template mapping role to the prepend/append tags.
Note
This tokenizer includes typical LM BOS and EOS tokens like <s>, </s>, and <unk>. However, to support chat completion, it is also augmented with special tokens like <|endoftext|> and <|assistant|>.
- Returns:
Instantiation of the SPM tokenizer.
- Return type:
Phi3MiniTokenizer
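As noted above, a dictionary passed as prompt_template maps each role to a (prepend, append) tag pair. A minimal sketch of building such a mapping (the tag strings below are illustrative placeholders, not necessarily the canonical Phi-3 chat tags; the commented-out builder call assumes local tokenizer files):

```python
# Custom prompt template: role -> (prepend_tag, append_tag).
# Tag strings are illustrative placeholders, not necessarily the
# canonical Phi-3 chat-format tags.
prompt_template = {
    "system": ("<|system|>\n", "<|end|>\n"),
    "user": ("<|user|>\n", "<|end|>\n"),
    "assistant": ("<|assistant|>\n", "<|end|>\n"),
}

# Passing the mapping to the builder requires torchtune and local
# tokenizer files, so it is shown here for reference only:
# from torchtune.models.phi3 import phi3_mini_tokenizer
# tokenizer = phi3_mini_tokenizer(
#     path="/path/to/tokenizer.model",   # placeholder path to the SPM model
#     max_seq_len=4096,
#     prompt_template=prompt_template,
# )
```

Each message of a given role is then wrapped with its prepend tag before the content and its append tag after it when the prompt is formatted.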