.. _tokenizers_usage_label:

==========
Tokenizers
==========

Tokenizers are a key component of any LLM. They convert raw text into token IDs, which index into embedding vectors that are
understood by the model.

In torchtune, tokenizers play the role of converting :class:`~torchtune.data.Message` objects into token IDs and any necessary
model-specific special tokens.

.. code-block:: python

    from torchtune.data import Message
    from torchtune.models.phi3 import phi3_mini_tokenizer

    sample = {
        "input": "user prompt",
        "output": "model response",
    }

    msgs = [
        Message(role="user", content=sample["input"]),
        Message(role="assistant", content=sample["output"])
    ]

    p_tokenizer = phi3_mini_tokenizer("/tmp/Phi-3-mini-4k-instruct/tokenizer.model")
    tokens, mask = p_tokenizer.tokenize_messages(msgs)
    print(tokens)
    # [1, 32010, 29871, 13, 1792, 9508, 32007, 29871, 13, 32001, 29871, 13, 4299, 2933, 32007, 29871, 13]

    print(p_tokenizer.decode(tokens))
    # '\nuser prompt \n \nmodel response \n'

Model tokenizers are usually based on an underlying byte-pair encoding algorithm, such as SentencePiece or TikToken, which are
both supported in torchtune.

Downloading tokenizers from Hugging Face
----------------------------------------

Models hosted on Hugging Face are also distributed with the tokenizers they were trained with. These are automatically downloaded
alongside model weights when using ``tune download``. For example, this command downloads the Mistral-7B model weights and tokenizer:

.. code-block:: bash

    tune download mistralai/Mistral-7B-v0.1 --output-dir /tmp/Mistral-7B-v0.1 --hf-token <HF_TOKEN>
    cd /tmp/Mistral-7B-v0.1/
    ls tokenizer.model
    # tokenizer.model

Loading tokenizers from file
----------------------------

Once you've downloaded the tokenizer file, you can load it into the corresponding tokenizer class by passing the file path of the
tokenizer model in your config or in the constructor. If you've downloaded it to a different location, simply point to that path instead.

.. code-block:: python

    # In code
    from torchtune.models.mistral import mistral_tokenizer

    m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")
    type(m_tokenizer)
    # <class 'torchtune.models.mistral._tokenizer.MistralTokenizer'>

.. code-block:: yaml

    # In config
    tokenizer:
      _component_: torchtune.models.mistral.mistral_tokenizer
      path: /tmp/Mistral-7B-v0.1/tokenizer.model

Setting max sequence length
---------------------------

Setting the maximum sequence length gives you control over memory usage and helps you adhere to model specifications.

.. code-block:: python

    # In code
    from torchtune.models.mistral import mistral_tokenizer

    m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model", max_seq_len=8192)

    # Set an arbitrarily small seq len for demonstration
    from torchtune.data import Message

    m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model", max_seq_len=7)
    msg = Message(role="user", content="hello world")
    tokens, mask = m_tokenizer.tokenize_messages([msg])
    print(len(tokens))
    # 7
    print(tokens)
    # [1, 733, 16289, 28793, 6312, 28709, 2]
    print(m_tokenizer.decode(tokens))
    # '[INST] hello'

.. code-block:: yaml

    # In config
    tokenizer:
      _component_: torchtune.models.mistral.mistral_tokenizer
      path: /tmp/Mistral-7B-v0.1/tokenizer.model
      max_seq_len: 8192

Prompt templates
----------------

Prompt templates are enabled by passing one into any model tokenizer. See :ref:`prompt_templates_usage_label` for more details.
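As a minimal sketch, the template can be specified in code or in config. This assumes the tokenizer builder exposes a
``prompt_template`` argument that accepts a dotpath string and that ``MistralChatTemplate`` is the template you want; see
:ref:`prompt_templates_usage_label` for the definitive usage.

.. code-block:: python

    # In code (illustrative sketch; the ``prompt_template`` argument is an assumption here)
    from torchtune.models.mistral import mistral_tokenizer

    m_tokenizer = mistral_tokenizer(
        "/tmp/Mistral-7B-v0.1/tokenizer.model",
        prompt_template="torchtune.models.mistral.MistralChatTemplate",
    )

.. code-block:: yaml

    # In config (illustrative sketch)
    tokenizer:
      _component_: torchtune.models.mistral.mistral_tokenizer
      path: /tmp/Mistral-7B-v0.1/tokenizer.model
      prompt_template: torchtune.models.mistral.MistralChatTemplate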
Special tokens
--------------

Special tokens are model-specific tags that are required to prompt the model. They are different from prompt templates
because they are assigned their own unique token IDs. For an extended discussion on the difference between special tokens
and prompt templates, see :ref:`prompt_templates_usage_label`.

Special tokens are automatically added to your data by the model tokenizer and do not require any additional configuration
from you. You also have the ability to customize the special tokens for experimentation by passing in a file path to a JSON
file containing the new special tokens mapping. This will NOT modify the underlying ``tokenizer.model`` to support the new
special token ids - it is your responsibility to ensure that the tokenizer file encodes them correctly. Note also that some
models require the presence of certain special tokens for proper usage, such as the ``"<|eot_id|>"`` token in Llama3 Instruct.

For example, here we change the ``"<|begin_of_text|>"`` and ``"<|end_of_text|>"`` token IDs in Llama3 Instruct:

.. code-block:: python

    # tokenizer/special_tokens.json
    {
        "added_tokens": [
            {
                "id": 128257,
                "content": "<|begin_of_text|>"
            },
            {
                "id": 128258,
                "content": "<|end_of_text|>"
            },
            # Remaining required special tokens
            ...
        ]
    }

.. code-block:: python

    # In code
    from torchtune.models.llama3 import llama3_tokenizer

    tokenizer = llama3_tokenizer(
        path="/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model",
        special_tokens_path="tokenizer/special_tokens.json",
    )
    print(tokenizer.special_tokens)
    # {'<|begin_of_text|>': 128257, '<|end_of_text|>': 128258, ...}

.. code-block:: yaml

    # In config
    tokenizer:
      _component_: torchtune.models.llama3.llama3_tokenizer
      path: /tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model
      special_tokens_path: tokenizer/special_tokens.json

.. _base_tokenizers:

Base tokenizers
---------------

Base tokenizers (:class:`~torchtune.modules.transforms.tokenizers.BaseTokenizer`) are the underlying byte-pair encoding modules
that perform the actual raw string to token ID conversion and back. In torchtune, they are required to implement ``encode`` and
``decode`` methods, which are called by the :ref:`model_tokenizers` to convert between raw text and token IDs.

.. code-block:: python

    class BaseTokenizer(Protocol):

        def encode(self, text: str, **kwargs: Dict[str, Any]) -> List[int]:
            """
            Given a string, return the encoded list of token ids.

            Args:
                text (str): The text to encode.
                **kwargs (Dict[str, Any]): kwargs.

            Returns:
                List[int]: The encoded list of token ids.
            """
            pass

        def decode(self, token_ids: List[int], **kwargs: Dict[str, Any]) -> str:
            """
            Given a list of token ids, return the decoded text, optionally including special tokens.

            Args:
                token_ids (List[int]): The list of token ids to decode.
                **kwargs (Dict[str, Any]): kwargs.

            Returns:
                str: The decoded text.
            """
            pass

If you load any of the :ref:`model_tokenizers`, you can see that it calls its underlying
:class:`~torchtune.modules.transforms.tokenizers.BaseTokenizer` to do the actual encoding and decoding.

.. code-block:: python

    from torchtune.models.mistral import mistral_tokenizer
    from torchtune.modules.transforms.tokenizers import SentencePieceBaseTokenizer

    m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")
    # Mistral uses SentencePiece for its underlying BPE
    sp_tokenizer = SentencePieceBaseTokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")

    text = "hello world"

    print(m_tokenizer.encode(text))
    # [1, 6312, 28709, 1526, 2]
    print(sp_tokenizer.encode(text))
    # [1, 6312, 28709, 1526, 2]
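Because ``BaseTokenizer`` is a protocol, any class with matching ``encode`` and ``decode`` methods can satisfy it. As a toy
illustration of that contract (this class is hypothetical and not part of torchtune), here is a minimal whitespace tokenizer:

.. code-block:: python

    # A toy, hypothetical tokenizer that satisfies the BaseTokenizer protocol.
    # It only illustrates the encode/decode contract and is not suitable for real models.
    from typing import Any, Dict, List


    class WhitespaceBaseTokenizer:
        def __init__(self, vocab: List[str]):
            # Map each known word to a unique token id; id 0 is reserved for unknown words
            self.token_to_id = {word: i + 1 for i, word in enumerate(vocab)}
            self.id_to_token = {i: word for word, i in self.token_to_id.items()}

        def encode(self, text: str, **kwargs: Dict[str, Any]) -> List[int]:
            # Split on whitespace and look up each word, falling back to 0 for unknowns
            return [self.token_to_id.get(word, 0) for word in text.split()]

        def decode(self, token_ids: List[int], **kwargs: Dict[str, Any]) -> str:
            # Join the recovered words back together, emitting <unk> for unknown ids
            return " ".join(self.id_to_token.get(i, "<unk>") for i in token_ids)


    tokenizer = WhitespaceBaseTokenizer(vocab=["hello", "world"])
    print(tokenizer.encode("hello world"))
    # [1, 2]
    print(tokenizer.decode([1, 2]))
    # 'hello world'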
.. _hf_tokenizers:

Using Hugging Face tokenizers
-----------------------------

Sometimes tokenizers hosted on Hugging Face do not contain files compatible with one of torchtune's existing tokenizer classes.
In this case, we provide :class:`~torchtune.modules.transforms.tokenizers.HuggingFaceBaseTokenizer` to parse the Hugging Face
``tokenizer.json`` file and define the correct ``encode`` and ``decode`` methods to match torchtune's other
:class:`~torchtune.modules.transforms.tokenizers.BaseTokenizer` classes. You should also pass the path to either
``tokenizer_config.json`` or ``generation_config.json``, which allows torchtune to infer the BOS and EOS tokens.
Continuing with the Mistral example:

.. code-block:: python

    from torchtune.modules.transforms.tokenizers import HuggingFaceBaseTokenizer

    hf_tokenizer = HuggingFaceBaseTokenizer(
        tokenizer_json_path="/tmp/Mistral-7B-v0.1/tokenizer.json",
        tokenizer_config_json_path="/tmp/Mistral-7B-v0.1/tokenizer_config.json",
    )

    text = "hello world"
    print(hf_tokenizer.encode(text))
    # [1, 6312, 28709, 1526, 2]

.. _model_tokenizers:

Model tokenizers
----------------

Model tokenizers (:class:`~torchtune.modules.transforms.tokenizers.ModelTokenizer`) are specific to a particular model. They are
required to implement the ``tokenize_messages`` method, which converts a list of Messages into a list of token IDs.

.. code-block:: python

    class ModelTokenizer(Protocol):

        special_tokens: Dict[str, int]
        max_seq_len: Optional[int]

        def tokenize_messages(
            self, messages: List[Message], **kwargs: Dict[str, Any]
        ) -> Tuple[List[int], List[bool]]:
            """
            Given a list of messages, return a list of tokens and list of masks for
            the concatenated and formatted messages.

            Args:
                messages (List[Message]): The list of messages to tokenize.
                **kwargs (Dict[str, Any]): kwargs.

            Returns:
                Tuple[List[int], List[bool]]: The list of token ids and the list of masks.
            """
            pass

The reason they are model-specific and different from :ref:`base_tokenizers` is that they add all the necessary special tokens or
prompt templates required to prompt the model.

.. code-block:: python

    from torchtune.models.mistral import mistral_tokenizer
    from torchtune.modules.transforms.tokenizers import SentencePieceBaseTokenizer
    from torchtune.data import Message

    m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")
    # Mistral uses SentencePiece for its underlying BPE
    sp_tokenizer = SentencePieceBaseTokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")

    text = "hello world"
    msg = Message(role="user", content=text)

    tokens, mask = m_tokenizer.tokenize_messages([msg])
    print(tokens)
    # [1, 733, 16289, 28793, 6312, 28709, 1526, 28705, 733, 28748, 16289, 28793]
    print(sp_tokenizer.encode(text))
    # [1, 6312, 28709, 1526, 2]
    print(m_tokenizer.decode(tokens))
    # [INST] hello world [/INST]
    print(sp_tokenizer.decode(sp_tokenizer.encode(text)))
    # hello world
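To make the ``ModelTokenizer`` contract concrete, here is a minimal, hypothetical implementation that wraps any
``BaseTokenizer``-style object, concatenates messages, and adds toy BOS/EOS ids. This is only a sketch of the protocol and is
not how any torchtune tokenizer is actually implemented; it assumes ``Message.text_content`` and ``Message.masked`` as the
message text and loss-masking flag.

.. code-block:: python

    # A hypothetical, minimal ModelTokenizer sketch. Real torchtune tokenizers add
    # model-specific special tokens and prompt templates; this only shows the contract.
    from typing import Any, Dict, List, Optional, Tuple

    from torchtune.data import Message


    class ToyModelTokenizer:
        special_tokens: Dict[str, int] = {"<bos>": 1, "<eos>": 2}
        max_seq_len: Optional[int] = None

        def __init__(self, base_tokenizer):
            # base_tokenizer is anything that satisfies the BaseTokenizer protocol
            self.base_tokenizer = base_tokenizer

        def tokenize_messages(
            self, messages: List[Message], **kwargs: Dict[str, Any]
        ) -> Tuple[List[int], List[bool]]:
            # In this sketch, the special tokens are always masked (True) for simplicity
            tokens: List[int] = [self.special_tokens["<bos>"]]
            mask: List[bool] = [True]
            for message in messages:
                encoded = self.base_tokenizer.encode(message.text_content)
                tokens.extend(encoded)
                # True marks tokens that should be masked out of the loss for this message
                mask.extend([message.masked] * len(encoded))
            tokens.append(self.special_tokens["<eos>"])
            mask.append(True)
            return tokens, mask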