tokenize_messages_no_special_tokens

torchtune.modules.tokenizers.tokenize_messages_no_special_tokens(tokenizer: ModelTokenizer, messages: List[Message], bos_id: int, eos_id: int, max_seq_len: Optional[int] = None) → Tuple[List[int], List[bool]]

Tokenize a list of messages one at a time, then concatenate them, returning a list of token ids and a list of masks. Does not add any special tokens except for BOS and EOS. This serves as a common starting point for model tokenizers that do not rely heavily on special tokens.

Examples

>>> messages = [
...     Message(role="system", content="system message\n", masked=True),
...     Message(role="user", content="user prompt\n", masked=True),
...     Message(role="assistant", content="assistant response\n"),
... ]
# tokenize_messages encodes messages separately and concats
>>> tokens = tokenize_messages_no_special_tokens(
...     tokenizer,
...     messages,
...     tokenizer.bos_id,
...     tokenizer.eos_id,
...     max_seq_len
... )[0]
>>> print(tokens)
[1, 1788, 2643, 13, 1792, 9508, 13, 465, 22137, 2933, 2]
# Same result as encoding the full string in one go
>>> print(tokenizer.encode(''.join([message.content for message in messages])))
[1, 1788, 2643, 13, 1792, 9508, 13, 465, 22137, 2933, 2]
Parameters:
  • tokenizer (ModelTokenizer) – Tokenizer used to encode each message's content.

  • messages (List[Message]) – A list of messages, each containing role, content, and masked attributes.

  • bos_id (int) – Beginning-of-sequence token id, prepended to the tokenized messages.

  • eos_id (int) – End-of-sequence token id, appended to the tokenized messages.

  • max_seq_len (Optional[int]) – A max sequence length to truncate tokens to. Default: None

Returns:

The tokenized messages: a list of token ids and a parallel list of booleans indicating whether each token is masked.

Return type:

Tuple[List[int], List[bool]]
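The behavior described above can be sketched in plain Python. This is a simplified illustration, not torchtune's implementation: the `encode` callable, the toy vocabulary, and the exact mask handling for BOS/EOS are assumptions made here for demonstration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Message:
    # Minimal stand-in for torchtune's Message, for illustration only.
    role: str
    content: str
    masked: bool = False


def tokenize_messages_sketch(
    encode: Callable[[str], List[int]],  # assumed: encodes text with no special tokens
    messages: List[Message],
    bos_id: int,
    eos_id: int,
    max_seq_len: Optional[int] = None,
) -> Tuple[List[int], List[bool]]:
    # Prepend BOS; its mask follows the first message's mask.
    tokens = [bos_id]
    mask = [messages[0].masked if messages else True]
    for message in messages:
        # Encode each message separately, then concatenate.
        ids = encode(message.content)
        tokens.extend(ids)
        mask.extend([message.masked] * len(ids))
    # Append EOS; its mask follows the last message's mask.
    tokens.append(eos_id)
    mask.append(messages[-1].masked if messages else True)
    # Optionally truncate both lists to max_seq_len.
    if max_seq_len is not None:
        tokens = tokens[:max_seq_len]
        mask = mask[:max_seq_len]
    return tokens, mask


# Toy word-level "tokenizer" (hypothetical vocabulary, for illustration).
vocab = {"hi": 10, "there": 11, "ok": 12}
encode = lambda s: [vocab[w] for w in s.split()]

msgs = [
    Message(role="user", content="hi there", masked=True),
    Message(role="assistant", content="ok"),
]
tokens, mask = tokenize_messages_sketch(encode, msgs, bos_id=1, eos_id=2)
print(tokens)  # [1, 10, 11, 12, 2]
print(mask)    # [True, True, True, False, False]
```

As in the example above, encoding each message separately and concatenating yields the same token ids as encoding the joined string in one pass, while additionally producing a per-token mask derived from each message's masked attribute.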
