tokenize_messages_no_special_tokens¶
- torchtune.modules.tokenizers.tokenize_messages_no_special_tokens(tokenizer: ModelTokenizer, messages: List[Message], *, bos_id: Optional[int] = None, eos_id: Optional[int] = None) → Tuple[List[int], List[bool]] [source]¶
Tokenize a list of messages one at a time, then concatenate them, returning a list of tokens and a list of masks. Does not add any special tokens except for BOS and EOS (if provided). This serves as a common starting point for model tokenizers that do not rely heavily on special tokens.
Examples
>>> messages = [
...     Message(role="system", content="system message\n", masked=True),
...     Message(role="user", content="user prompt\n", masked=True),
...     Message(role="assistant", content="assistant response\n"),
... ]
>>> # tokenize_messages encodes messages separately and concats
>>> tokens = tokenize_messages_no_special_tokens(
...     tokenizer,
...     messages,
...     bos_id=tokenizer.bos_id,
...     eos_id=tokenizer.eos_id,
... )[0]
>>> print(tokens)
[1, 1788, 2643, 13, 1792, 9508, 13, 465, 22137, 2933, 2]
>>> # Same result as encoding the full string in one go
>>> print(tokenizer.encode(''.join([message.content for message in messages])))
[1, 1788, 2643, 13, 1792, 9508, 13, 465, 22137, 2933, 2]
- Parameters:
tokenizer (ModelTokenizer) – Tokenizer to encode messages with.
messages (List[Message]) – A list of messages, each containing role, content, and masked attributes.
bos_id (Optional[int]) – Beginning-of-sequence token id. If None, no BOS token will be added. Default None.
eos_id (Optional[int]) – End-of-sequence token id. If None, no EOS token will be added. Default None.
- Returns:
The tokenized messages and a boolean mask per token (True where the token should be masked from the loss).
- Return type:
Tuple[List[int], List[bool]]
- Raises:
RuntimeError – if any message in messages does not satisfy message['type'] == 'text'.
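To make the per-message tokenize-then-concatenate behavior concrete, below is a minimal sketch of the loop the docstring describes. The `Message` and `ToyTokenizer` classes here are simplified stand-ins invented for illustration (torchtune's real `Message` and `ModelTokenizer` have richer interfaces), and the mask handling for the BOS/EOS tokens follows the convention of inheriting the first and last message's `masked` flag, which is an assumption of this sketch.

```python
from typing import List, NamedTuple, Optional, Tuple


class Message(NamedTuple):
    """Simplified stand-in for torchtune's Message."""
    role: str
    content: str
    masked: bool = False


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a ModelTokenizer."""

    bos_id = 1
    eos_id = 2

    def __init__(self) -> None:
        self.vocab: dict = {}

    def encode(self, text: str) -> List[int]:
        # Assign ids on first sight; ids 0-2 are reserved for specials.
        return [self.vocab.setdefault(w, len(self.vocab) + 3) for w in text.split()]


def tokenize_messages_sketch(
    tokenizer: ToyTokenizer,
    messages: List[Message],
    *,
    bos_id: Optional[int] = None,
    eos_id: Optional[int] = None,
) -> Tuple[List[int], List[bool]]:
    tokens: List[int] = []
    mask: List[bool] = []
    if bos_id is not None:
        # BOS inherits the first message's mask flag (assumption of this sketch).
        tokens.append(bos_id)
        mask.append(messages[0].masked)
    for message in messages:
        # Encode each message independently, then concatenate.
        ids = tokenizer.encode(message.content)
        tokens.extend(ids)
        mask.extend([message.masked] * len(ids))
    if eos_id is not None:
        # EOS inherits the last message's mask flag.
        tokens.append(eos_id)
        mask.append(messages[-1].masked)
    return tokens, mask


tok = ToyTokenizer()
msgs = [
    Message(role="system", content="system message", masked=True),
    Message(role="user", content="user prompt", masked=True),
    Message(role="assistant", content="assistant response", masked=False),
]
tokens, mask = tokenize_messages_sketch(
    tok, msgs, bos_id=tok.bos_id, eos_id=tok.eos_id
)
```

Because prompt tokens are masked (`True`) while the assistant's response tokens are not, a training loop can use the mask list directly to exclude prompt positions from the loss.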