
SentencePieceTokenizer

class torchtune.modules.tokenizers.SentencePieceTokenizer(path: str)[source]

A wrapper around SentencePieceProcessor.

Parameters:

path (str) – Path to pretrained tokenizer file.

Example

# Accepts only non-batched input for now
>>> tokenizer = SentencePieceTokenizer("/path/to/spm_model")
>>> tokenized_text = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> print(tokenized_text)
[1, 31587, 29644, 102, 2]

decode(ids: List[int]) → str[source]

Decode token IDs into a string.

Parameters:

ids (List[int]) – The input token IDs to be decoded.

Returns:

The decoded text.

Return type:

str
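
Example

A minimal round-trip sketch, assuming the same "/path/to/spm_model" file as the class example above; the exact behavior of control tokens depends on the underlying SentencePiece model.

>>> tokenizer = SentencePieceTokenizer("/path/to/spm_model")
>>> ids = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
# Control tokens such as BOS/EOS typically decode to nothing
>>> tokenizer.decode(ids)
'Hello world!'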

encode(text: str, add_bos: bool = True, add_eos: bool = True, trim_leading_whitespace: bool = False, prefix: Optional[str] = None) → List[int][source]

Encode text into token IDs.

Parameters:
  • text (str) – The input text to be encoded, unbatched.

  • add_bos (bool) – Whether to prepend BOS to the input, defaults to True.

  • add_eos (bool) – Whether to append EOS to the input, defaults to True.

  • trim_leading_whitespace (bool) – Whether to trim leading whitespace from the underlying SentencePiece tokenization. SentencePiece normally prepends whitespace to any tokenized text, which can cause mismatches where encode(s1) + encode(s2) != encode(s1 + s2) due to the leading whitespace added to s2; see the sketch after this method. Default: False

  • prefix (Optional[str]) – Optional known string that is encoded before the text and sliced off, used to trim leading whitespace. Only used if trim_leading_whitespace=True. Default: None

Returns:

The encoded token IDs.

Return type:

List[int]
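
Example

A sketch of the whitespace mismatch that trim_leading_whitespace and prefix address. The comparisons are illustrative (whether the encodings actually differ depends on the model file), and the prefix value "pre" is an arbitrary assumption; any short, known string works.

>>> s1, s2 = "Hello ", "world!"
# SentencePiece prepends whitespace to s2, so the naive split encoding
# may not equal encoding the concatenated string:
>>> naive = tokenizer.encode(s1, add_eos=False) + tokenizer.encode(s2, add_bos=False)
>>> naive == tokenizer.encode(s1 + s2)
False
# Encoding a known prefix together with s2 and slicing it off trims the
# added whitespace, restoring the equality:
>>> trimmed = tokenizer.encode(s1, add_eos=False) + tokenizer.encode(
...     s2, add_bos=False, trim_leading_whitespace=True, prefix="pre"
... )
>>> trimmed == tokenizer.encode(s1 + s2)
True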

tokenize_messages(messages: List[Message], max_seq_len: Optional[int] = None) → Tuple[List[int], List[bool]][source]

Tokenize a list of messages one at a time, then concatenate them, returning a list of tokens and a list of masks.

Note: Llama2's SentencePiece model handles whitespace such that, in general, encode(s1 + s2) != encode(s1) + encode(s2). We work around this by prepending s2 with a known token and slicing the beginning off the tokenized s2.

Example

>>> tokenizer = SentencePieceTokenizer(tokenizer_path)
>>> messages = [
...     Message(role="system", content="system message\n", masked=True),
...     Message(role="user", content="user prompt\n", masked=True),
...     Message(role="assistant", content="assistant response\n"),
... ]
# tokenize_messages encodes messages separately and concatenates them
>>> tokenizer.tokenize_messages(messages, max_seq_len)[0]
[1, 1788, 2643, 13, 1792, 9508, 13, 465, 22137, 2933, 2]

# Same result as encoding the full string in one go
>>> tokenizer.encode(''.join([message.content for message in messages]))
[1, 1788, 2643, 13, 1792, 9508, 13, 465, 22137, 2933, 2]

Parameters:
  • messages (List[Message]) – A list of messages, each containing role, content, and masked attributes.

  • max_seq_len (Optional[int]) – A max sequence length to truncate tokens to. Default: None

Returns:

The tokenized messages and a boolean mask per token.

Return type:

Tuple[List[int], List[bool]]
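
A sketch of consuming the returned mask, continuing the example above. The -100 label value is an assumption, borrowed from the common cross-entropy ignore-index convention rather than from this API.

>>> tokens, mask = tokenizer.tokenize_messages(messages, max_seq_len)
>>> len(tokens) == len(mask)
True
# mask[i] is True for tokens that came from messages with masked=True
# (the system and user messages here), so the training loss can be
# restricted to the assistant response:
>>> labels = [-100 if masked else tok for tok, masked in zip(tokens, mask)]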
