SentencePieceBaseTokenizer

class torchtune.modules.tokenizers.SentencePieceBaseTokenizer(path: str)

A lightweight wrapper around SentencePieceProcessor that additionally handles trimming of leading whitespace.

Parameters: path (str) – Path to the pretrained tokenizer file.

Examples

>>> tokenizer = SentencePieceBaseTokenizer("/path/to/spm_model")
>>> tokenized_text = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> print(tokenized_text)
[1, 31587, 29644, 102, 2]

decode(ids: List[int]) → str

Decode token IDs into a string.

Parameters: ids (List[int]) – The input token IDs to be decoded.
Returns: The decoded text.
Return type: str
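
As a rough illustration of how decode pairs with encode, the sketch below round-trips a string through any object exposing the two methods documented on this page. The names `EncodeDecode`, `round_trip`, and `ByteTokenizer` are hypothetical and not part of torchtune. `ByteTokenizer` is a stand-in so the snippet runs without a SentencePiece model file; it does not reproduce SentencePiece behavior. With a real model the decoded text may differ slightly from the input, for example because of normalization.

```python
from typing import List, Protocol


class EncodeDecode(Protocol):
    """Hypothetical structural type: anything with the documented encode/decode methods."""

    def encode(self, text: str, add_bos: bool = True, add_eos: bool = True) -> List[int]: ...

    def decode(self, ids: List[int]) -> str: ...


def round_trip(tokenizer: EncodeDecode, text: str) -> str:
    """Encode text without BOS/EOS, then decode the IDs back into a string."""
    ids = tokenizer.encode(text, add_bos=False, add_eos=False)
    return tokenizer.decode(ids)


class ByteTokenizer:
    """Stand-in tokenizer (NOT SentencePiece): one token ID per UTF-8 byte."""

    BOS_ID = 1
    EOS_ID = 2
    OFFSET = 3  # shift byte values so they never collide with BOS/EOS

    def encode(self, text: str, add_bos: bool = True, add_eos: bool = True) -> List[int]:
        ids = [b + self.OFFSET for b in text.encode("utf-8")]
        if add_bos:
            ids = [self.BOS_ID] + ids
        if add_eos:
            ids = ids + [self.EOS_ID]
        return ids

    def decode(self, ids: List[int]) -> str:
        # Skip the special IDs and map the rest back to bytes.
        return bytes(i - self.OFFSET for i in ids if i >= self.OFFSET).decode("utf-8")


print(round_trip(ByteTokenizer(), "Hello world!"))  # Hello world!
```

In practice, a SentencePieceBaseTokenizer constructed from a real model path could be passed to `round_trip` in place of the stand-in.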

encode(text: str, add_bos: bool = True, add_eos: bool = True, trim_leading_whitespace: bool = False, prefix: Optional[str] = None) → List[int]

Encode text into token IDs.

Parameters:

text (str) – The input text to be encoded, unbatched.
add_bos (bool) – Whether to prepend BOS to the input. Default: True
add_eos (bool) – Whether to append EOS to the input. Default: True
trim_leading_whitespace (bool) – Whether to trim the leading whitespace added by the underlying SentencePiece tokenization. SentencePiece normally prepends whitespace to any tokenized text, so encode(s1) + encode(s2) != encode(s1 + s2) because of the leading whitespace added to s2. Trimming only takes effect if the underlying SentencePieceProcessor encodes whitespace. Default: False
prefix (Optional[str]) – Optional string to encode when trimming leading whitespace. Used only if trim_leading_whitespace=True. Default: None

Returns:

The encoded token IDs.

Return type:

List[int]
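
The whitespace issue described for trim_leading_whitespace can be seen with a toy model. The sketch below is not torchtune's implementation: `toy_encode` and `toy_encode_trimmed` are hypothetical helpers that imitate only SentencePiece's habit of prepending a "▁" whitespace marker, and they return character pieces rather than integer IDs. The default prefix of "\n" is likewise an assumption made for illustration. It shows one plausible way a prefix can be used for trimming: encode prefix + text, then drop the tokens that belong to the prefix.

```python
from typing import List

SPACE = "\u2581"  # the "▁" marker SentencePiece uses to represent whitespace


def toy_encode(text: str) -> List[str]:
    """Toy stand-in: prepend a dummy whitespace marker, then split into characters."""
    return list(SPACE + text.replace(" ", SPACE))


def toy_encode_trimmed(text: str, prefix: str = "\n") -> List[str]:
    """Encode prefix + text, then drop the prefix tokens (including the dummy marker)."""
    prefix_len = len(toy_encode(prefix))
    return toy_encode(prefix + text)[prefix_len:]


s1, s2 = "Hello", " world"
naive = toy_encode(s1) + toy_encode(s2)            # s2 gets an extra leading marker
trimmed = toy_encode(s1) + toy_encode_trimmed(s2)  # extra marker removed
joint = toy_encode(s1 + s2)

print(naive == joint)    # False
print(trimmed == joint)  # True
```

Because the dummy marker is attached to the prefix rather than to the text, slicing off the prefix tokens leaves the text encoded as if it were a continuation of a longer string.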
