TikTokenBaseTokenizer¶
- class torchtune.modules.tokenizers.TikTokenBaseTokenizer(path: str, name: str, pattern: str, bos_id: int, eos_id: int, special_tokens: Dict[str, int])[source]¶
A lightweight wrapper around tiktoken Encoding. This class additionally handles breaking up the input text into substrings of a max length and splitting up long repetitions to improve encode speed.
- Parameters:
path (str) – Path to pretrained tokenizer checkpoint file.
name (str) – Name of the tokenizer (used by tiktoken for identification).
pattern (str) – Regex pattern used to split input text into chunks before passing to byte-pair encoding.
bos_id (int) – beginning-of-sequence token id. This can be present or absent in special_tokens.
eos_id (int) – end-of-sequence token id. This can be present or absent in special_tokens.
special_tokens (Dict[str, int]) – Mapping of special tokens to their ids.
Examples
>>> tokenizer = TikTokenBaseTokenizer("/path/to/tt_model")
>>> tokenized_text = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> print(tokenized_text)
[1, 31587, 29644, 102, 2]
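The chunking behavior described above (capping substring length and breaking up long character repetitions before byte-pair encoding) can be sketched in plain Python. This is an illustrative approximation, not torchtune's actual implementation: the constants MAX_ENCODE_CHARS and MAX_NO_WHITESPACE_CHARS and the helper names are assumptions made for the example.

```python
# Hypothetical sketch of pre-encode chunking; the real wrapper passes each
# chunk to tiktoken's Encoding.encode and concatenates the resulting ids.

MAX_ENCODE_CHARS = 400         # assumed max substring length per encode call
MAX_NO_WHITESPACE_CHARS = 25   # assumed max run of a repeated non-space char


def split_long_repetitions(s: str, max_consecutive: int):
    """Yield pieces of ``s`` so that no run of identical non-whitespace
    characters exceeds ``max_consecutive`` (long runs slow down BPE)."""
    start = 0
    run_char, run_len = None, 0
    for i, ch in enumerate(s):
        if ch == run_char and not ch.isspace():
            run_len += 1
            if run_len > max_consecutive:
                yield s[start:i]
                start = i
                run_len = 1
        else:
            run_char, run_len = ch, 1
    yield s[start:]


def chunk_text(text: str):
    """Cut text into fixed-size windows, then break up long repetitions."""
    for i in range(0, len(text), MAX_ENCODE_CHARS):
        window = text[i : i + MAX_ENCODE_CHARS]
        yield from split_long_repetitions(window, MAX_NO_WHITESPACE_CHARS)


# A 60-character run of "a" is split into pieces of at most 25 characters;
# joining the chunks always reproduces the original text.
chunks = list(chunk_text("a" * 60 + " hello"))
```

Because the chunks partition the input exactly, encoding each chunk separately and concatenating the ids yields the same text on round-trip; only the BPE merge boundaries at chunk edges can differ from a single whole-string encode.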
- decode(token_ids: List[int], truncate_at_eos: bool = True, skip_special_tokens: bool = True) str [source]¶
Decode a list of token ids into a string.
- Parameters:
token_ids (List[int]) – List of token ids to decode.
truncate_at_eos (bool) – Whether to truncate the output at the end-of-sequence token. Default: True.
skip_special_tokens (bool) – Whether to omit special tokens from the decoded string. Default: True.
- Returns:
The decoded string.
- Return type:
str
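The effect of the two decode flags can be illustrated with a small self-contained sketch. This is not torchtune's implementation; the toy EOS_ID, SPECIAL_IDS, and vocab mapping are assumptions made purely to show the flag semantics.

```python
# Hypothetical illustration of decode's flags (not torchtune's code).
EOS_ID = 2
SPECIAL_IDS = {0, 1, 2}  # assumed bos/eos/pad-style special token ids


def decode_ids(token_ids, vocab, truncate_at_eos=True, skip_special_tokens=True):
    """Mimic decode(): optionally stop at the first EOS, optionally drop
    special tokens, then map the remaining ids to text."""
    if truncate_at_eos and EOS_ID in token_ids:
        token_ids = token_ids[: token_ids.index(EOS_ID)]
    if skip_special_tokens:
        token_ids = [t for t in token_ids if t not in SPECIAL_IDS]
    return "".join(vocab[t] for t in token_ids)


vocab = {3: "Hello", 4: " world!"}
# Truncates at the EOS id (2) and drops the leading BOS id (1):
decoded = decode_ids([1, 3, 4, 2, 3], vocab)  # → "Hello world!"
```

With truncate_at_eos=False the trailing token after EOS would also be decoded, which is useful when inspecting raw generations that contain multiple sequences.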