TikTokenBaseTokenizer¶
- class torchtune.modules.tokenizers.TikTokenBaseTokenizer(path: str, name: str, pattern: str, bos_id: int, eos_id: int, special_tokens: Dict[str, int])[source]¶
A lightweight wrapper around tiktoken Encoding. This class additionally handles breaking up the input text into substrings of a max length and splitting up long repetitions to improve encode speed.
- Parameters:
path (str) – Path to pretrained tokenizer checkpoint file.
name (str) – Name of the tokenizer (used by tiktoken for identification).
pattern (str) – Regex pattern used to split input text into chunks before passing to byte-pair encoding.
bos_id (int) – Beginning-of-sequence token id. This can be present or absent in special_tokens.
eos_id (int) – End-of-sequence token id. This can be present or absent in special_tokens.
special_tokens (Dict[str, int]) – Mapping of special tokens to their ids.
Examples
>>> tokenizer = TikTokenBaseTokenizer("/path/to/tt_model")
>>> tokenized_text = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> print(tokenized_text)
[1, 31587, 29644, 102, 2]
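The example above omits most of the constructor arguments shown in the signature. As a rough sketch, a full instantiation might look like the following, where the tokenizer name, regex pattern (a GPT-2-style split pattern), token ids, and special-token mapping are illustrative placeholders rather than values prescribed by torchtune:

>>> from torchtune.modules.tokenizers import TikTokenBaseTokenizer
>>> # All concrete values below are hypothetical placeholders for illustration only.
>>> special_tokens = {"<|begin_of_text|>": 0, "<|end_of_text|>": 1}
>>> pattern = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
>>> tokenizer = TikTokenBaseTokenizer(
...     path="/path/to/tt_model",          # pretrained tokenizer checkpoint file
...     name="tt_model",                   # name passed through to tiktoken
...     pattern=pattern,                   # regex used to pre-split text before BPE
...     bos_id=special_tokens["<|begin_of_text|>"],
...     eos_id=special_tokens["<|end_of_text|>"],
...     special_tokens=special_tokens,
... )
>>> token_ids = tokenizer.encode("Hello world!", add_bos=True, add_eos=True)
>>> text = tokenizer.decode(token_ids)

Because bos_id and eos_id are passed explicitly, they may but need not appear in the special_tokens mapping; the mapping above includes them only to keep the placeholder ids in one place.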