TikTokenTokenizer¶
- class torchtune.modules.tokenizers.TikTokenTokenizer(path: str, *, name: str = 'llama3_tiktoken', pattern: str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+", all_special_tokens: Optional[List[str]] = None, bos_token: str = '<|begin_of_text|>', eos_token: str = '<|end_of_text|>', start_header_id: str = '<|start_header_id|>', end_header_id: str = '<|end_header_id|>', step_id: str = '<|step_id|>', eom_id: str = '<|eom_id|>', eot_id: str = '<|eot_id|>', python_tag: str = '<|python_tag|>')[source]¶
A wrapper around tiktoken Encoding.
- Parameters:
path (str) – Path to pretrained tokenizer checkpoint file.
name (str) – Name of the tokenizer (used by tiktoken for identification).
pattern (str) – Regex pattern used for string parsing.
all_special_tokens (Optional[List[str]]) – List of all special tokens. The first element must be the BOS token, the second element must be the EOS token, and the final element must be the Python tag. All elements must be unique, and the list may contain at most 256 tokens. Default: None (will use ALL_SPECIAL_TOKENS)
bos_token (str) – Beginning of sequence token. Defaults to BEGIN_OF_TEXT.
eos_token (str) – End of sequence token. Defaults to END_OF_TEXT.
start_header_id (str) – Start header token. Defaults to START_HEADER_ID.
end_header_id (str) – End header token. Defaults to END_HEADER_ID.
step_id (str) – Step token. Defaults to STEP_ID.
eom_id (str) – End of message token. Defaults to EOM_ID.
eot_id (str) – End of turn token. Defaults to EOT_ID.
python_tag (str) – Python tag token. Defaults to PYTHON_TAG.
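The pattern argument above is the pre-tokenization regex that splits raw text into chunks before BPE encoding. As a rough illustration of how it splits text, here is an ASCII-simplified variant (the real pattern uses the Unicode classes \p{L}/\p{N} and a case-insensitive contraction group, which the standard library re module does not fully support):

```python
import re

# ASCII-simplified sketch of the pre-tokenization split pattern; the real
# pattern uses \p{L}/\p{N} Unicode classes and (?i:...) for contractions.
SPLIT_PATTERN = (
    r"'s|'t|'re|'ve|'m|'ll|'d"       # common English contractions
    r"|[^\r\nA-Za-z0-9]?[A-Za-z]+"   # optional leading symbol + letters
    r"|[0-9]{1,3}"                   # digits, at most three at a time
    r"| ?[^\sA-Za-z0-9]+[\r\n]*"     # punctuation runs, optional space
    r"|\s*[\r\n]+"                   # newline runs
    r"|\s+(?!\S)|\s+"                # remaining whitespace
)

pieces = re.findall(SPLIT_PATTERN, "Hello world, I'll add 12345!")
print(pieces)
# ['Hello', ' world', ',', ' I', "'ll", ' add', ' ', '123', '45', '!']
```

Note how leading spaces stay attached to words and digits are split into groups of at most three, which keeps the vocabulary compact.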
- decode(token_ids: List[int], truncate_at_eos: bool = True) str [source]¶
Decode a list of token ids into a string. If truncate_at_eos is True, tokens from the first EOS token onward are discarded before decoding.
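A toy sketch of the truncate_at_eos behavior, using made-up token ids (the real method delegates the decoding itself to the underlying tiktoken Encoding):

```python
def truncate_at_first_eos(token_ids, eos_id, truncate_at_eos=True):
    # Mirror of the documented option: when enabled, drop everything
    # from the first EOS token onward before decoding.
    if truncate_at_eos and eos_id in token_ids:
        return token_ids[: token_ids.index(eos_id)]
    return token_ids

EOS_ID = 128001  # hypothetical id for <|end_of_text|>
ids = [10, 11, EOS_ID, 12]
print(truncate_at_first_eos(ids, EOS_ID))         # [10, 11]
print(truncate_at_first_eos(ids, EOS_ID, False))  # [10, 11, 128001, 12]
```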
- encode(text: str, add_bos: bool, add_eos: bool) List[int] [source]¶
Encode a string into a list of token ids, optionally prepending the BOS token (add_bos) and appending the EOS token (add_eos). Assumes that the string contains no special tokens.
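A minimal sketch of what the add_bos and add_eos flags control, with a stand-in encoder in place of the real tiktoken model (all ids here are made up):

```python
BOS_ID, EOS_ID = 128000, 128001  # hypothetical special-token ids

def encode(text, add_bos, add_eos, base_encode=lambda t: [ord(c) for c in t]):
    # base_encode stands in for the underlying tiktoken BPE encoding.
    ids = base_encode(text)
    if add_bos:
        ids = [BOS_ID] + ids
    if add_eos:
        ids = ids + [EOS_ID]
    return ids

print(encode("hi", add_bos=True, add_eos=True))  # [128000, 104, 105, 128001]
```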
- tokenize_message(message: Message, tokenize_header: bool = False) List[int] [source]¶
Tokenize a message into a list of token ids. If tokenize_header is True, the message header (the role wrapped in the start/end header tokens) is tokenized as well.
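The special tokens in the constructor come from the Llama 3 chat format, where each message is preceded by a role header and closed with an end-of-turn marker. A string-level sketch of that layout (the real method returns token ids, not text, and the exact template is an assumption based on the Llama 3 format):

```python
def format_message(role, content, tokenize_header=False):
    # String-level illustration only; TikTokenTokenizer.tokenize_message
    # returns token ids. Template follows the Llama 3 chat layout.
    header = f"<|start_header_id|>{role}<|end_header_id|>\n\n" if tokenize_header else ""
    return f"{header}{content.strip()}<|eot_id|>"

print(format_message("user", "Hello!", tokenize_header=True))
# <|start_header_id|>user<|end_header_id|>
#
# Hello!<|eot_id|>
```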