TikTokenTokenizer

class torchtune.modules.tokenizers.TikTokenTokenizer(path: str, *, name: str = 'llama3_tiktoken', pattern: str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+", all_special_tokens: Optional[List[str]] = None, bos_token: str = '<|begin_of_text|>', eos_token: str = '<|end_of_text|>', start_header_id: str = '<|start_header_id|>', end_header_id: str = '<|end_header_id|>', step_id: str = '<|step_id|>', eom_id: str = '<|eom_id|>', eot_id: str = '<|eot_id|>', python_tag: str = '<|python_tag|>')[source]

A wrapper around tiktoken Encoding.

Parameters:
  • path (str) – Path to pretrained tokenizer checkpoint file.

  • name (str) – Name of the tokenizer (used by tiktoken for identification).

  • pattern (str) – Regex pattern used for string parsing.

  • all_special_tokens (Optional[List[str]]) – List of all special tokens. First element must be bos token, second element must be eos token, final element must be python tag. All elements must be unique. Length must be at most 256. Default: None (will use ALL_SPECIAL_TOKENS)

  • bos_token (str) – Beginning of sequence token. Defaults to BEGIN_OF_TEXT.

  • eos_token (str) – End of sequence token. Defaults to END_OF_TEXT.

  • start_header_id (str) – Start header token. Defaults to START_HEADER_ID.

  • end_header_id (str) – End header token. Defaults to END_HEADER_ID.

  • step_id (str) – Step token. Defaults to STEP_ID.

  • eom_id (str) – End of message token. Defaults to EOM_ID.

  • eot_id (str) – End of turn token. Defaults to EOT_ID.

  • python_tag (str) – Python tag token. Defaults to PYTHON_TAG.
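The ordering and uniqueness constraints on all_special_tokens can be checked directly. The sketch below is illustrative (not the torchtune implementation): it builds a plausible special-token list matching the defaults above and validates the documented constraints.

```python
# Hypothetical special-token list matching the constructor defaults above.
special_tokens = [
    "<|begin_of_text|>",    # first element must be the bos token
    "<|end_of_text|>",      # second element must be the eos token
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|step_id|>",
    "<|eom_id|>",
    "<|eot_id|>",
    "<|python_tag|>",       # final element must be the python tag
]

def validate_special_tokens(tokens):
    """Check the constraints documented for all_special_tokens."""
    assert tokens[0] == "<|begin_of_text|>", "bos must come first"
    assert tokens[1] == "<|end_of_text|>", "eos must come second"
    assert tokens[-1] == "<|python_tag|>", "python tag must come last"
    assert len(tokens) == len(set(tokens)), "all elements must be unique"
    assert len(tokens) <= 256, "length must be at most 256"
    return True
```

A list that violates any of these constraints (for example, with the bos token out of position) would raise a ValueError-style failure in the real constructor.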

decode(token_ids: List[int], truncate_at_eos: bool = True) → str[source]

Decode a list of token ids into a string.

Parameters:
  • token_ids (List[int]) – The list of token ids.

  • truncate_at_eos (bool) – Whether to truncate the string at the end of sequence token.

Returns:

The decoded string.

Return type:

str
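The truncate_at_eos behavior can be sketched in isolation. This is a toy model, not the torchtune source: token id 2 stands in for the real eos id, and this sketch drops the eos token itself (the real implementation may retain it).

```python
# Toy eos id; the real id comes from the tokenizer's special-token table.
EOS_ID = 2

def truncate_at_eos(token_ids, eos_id=EOS_ID):
    """Keep tokens up to the first eos token (exclusive, in this sketch)."""
    if eos_id in token_ids:
        return token_ids[: token_ids.index(eos_id)]
    return token_ids
```

With truncate_at_eos=False, decode would instead convert the full token list, eos and any trailing tokens included.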

encode(text: str, add_bos: bool, add_eos: bool) → List[int][source]

Encode a string into a list of token ids. Assumes that the string contains no special tokens.

Parameters:
  • text (str) – The string to encode.

  • add_bos (bool) – Whether to add the beginning of sequence token.

  • add_eos (bool) – Whether to add the end of sequence token.

Returns:

The list of token ids.

Return type:

List[int]
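The add_bos/add_eos flags simply wrap the plain encoding. The sketch below uses toy ids and a one-id-per-character stand-in for tiktoken's Encoding.encode; the real method delegates the middle step to tiktoken.

```python
# Toy ids for the bos and eos tokens; real ids differ per checkpoint.
BOS_ID, EOS_ID = 0, 1

def fake_encode(text):
    """Stand-in for tiktoken's Encoding.encode: one toy id per character."""
    return [ord(c) for c in text]

def encode(text, add_bos, add_eos):
    """Encode text, optionally wrapping it with bos/eos token ids."""
    ids = fake_encode(text)
    if add_bos:
        ids = [BOS_ID] + ids
    if add_eos:
        ids = ids + [EOS_ID]
    return ids
```

Because the method assumes the string contains no special tokens, bos/eos insertion happens only through these flags, never by embedding the literal token strings in text.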

tokenize_message(message: Message, tokenize_header: bool = False) → List[int][source]

Tokenize a message into a list of token ids.

Parameters:
  • message (Message) – The message to tokenize.

  • tokenize_header (bool) – Whether to prepend a tokenized header to each message.

Returns:

The list of token ids.

Return type:

List[int]
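The header that tokenize_header controls follows the Llama3-style chat template, where the message role is wrapped in the header tokens. The format below is an assumption based on the special tokens documented above; check the torchtune source for the exact template.

```python
def format_header(role,
                  start="<|start_header_id|>",
                  end="<|end_header_id|>"):
    """Assumed Llama3-style header: role wrapped in header tokens,
    followed by a blank line before the message content."""
    return f"{start}{role}{end}\n\n"
```

The header string would then be encoded (with the header tokens resolved as special tokens, not plain text) and prepended to the tokenized message body.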

tokenize_messages(messages: List[Message], max_seq_len: Optional[int] = None, tokenize_header: bool = True) → Tuple[List[int], List[bool]][source]

Tokenize a list of messages into a list of token ids and masks.

Parameters:
  • messages (List[Message]) – The list of messages to tokenize.

  • max_seq_len (Optional[int]) – The maximum sequence length; the combined token and mask lists are truncated to this length. Default: None (no truncation).

  • tokenize_header (bool) – Whether to prepend a tokenized header to each message.

Returns:

The list of token ids and the list of masks.

Return type:

Tuple[List[int], List[bool]]
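The aggregation over messages can be sketched as follows. This is a toy model, not the torchtune source: each message is a (text, masked) pair standing in for a Message object, with a one-id-per-character encoding in place of tokenize_message; the real mask marks tokens to be ignored in the loss.

```python
def tokenize_messages(messages, max_seq_len=None):
    """Concatenate per-message token ids, pairing each token with its
    message's mask value, then truncate both lists to max_seq_len."""
    tokens, mask = [], []
    for text, masked in messages:          # stand-in for Message objects
        ids = [ord(c) for c in text]       # toy per-message encoding
        tokens.extend(ids)
        mask.extend([masked] * len(ids))
    if max_seq_len is not None:
        tokens, mask = tokens[:max_seq_len], mask[:max_seq_len]
    return tokens, mask
```

The two returned lists always have equal length, so the mask can be applied elementwise to the token ids downstream.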
