HuggingFaceBaseTokenizer

class torchtune.modules.transforms.tokenizers.HuggingFaceBaseTokenizer(tokenizer_json_path: str, *, tokenizer_config_json_path: Optional[str] = None, generation_config_path: Optional[str] = None)

A wrapper around Hugging Face tokenizers (see https://github.com/huggingface/tokenizers). It loads a Hugging Face tokenizer.json file into a torchtune BaseTokenizer.

This class loads the tokenizer.json file from tokenizer_json_path. It attempts to infer the BOS and EOS token IDs from tokenizer_config.json when that file is provided, and otherwise falls back to inferring them from generation_config.json.
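The fallback order described above can be sketched with the standard library alone. This is an illustration, not the class's actual implementation: the helper name infer_bos_eos_ids and the toy vocab are hypothetical, and the sketch assumes the usual Hugging Face layout in which tokenizer_config.json stores "bos_token" and "eos_token" as strings while generation_config.json stores "bos_token_id" and "eos_token_id" as integers.

```python
import json
import os
import tempfile
from typing import Dict, Optional, Tuple


def infer_bos_eos_ids(
    vocab: Dict[str, int],
    tokenizer_config_json_path: Optional[str] = None,
    generation_config_path: Optional[str] = None,
) -> Tuple[Optional[int], Optional[int]]:
    """Hypothetical helper: try the tokenizer config first, then the generation config."""
    if tokenizer_config_json_path is None and generation_config_path is None:
        raise ValueError("At least one config path must be specified.")

    if tokenizer_config_json_path is not None:
        with open(tokenizer_config_json_path) as f:
            config = json.load(f)
        # Assumption: special tokens are stored as plain strings.
        bos, eos = config.get("bos_token"), config.get("eos_token")
        if isinstance(bos, str) and isinstance(eos, str) and bos in vocab and eos in vocab:
            return vocab[bos], vocab[eos]

    if generation_config_path is not None:
        with open(generation_config_path) as f:
            generation_config = json.load(f)
        # Assumption: the generation config stores the IDs directly.
        return generation_config.get("bos_token_id"), generation_config.get("eos_token_id")

    raise ValueError("Could not infer BOS and EOS token IDs.")


# Toy vocabulary and config files, made up for this demonstration.
vocab = {"<s>": 1, "</s>": 2, "hello": 3}
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "tokenizer_config.json")
    gen_path = os.path.join(tmp, "generation_config.json")
    with open(cfg_path, "w") as f:
        json.dump({"bos_token": "<s>", "eos_token": "</s>"}, f)
    with open(gen_path, "w") as f:
        json.dump({"bos_token_id": 10, "eos_token_id": 11}, f)

    ids_from_config = infer_bos_eos_ids(vocab, tokenizer_config_json_path=cfg_path)
    ids_from_generation = infer_bos_eos_ids(vocab, generation_config_path=gen_path)
```

In this sketch the tokenizer config yields token strings that must be looked up in the vocabulary, whereas the generation config already carries integer IDs.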

Parameters:
  • tokenizer_json_path (str) – Path to tokenizer.json file

  • tokenizer_config_json_path (Optional[str]) – Path to tokenizer_config.json file. Default: None

  • generation_config_path (Optional[str]) – Path to generation_config.json file. Default: None

Raises:

ValueError – If neither tokenizer_config_json_path nor generation_config_path is specified.
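A minimal construction sketch follows. The wrapper load_hf_tokenizer is a hypothetical convenience function, not part of torchtune. It repeats the documented ValueError check up front and imports the class lazily, so the snippet can be loaded even where torchtune is not installed. The file paths in the usage comment are placeholders.

```python
from typing import Optional


def load_hf_tokenizer(
    tokenizer_json_path: str,
    tokenizer_config_json_path: Optional[str] = None,
    generation_config_path: Optional[str] = None,
):
    """Hypothetical wrapper around the HuggingFaceBaseTokenizer constructor."""
    # Fail fast, mirroring the documented ValueError condition.
    if tokenizer_config_json_path is None and generation_config_path is None:
        raise ValueError(
            "Specify tokenizer_config_json_path or generation_config_path."
        )

    # Imported lazily so this module loads without torchtune installed.
    from torchtune.modules.transforms.tokenizers import HuggingFaceBaseTokenizer

    # Both config paths are keyword-only in the documented signature.
    return HuggingFaceBaseTokenizer(
        tokenizer_json_path,
        tokenizer_config_json_path=tokenizer_config_json_path,
        generation_config_path=generation_config_path,
    )


# Example call (placeholder paths):
# tokenizer = load_hf_tokenizer(
#     "/path/to/tokenizer.json",
#     tokenizer_config_json_path="/path/to/tokenizer_config.json",
# )
```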

decode(token_ids: List[int]) → str

Decode a list of token ids into a string.

Parameters:

token_ids (List[int]) – The list of token ids.

Returns:

The decoded string.

Return type:

str

encode(text: str, add_bos: bool = True, add_eos: bool = True) → List[int]

Encodes a string into a list of token ids.

Parameters:
  • text (str) – The text to encode.

  • add_bos (bool) – Whether to prepend the tokenizer’s bos_id to the encoded token ids. Default: True

  • add_eos (bool) – Whether to append the tokenizer’s eos_id to the encoded token ids. Default: True

Returns:

The list of token ids.

Return type:

List[int]
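The effect of add_bos and add_eos can be illustrated with a toy stand-in that shares the encode and decode signatures. ToyWhitespaceTokenizer and its vocabulary are hypothetical and exist only for this example. Splitting on whitespace replaces the real tokenization, and dropping the BOS and EOS IDs in decode is a choice made for the toy, not a statement about how HuggingFaceBaseTokenizer decodes special tokens.

```python
from typing import Dict, List


class ToyWhitespaceTokenizer:
    """Hypothetical stand-in that mimics the encode/decode signatures."""

    def __init__(self, vocab: Dict[str, int], bos_id: int, eos_id: int):
        self.vocab = vocab
        self.id_to_token = {i: t for t, i in vocab.items()}
        self.bos_id = bos_id
        self.eos_id = eos_id

    def encode(self, text: str, add_bos: bool = True, add_eos: bool = True) -> List[int]:
        # Whitespace splitting stands in for the real tokenization step.
        token_ids = [self.vocab[token] for token in text.split()]
        if add_bos:
            token_ids.insert(0, self.bos_id)  # BOS goes at the front
        if add_eos:
            token_ids.append(self.eos_id)  # EOS goes at the end
        return token_ids

    def decode(self, token_ids: List[int]) -> str:
        # Toy choice: skip the BOS and EOS IDs when rebuilding the text.
        special = {self.bos_id, self.eos_id}
        return " ".join(self.id_to_token[i] for i in token_ids if i not in special)


toy = ToyWhitespaceTokenizer(
    {"<s>": 0, "</s>": 1, "hello": 2, "world": 3}, bos_id=0, eos_id=1
)
with_specials = toy.encode("hello world")
without_specials = toy.encode("hello world", add_bos=False, add_eos=False)
round_trip = toy.decode(with_specials)
```

Passing add_bos=False and add_eos=False is useful when the caller inserts special tokens itself, for example while assembling a multi-turn prompt from separately encoded pieces.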
