BaseTokenizer

class torchtune.modules.tokenizers.BaseTokenizer(*args, **kwargs)

Abstract token encoding model that declares the encode and decode methods. See SentencePieceBaseTokenizer and TikTokenBaseTokenizer for example implementations of this protocol.

decode(token_ids: List[int], **kwargs: Dict[str, Any]) → str

Given a list of token ids, return the decoded text, optionally including special tokens.

Parameters:
  • token_ids (List[int]) – The list of token ids to decode.

  • **kwargs (Dict[str, Any]) – Additional keyword arguments; their meaning is defined by the implementing tokenizer.

Returns:

The decoded text.

Return type:

str

encode(text: str, **kwargs: Dict[str, Any]) → List[int]

Given a string, return the encoded list of token ids.

Parameters:
  • text (str) – The text to encode.

  • **kwargs (Dict[str, Any]) – Additional keyword arguments; their meaning is defined by the implementing tokenizer.

Returns:

The encoded list of token ids.

Return type:

List[int]
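
As an illustration of the two method signatures above, here is a minimal sketch of a class with the same encode/decode shape. CharTokenizer is a hypothetical toy written for this example and is not part of torchtune; it maps each character to its Unicode code point and back, and it ignores any extra keyword arguments.

```python
from typing import Any, Dict, List


class CharTokenizer:
    """Hypothetical toy tokenizer: one token id per Unicode code point."""

    def encode(self, text: str, **kwargs: Dict[str, Any]) -> List[int]:
        # Map each character to its Unicode code point.
        return [ord(ch) for ch in text]

    def decode(self, token_ids: List[int], **kwargs: Dict[str, Any]) -> str:
        # Invert encode() by mapping each id back to its character.
        return "".join(chr(i) for i in token_ids)


tokenizer = CharTokenizer()
ids = tokenizer.encode("hi!")
text = tokenizer.decode(ids)
print(ids, text)
```

A real implementation would typically wrap a trained vocabulary and use the keyword arguments for its own options; the sketch only shows the method shapes and the round trip from text to ids and back.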
