torchtune.datasets.ChatDataset

- torchtune.datasets.ChatDataset = <function ChatDataset>
Note
This class is deprecated and will be removed in a future release. Please use
SFTDataset or chat_dataset() for custom chat data.

Class that supports any custom dataset with multi-turn conversations.

The general flow from loading a sample to tokenized prompt is: load sample -> apply transform -> for each turn: format into template -> tokenize.
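Since the class is deprecated, new code should go through the chat_dataset() builder named in the note above. The sketch below shows a minimal migration; the exact keyword arguments (e.g. conversation_style, data_files) vary across torchtune releases and are assumptions here, so check the builder's signature for your version:

    # Hedged migration sketch: the kwargs below are assumptions and may differ by release.
    from torchtune.datasets import chat_dataset
    from torchtune.models.llama2 import llama2_tokenizer

    tokenizer = llama2_tokenizer("/path/to/tokenizer.model")
    ds = chat_dataset(
        tokenizer=tokenizer,
        source="json",
        data_files="data/my_chats.json",  # hypothetical local file
        conversation_style="sharegpt",    # assumption: named conversation schema
        train_on_input=False,
    )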
Use convert_to_messages to prepare your dataset into the Llama2 chat format and roles:

    [
        Message(
            role=<system|user|assistant>,
            content=<message>,
        ),
        ...
    ]
This class supports multi-turn conversations. If a tokenized sample with multiple turns does not fit within max_seq_len, it is truncated.

- Parameters:
    - tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
    - source (str) – Path to dataset repository on Hugging Face. For local datasets, define source as the data file type (e.g. "json", "csv", "text") and pass the filepath in data_files. See Hugging Face's load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path) for more details.
    - convert_to_messages (Callable[[Mapping[str, Any]], List[Message]]) – Function that keys into the desired field in the sample and converts it to a list of Message that follows the Llama format with the expected keys.
    - chat_format (Optional[ChatFormat]) – Template used to format the chat. This adds structured text around the actual messages, such as the [INST] tags in Llama2 and Mistral. The extra text will still get tokenized as normal text, not as special tokens. In models like Llama3, where the tokenizer adds tags as special tokens, chat_format is not needed unless you want to structure messages in a particular way for inference.
    - max_seq_len (int) – Maximum number of tokens in the returned input and label token id lists.
    - train_on_input (bool) – Whether the model is trained on the prompt or not. Default is False.
    - **load_dataset_kwargs (Dict[str, Any]) – Additional keyword arguments to pass to load_dataset, such as data_files or split.
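Putting the parameters together, constructing the (deprecated) dataset directly might look like the following sketch. The converter is the one sketched above, the file path is hypothetical, and passing the Llama2ChatFormat class for chat_format is an assumption based on older releases:

    from torchtune.data import Llama2ChatFormat
    from torchtune.datasets import ChatDataset
    from torchtune.models.llama2 import llama2_tokenizer

    tokenizer = llama2_tokenizer("/path/to/tokenizer.model")
    ds = ChatDataset(
        tokenizer=tokenizer,
        source="json",
        convert_to_messages=to_messages,  # converter sketched above
        chat_format=Llama2ChatFormat,     # assumption: format class passed directly
        max_seq_len=2048,
        train_on_input=False,
        data_files="data/my_chats.json",  # hypothetical local file
        split="train",
    )

    sample = ds[0]  # assumption: each sample is a dict with "tokens" and "labels"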