chat_dataset
torchtune.datasets.chat_dataset(*, tokenizer: ModelTokenizer, source: str, conversation_style: str, chat_format: Optional[str] = None, max_seq_len: int, train_on_input: bool = False, packed: bool = False, **load_dataset_kwargs: Dict[str, Any]) → ChatDataset
Build a configurable dataset with conversations. This method should be used to configure a custom chat dataset from the yaml config instead of using ChatDataset directly, as it is made to be config friendly.

- Parameters:
  - tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
  - source (str) – path string of dataset, anything supported by Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path).
  - conversation_style (str) – string specifying the expected style of conversations in the dataset for automatic conversion to the Message structure. Supported styles are: “sharegpt”, “openai” (an illustrative “sharegpt” record is sketched after this list).
  - chat_format (Optional[str]) – full import path of the ChatFormat class used to format the messages. See the description in ChatDataset for more details. For a list of all possible chat formats, check out Text templates. Default: None.
  - max_seq_len (int) – Maximum number of tokens in the returned input and label token id lists.
  - train_on_input (bool) – Whether the model is trained on the prompt or not. Default is False.
  - packed (bool) – Whether or not to pack the dataset to max_seq_len prior to training. Default is False.
  - **load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset.
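For concreteness, a “sharegpt”-style sample typically stores its turns under a conversations key, with each turn tagged by speaker. The record below is purely illustrative: the field names follow the common ShareGPT convention and are not a guarantee about any particular dataset.

# Illustrative "sharegpt"-style record; field names follow the common
# ShareGPT convention and may differ across datasets.
sample = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
    ]
}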
Examples
>>> from torchtune.datasets import chat_dataset
>>> dataset = chat_dataset(
...     tokenizer=tokenizer,
...     source="HuggingFaceH4/no_robots",
...     conversation_style="sharegpt",
...     chat_format="torchtune.data.ChatMLFormat",
...     max_seq_len=2096,
...     train_on_input=True
... )
This can also be accomplished via the yaml config:
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: HuggingFaceH4/no_robots
  conversation_style: sharegpt
  chat_format: torchtune.data.ChatMLFormat
  max_seq_len: 2096
  train_on_input: True
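Continuing the Python example above, one way to sanity-check the result is to pull a single sample. The snippet below is a minimal sketch that assumes each item comes back as a (tokens, labels) pair of token id lists; the DataLoader wiring is illustrative only, since batching variable-length samples also requires a padding collate function (not shown) appropriate to your training recipe.

>>> # Minimal sketch, assuming each item is a (tokens, labels) pair of
>>> # token id lists (prompt positions are masked in labels when
>>> # train_on_input=False).
>>> tokens, labels = dataset[0]
>>> assert len(tokens) == len(labels)
>>>
>>> # Illustrative only: a padding collate function is needed to batch
>>> # variable-length sequences.
>>> from torch.utils.data import DataLoader
>>> loader = DataLoader(dataset, batch_size=2, shuffle=True)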
- Returns:
  the configured ChatDataset, or PackedDataset if packed=True
- Return type:
  Union[ChatDataset, PackedDataset]
- Raises:
  ValueError – if the conversation format is not supported