chat_dataset
torchtune.datasets.chat_dataset(*, tokenizer: ModelTokenizer, source: str, conversation_style: str, chat_format: Optional[str] = None, max_seq_len: int, train_on_input: bool = False, packed: bool = False, **load_dataset_kwargs: Dict[str, Any]) → ChatDataset
Build a configurable dataset with conversations. This method should be used to configure a custom chat dataset from the yaml config instead of using ChatDataset directly, as it is made to be config friendly.

- Parameters:
  - tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
  - source (str) – path string of dataset, anything supported by Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path).
  - conversation_style (str) – string specifying the expected style of conversations in the dataset for automatic conversion to the Message structure. Supported styles are: “sharegpt”, “openai” (an illustrative “sharegpt” record is sketched after this list).
  - chat_format (Optional[str]) – full import path of the ChatFormat class used to format the messages. See the description in ChatDataset for more details. For a list of all possible chat formats, check out Text templates. Default: None.
  - max_seq_len (int) – Maximum number of tokens in the returned input and label token id lists.
  - train_on_input (bool) – Whether the model is trained on the prompt or not. Default is False.
  - packed (bool) – Whether or not to pack the dataset to max_seq_len prior to training. Default is False.
  - **load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset.
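For concreteness, a “sharegpt”-style sample typically stores its turns under a conversations key, with each turn tagged by speaker. The record below is purely illustrative: the field names follow the common ShareGPT convention and are not a guarantee about any particular dataset.

# Illustrative "sharegpt"-style record; field names follow the common
# ShareGPT convention and may differ across datasets.
sample = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "The capital of France is Paris."},
    ]
}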
Examples
>>> from torchtune.datasets import chat_dataset
>>> dataset = chat_dataset(
...     tokenizer=tokenizer,
...     source="HuggingFaceH4/no_robots",
...     conversation_style="sharegpt",
...     chat_format="torchtune.data.ChatMLFormat",
...     max_seq_len=2096,
...     train_on_input=True
... )
This can also be accomplished via the yaml config:
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: HuggingFaceH4/no_robots
  conversation_style: sharegpt
  chat_format: torchtune.data.ChatMLFormat
  max_seq_len: 2096
  train_on_input: True
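Continuing the Python example above, one way to sanity-check the result is to pull a single sample. The snippet below is a minimal sketch that assumes each item comes back as a (tokens, labels) pair of token id lists; the DataLoader wiring is illustrative only, since batching variable-length samples also requires a padding collate function (not shown) appropriate to your training recipe.

>>> # Minimal sketch, assuming each item is a (tokens, labels) pair of
>>> # token id lists (prompt positions are masked in labels when
>>> # train_on_input=False).
>>> tokens, labels = dataset[0]
>>> assert len(tokens) == len(labels)
>>>
>>> # Illustrative only: a padding collate function is needed to batch
>>> # variable-length sequences.
>>> from torch.utils.data import DataLoader
>>> loader = DataLoader(dataset, batch_size=2, shuffle=True)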
- Returns:
  the configured ChatDataset, or PackedDataset if packed=True
- Return type:
  Union[ChatDataset, PackedDataset]
- Raises:
  ValueError – if the conversation format is not supported