torchtune.datasets
For a detailed general usage guide, please see our datasets tutorial.
Text datasets
torchtune supports several widely used text-only datasets to help quickly bootstrap your fine-tuning.
Support for the family of Alpaca-style datasets from Hugging Face Datasets, using the data input format and prompt template from the original Alpaca codebase.

Builder for a variant of Alpaca-style datasets that uses the cleaned version of the original Alpaca dataset, yahma/alpaca-cleaned.

Support for grammar correction datasets and their variants from Hugging Face Datasets.

Constructs preference datasets similar to Anthropic's helpful/harmless RLHF data.

Support for summarization datasets and their variants from Hugging Face Datasets.

Support for the SlimOrca-style family of conversational datasets.

Family of preference datasets similar to the Stack Exchange Paired dataset.

Support for the family of datasets similar to CNN / DailyMail, a corpus of news articles.

Support for the family of datasets similar to wikitext, an unstructured text corpus consisting of full articles from Wikipedia.
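To make the Alpaca "data input format and prompt template" concrete, here is a minimal sketch that renders one sample with the two prompt templates published in the original Alpaca codebase. The helper name `format_alpaca_prompt` is hypothetical and is not part of torchtune; the field names `instruction`, `input`, and `output` are assumed to follow the original Alpaca dataset.

```python
# Prompt templates as published in the original Stanford Alpaca codebase.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_alpaca_prompt(sample: dict) -> str:
    """Render one Alpaca-style sample (hypothetical helper, not torchtune API)."""
    # Samples with a non-empty "input" field use the longer template.
    if sample.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=sample["instruction"], input=sample["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=sample["instruction"])


# Illustrative sample; the "output" field is the target the model learns to produce.
sample = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
}
prompt = format_alpaca_prompt(sample)
print(prompt + sample["output"])
```

During fine-tuning, the rendered prompt and the `output` text are tokenized together; whether the prompt tokens contribute to the loss is a separate configuration choice.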
Image + Text datasets
Support for the family of image + text datasets similar to LLaVA-Instruct-150K from Hugging Face Datasets.

Support for the family of image + text datasets similar to The Cauldron from Hugging Face Datasets.
Generic dataset builders
torchtune also provides generic dataset builders for common data formats, such as chat and instruct datasets. These are especially useful when specifying a dataset from a YAML config.
Configure a custom dataset with user instruction prompts and model responses.

Configure a custom dataset with conversations between user and model assistant.

Configures a custom preference dataset comprising interactions between user and model assistant.

Build a configurable dataset from a freeform, unstructured text corpus similar to datasets used in pre-training.
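The following sketch illustrates the general idea of selecting a builder from a config: a mapping (as parsed from YAML) names a callable by dotted path, and the remaining keys become keyword arguments. It assumes the `_component_` key convention for naming the target. The resolver `instantiate_component` is a hypothetical, simplified stand-in and not torchtune's own implementation, and a standard-library callable is used as the target so the example runs without torchtune installed.

```python
import importlib
from fractions import Fraction


def instantiate_component(config: dict, *args):
    """Resolve the dotted path under `_component_` and call it.

    Hypothetical, simplified stand-in for a config system; it only
    illustrates the pattern and is not torchtune's own code.
    """
    kwargs = dict(config)  # copy so the caller's config is not mutated
    module_path, _, attr_name = kwargs.pop("_component_").rpartition(".")
    builder = getattr(importlib.import_module(module_path), attr_name)
    # Positional args (e.g. a tokenizer) come from the caller,
    # keyword args come from the config.
    return builder(*args, **kwargs)


# What a parsed YAML mapping might look like. In a real config the path
# would point at a dataset builder and the keys at its parameters; the
# stdlib target here is an assumption made to keep the sketch runnable.
dataset_cfg = {
    "_component_": "fractions.Fraction",
    "numerator": 3,
    "denominator": 4,
}

result = instantiate_component(dataset_cfg)
print(result)
```

Keeping the builder's arguments as plain keyword parameters is what makes this pattern work: every option can be expressed as a key in the config without writing any glue code.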
Generic dataset classes
Class representations for the above dataset builders.
Freeform dataset for any unstructured text corpus.

A dataset class for concatenating multiple sub-datasets into a single dataset.

Performs greedy sample packing on a provided dataset.

Primary class for fine-tuning via preference modelling techniques (e.g. training a preference model for RLHF, or directly optimizing a model through DPO) on a preference dataset sourced from Hugging Face Hub, local files, or remote files. This class requires the dataset to have "chosen" and "rejected" model responses. These are typically full conversations between user and assistant in separate columns.

Primary class for creating any dataset for supervised fine-tuning from Hugging Face Hub, local files, or remote files.
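As an illustration of what "greedy sample packing" means in general, the sketch below concatenates tokenized samples in order into packs of at most `max_seq_len` tokens. The function `greedy_pack` is hypothetical, and its handling of oversized samples (truncation) and of samples that do not fit (start a new pack) are assumptions of this sketch; the packed dataset class may behave differently, for example in padding or in how it treats sample boundaries.

```python
from typing import Iterable, List


def greedy_pack(samples: Iterable[List[int]], max_seq_len: int) -> List[List[int]]:
    """Greedily pack token sequences into packs of at most max_seq_len tokens.

    Samples are consumed in order. A sample that does not fit in the
    current pack starts a new one. Samples longer than max_seq_len are
    truncated (an assumption of this sketch).
    """
    packs: List[List[int]] = []
    current: List[int] = []
    for tokens in samples:
        tokens = tokens[:max_seq_len]
        # Close the current pack if adding this sample would overflow it.
        if current and len(current) + len(tokens) > max_seq_len:
            packs.append(current)
            current = []
        current = current + tokens  # build a new list; inputs stay untouched
    if current:
        packs.append(current)
    return packs


# Four short "tokenized" samples packed into sequences of length <= 5.
packs = greedy_pack([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]], max_seq_len=5)
print(packs)
```

Packing reduces the number of padding tokens per batch, at the cost of placing unrelated samples in the same sequence, so attention masking across sample boundaries usually needs to be handled separately.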