Shortcuts

torchtune.datasets

For a detailed general usage guide, please see Datasets Overview.

Text datasets

torchtune supports several widely used text-only datasets to help quickly bootstrap your fine-tuning.

alpaca_dataset

Support for family of Alpaca-style datasets from Hugging Face Datasets using the data input format and prompt template from the original alpaca codebase, where instruction, input, and output are fields from the dataset.

alpaca_cleaned_dataset

Builder for a variant of Alpaca-style datasets with the cleaned version of the original Alpaca dataset, yahma/alpaca-cleaned.

grammar_dataset

Support for grammar correction datasets and their variants from Hugging Face Datasets.

hh_rlhf_helpful_dataset

Constructs preference datasets similar to Anthropic's helpful/harmless RLHF data.

samsum_dataset

Support for summarization datasets and their variants from Hugging Face Datasets.

slimorca_dataset

Support for SlimOrca-style family of conversational datasets.

stack_exchange_paired_dataset

Family of preference datasets similar to the Stack Exchange Paired dataset.

cnn_dailymail_articles_dataset

Support for family of datasets similar to CNN / DailyMail, a corpus of news articles.

wikitext_dataset

Support for family of datasets similar to wikitext, an unstructured text corpus consisting of fulls articles from Wikipedia.

Image + Text datasets

multimodal.llava_instruct_dataset

Support for family of image + text datasets similar to LLaVA-Instruct-150K from Hugging Face Datasets.

multimodal.the_cauldron_dataset

Support for family of image + text datasets similar to The Cauldron from Hugging Face Datasets.

multimodal.vqa_dataset

Configure a custom visual question answer dataset with separate columns for user question, image, and model response.

Generic dataset builders

torchtune also supports generic dataset builders for common formats like chat models and instruct models. These are especially useful for specifying from a YAML config.

instruct_dataset

Configure a custom dataset with user instruction prompts and model responses.

chat_dataset

Configure a custom dataset with conversations between user and model assistant.

preference_dataset

Configures a custom preference dataset comprising interactions between user and model assistant.

text_completion_dataset

Build a configurable dataset from a freeform, unstructured text corpus similar to datasets used in pre-training.

Generic dataset classes

Class representations for the above dataset builders.

TextCompletionDataset

Freeform dataset for any unstructured text corpus.

ConcatDataset

A dataset class for concatenating multiple sub-datasets into a single dataset.

PackedDataset

Performs greedy sample packing on a provided dataset.

PreferenceDataset

Primary class for fine-tuning via preference modelling techniques (e.g. training a preference model for RLHF, or directly optimizing a model through DPO) on a preference dataset sourced from Hugging Face Hub, local files, or remote files. This class requires the dataset to have "chosen" and "rejected" model responses. These are typically either full conversations between user and assistant in separate columns::.

SFTDataset

Primary class for creating any dataset for supervised fine-tuning either from Hugging Face Hub, local files, or remote files.

Docs

Access comprehensive developer documentation for PyTorch

View Docs

Tutorials

Get in-depth tutorials for beginners and advanced developers

View Tutorials

Resources

Find development resources and get your questions answered

View Resources