torchtune.datasets

For a detailed general usage guide, please see our datasets tutorial.

Example datasets

torchtune supports several widely used datasets to help quickly bootstrap your fine-tuning.

alpaca_dataset

Support for the family of Alpaca-style datasets from Hugging Face Datasets. Uses the data input format and prompt template from the original Alpaca codebase, where instruction, input, and output are fields from the dataset.
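To make the prompt template concrete, here is a minimal sketch of Alpaca-style formatting. The template wording follows the original Alpaca repository. The helper name format_alpaca_prompt is hypothetical and is not part of torchtune, whose own implementation may differ in details.

```python
# Prompt templates in the style of the original Alpaca codebase.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def format_alpaca_prompt(sample: dict) -> str:
    """Hypothetical helper: choose a template based on whether `input` is non-empty."""
    if sample.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=sample["instruction"], input=sample["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=sample["instruction"])


# The `output` field is the training target and is not part of the prompt.
sample = {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
prompt = format_alpaca_prompt(sample)
print(prompt)
```

Samples with an empty input field fall back to the shorter template, so the prompt never contains a blank "### Input:" section.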

alpaca_cleaned_dataset

Builder for a variant of the Alpaca-style datasets that uses yahma/alpaca-cleaned, a cleaned version of the original Alpaca dataset.

grammar_dataset

Support for grammar correction datasets and their variants from Hugging Face Datasets.

samsum_dataset

Support for summarization datasets and their variants from Hugging Face Datasets.

slimorca_dataset

Support for the SlimOrca-style family of conversational datasets.

stack_exchanged_paired_dataset

Family of preference datasets similar to StackExchangePaired data.

cnn_dailymail_articles_dataset

Support for the family of datasets similar to CNN / DailyMail, a corpus of news articles.

wikitext_dataset

Support for the family of datasets similar to wikitext, an unstructured text corpus consisting of articles from Wikipedia.

Generic dataset builders

torchtune also supports generic dataset builders for common data formats, such as chat and instruct datasets. These are especially useful when specifying a dataset from a YAML config.
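As a rough illustration of why builders suit config files, the sketch below resolves a dotted path from a config mapping and calls it with the remaining keys as keyword arguments. The `_component_` key name and the instantiate helper are assumptions made for this example, and a standard-library class stands in for a dataset builder so the snippet runs without torchtune installed.

```python
import importlib
from typing import Any, Dict


def instantiate(config: Dict[str, Any]) -> Any:
    """Hypothetical helper: call the object named by `_component_` with the other keys."""
    kwargs = dict(config)  # copy so the caller's config is left untouched
    dotted_path = kwargs.pop("_component_")
    module_name, _, attr = dotted_path.rpartition(".")
    builder = getattr(importlib.import_module(module_name), attr)
    return builder(**kwargs)


# A parsed YAML config is just a nested mapping; this dict plays that role.
# `fractions.Fraction` is a stand-in for a dataset builder function.
config = {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}
value = instantiate(config)
print(value)
```

Because a builder is a plain function taking keyword arguments, every option can be expressed as a key in the config without writing any Python.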

instruct_dataset

Build a configurable dataset with instruction prompts.

chat_dataset

Build a configurable dataset with conversations.

text_completion_dataset

Build a configurable dataset from a freeform, unstructured text corpus similar to datasets used in pre-training.

Generic dataset classes

Class representations for the above dataset builders.

InstructDataset

Class that supports any custom dataset with instruction-based prompts and a configurable template.

ChatDataset

Class that supports any custom dataset with multiturn conversations.

TextCompletionDataset

Freeform dataset for any unstructured text corpus.

ConcatDataset

A dataset class for concatenating multiple sub-datasets into a single dataset.
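The idea of concatenation can be sketched with cumulative lengths: a global index is mapped to a sub-dataset and a local index within it. SimpleConcat below is a hypothetical stand-in written for illustration, not torchtune's ConcatDataset, and it assumes each sub-dataset supports len() and integer indexing.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Any, List, Sequence


class SimpleConcat:
    """Hypothetical stand-in that presents several sequences as one dataset."""

    def __init__(self, datasets: List[Sequence[Any]]) -> None:
        self._datasets = datasets
        # Running totals of lengths, e.g. lengths [2, 3] give ends [2, 5].
        self._ends = list(accumulate(len(d) for d in datasets))

    def __len__(self) -> int:
        return self._ends[-1] if self._ends else 0

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < len(self):
            raise IndexError(index)
        # First sub-dataset whose end offset is strictly greater than index.
        ds_idx = bisect_right(self._ends, index)
        start = self._ends[ds_idx - 1] if ds_idx > 0 else 0
        return self._datasets[ds_idx][index - start]


combined = SimpleConcat([["a0", "a1"], ["b0", "b1", "b2"]])
print(len(combined), combined[2])
```

Storing only the running totals keeps lookup logarithmic in the number of sub-datasets and avoids copying any samples.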

PackedDataset

Performs greedy sample packing on a provided dataset.
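Greedy packing can be sketched as follows: tokenized samples are appended to the current pack in order, and a new pack starts whenever the next sample would exceed the maximum sequence length. The greedy_pack function is a hypothetical simplification. It truncates oversized samples and does not pad, split samples across packs, or build attention masks, all of which a real implementation such as PackedDataset may handle differently.

```python
from typing import Iterable, List


def greedy_pack(samples: Iterable[List[int]], max_seq_len: int) -> List[List[int]]:
    """Hypothetical sketch: pack token lists in order, never exceeding max_seq_len."""
    packs: List[List[int]] = []
    current: List[int] = []
    for tokens in samples:
        # Assumption for this sketch: oversized samples are truncated.
        tokens = tokens[:max_seq_len]
        # Close the current pack if this sample would not fit.
        if current and len(current) + len(tokens) > max_seq_len:
            packs.append(current)
            current = []
        current = current + tokens
    if current:
        packs.append(current)
    return packs


packs = greedy_pack([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]], max_seq_len=5)
print(packs)
```

The approach is greedy because each sample goes into the current pack or a fresh one, with no search for a better arrangement of earlier samples.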

PreferenceDataset

Class that supports any custom preference dataset with instruction-based prompts, a configurable template, and paired chosen and rejected responses.
