torchtune.datasets¶
For a detailed general usage guide, please see our datasets tutorial.
Example datasets¶
torchtune supports several widely used datasets to help you quickly bootstrap your fine-tuning; a short usage sketch follows the list below.
alpaca_dataset: Support for a family of Alpaca-style datasets from Hugging Face Datasets, using the data input format and prompt template from the original alpaca codebase, where instruction, input, and output are fields in the dataset.
alpaca_cleaned_dataset: Builder for a variant of Alpaca-style datasets using the cleaned version of the original Alpaca dataset, yahma/alpaca-cleaned.
grammar_dataset: Support for grammar correction datasets and their variants from Hugging Face Datasets.
samsum_dataset: Support for summarization datasets and their variants from Hugging Face Datasets.
slimorca_dataset: Support for the SlimOrca family of conversational datasets.
stack_exchanged_paired_dataset: Family of preference datasets similar to StackExchangePaired data.
cnn_dailymail_articles_dataset: Support for a family of datasets similar to CNN / DailyMail, a corpus of news articles.
wikitext_dataset: Support for a family of datasets similar to wikitext, an unstructured text corpus consisting of articles from Wikipedia.
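Each builder takes a model tokenizer and returns a ready-to-use map-style dataset. The sketch below is illustrative rather than canonical: the tokenizer path is a placeholder, and the exact sample format returned by indexing may differ across torchtune versions.

```python
# A minimal sketch of using a bundled dataset builder.
# Assumptions: "/tmp/tokenizer.model" is a placeholder for your own
# tokenizer checkpoint; alpaca_dataset is called with its defaults.
from torchtune.models.llama2 import llama2_tokenizer
from torchtune.datasets import alpaca_dataset

tokenizer = llama2_tokenizer("/tmp/tokenizer.model")  # placeholder path
ds = alpaca_dataset(tokenizer=tokenizer)

sample = ds[0]  # tokenized prompt + response, with labels for training
```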
Generic dataset builders¶
torchtune also supports generic dataset builders for common formats like instruct and chat datasets. These are especially useful for specifying a dataset directly from a YAML config; an example follows the list below.
instruct_dataset: Build a configurable dataset with instruction prompts.
chat_dataset: Build a configurable dataset with conversations.
text_completion_dataset: Build a configurable dataset from a freeform, unstructured text corpus, similar to datasets used in pre-training.
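As an illustration, the call below builds an instruct dataset from a Hugging Face source. This is a sketch, not a canonical recipe: the source and template values are examples, and the keyword arguments shown follow the builder's documented signature but may vary slightly across versions.

```python
# A minimal sketch of the generic instruct builder.
# Assumptions: "yahma/alpaca-cleaned" stands in for any Hugging Face
# dataset whose columns match the chosen template; tokenizer is the
# model tokenizer from the previous sketch.
from torchtune.datasets import instruct_dataset

ds = instruct_dataset(
    tokenizer=tokenizer,
    source="yahma/alpaca-cleaned",
    template="AlpacaInstructTemplate",
    train_on_input=True,
    max_seq_len=512,
)
```

The same arguments map one-to-one onto keys in a YAML config, with the builder referenced via the _component_ key (torchtune.datasets.instruct_dataset), which is what makes these builders convenient for config-driven runs.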
Generic dataset classes¶
Class representations for the above dataset builders. A sketch showing how these classes compose follows the list.
InstructDataset: Class that supports any custom dataset with instruction-based prompts and a configurable template.
ChatDataset: Class that supports any custom dataset with multiturn conversations.
TextCompletionDataset: Freeform dataset for any unstructured text corpus.
ConcatDataset: A dataset class for concatenating multiple sub-datasets into a single dataset.
PackedDataset: Performs greedy sample packing on a provided dataset.
PreferenceDataset: Class that supports any custom dataset of preference pairs (e.g., chosen and rejected responses) with a configurable template.
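These classes compose with one another and with the builders above. The rough sketch below concatenates two sub-datasets and then greedily packs the result into fixed-length blocks; the max_seq_len value is illustrative, and parameter names beyond the dataset argument may differ across torchtune versions.

```python
# A rough sketch composing the generic dataset classes.
# Assumptions: tokenizer as in the earlier sketches; max_seq_len=4096
# is an illustrative value, not a recommendation.
from torchtune.datasets import (
    ConcatDataset,
    PackedDataset,
    alpaca_dataset,
    grammar_dataset,
)

# Concatenate two sub-datasets into a single map-style dataset...
combined = ConcatDataset([
    alpaca_dataset(tokenizer=tokenizer),
    grammar_dataset(tokenizer=tokenizer),
])

# ...then greedily pack samples into fixed-length blocks for training.
packed = PackedDataset(combined, max_seq_len=4096)
```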