
torchtune.datasets

Example datasets

torchtune supports several widely used datasets to help you quickly bootstrap your fine-tuning.

alpaca_dataset

Support for a family of Alpaca-style datasets from Hugging Face Datasets, using the data input format and prompt template from the original Alpaca codebase, where instruction, input, and output are fields in the dataset.
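
A rough sketch of how this builder is typically used, assuming a Llama-2 tokenizer built with llama2_tokenizer and an illustrative local tokenizer.model path (keyword names may differ slightly across torchtune versions):

    from torchtune.datasets import alpaca_dataset
    from torchtune.models.llama2 import llama2_tokenizer

    # Build a tokenizer from a local SentencePiece model file (path is illustrative).
    tokenizer = llama2_tokenizer("/tmp/llama2/tokenizer.model")

    # Each sample is tokenized with the Alpaca prompt template applied.
    ds = alpaca_dataset(
        tokenizer=tokenizer,
        train_on_input=True,  # also compute loss on the prompt tokens
        max_seq_len=512,
    )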

alpaca_cleaned_dataset

Identical to alpaca_dataset, but loads the cleaned version of the original Alpaca data (yahma/alpaca-cleaned), which fixes issues in the originally released dataset.

grammar_dataset

Support for grammar correction datasets and their variants from Hugging Face Datasets.

samsum_dataset

Support for summarization datasets and their variants from Hugging Face Datasets.
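
grammar_dataset and samsum_dataset follow the same call pattern as alpaca_dataset; a minimal sketch reusing the tokenizer from the example above (the train_on_input keyword is an assumption based on the builders' defaults):

    from torchtune.datasets import grammar_dataset, samsum_dataset

    # Grammar correction pairs.
    grammar_ds = grammar_dataset(tokenizer=tokenizer, train_on_input=False)

    # Dialogue summarization pairs from the SAMSum corpus.
    samsum_ds = samsum_dataset(tokenizer=tokenizer, train_on_input=False)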

slimorca_dataset

Support for the SlimOrca-style family of conversational datasets.
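
A sketch of the SlimOrca builder, which yields multi-turn conversations rather than single instruction/response pairs (the max_seq_len value is illustrative):

    from torchtune.datasets import slimorca_dataset

    # Conversations from Open-Orca/SlimOrca-Dedup, formatted as chat turns.
    ds = slimorca_dataset(tokenizer=tokenizer, max_seq_len=1024)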

Generic dataset builders

torchtune also provides generic dataset builders for common formats such as instruct-style and chat-style data. These are especially useful for specifying a custom dataset directly from a YAML recipe config, as shown below.
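
For example, a recipe config can point at a builder via torchtune's _component_ key and pass the builder's keyword arguments as fields. This is a sketch: the source and template values are placeholders, and the exact field names track the builder's signature in your torchtune version.

    dataset:
      _component_: torchtune.datasets.instruct_dataset
      source: yahma/alpaca-cleaned
      template: torchtune.data.AlpacaInstructTemplate
      train_on_input: False
      max_seq_len: 512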

instruct_dataset

Build a configurable dataset with instruction prompts.
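
A minimal Python sketch; source can be any Hugging Face dataset path (or local files) accepted by load_dataset, and the template shown is one of torchtune's built-in prompt templates. Whether the template is passed as a bare class name or a full import path varies by version.

    from torchtune.datasets import instruct_dataset

    ds = instruct_dataset(
        tokenizer=tokenizer,
        source="yahma/alpaca-cleaned",
        template="torchtune.data.AlpacaInstructTemplate",
        train_on_input=False,
        max_seq_len=512,
    )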

chat_dataset

Build a configurable dataset with conversations.
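
A sketch of the conversational counterpart, assuming a ShareGPT-style source and the Llama-2 chat format; the conversation_style and chat_format values are assumptions and must match your data and model.

    from torchtune.datasets import chat_dataset

    ds = chat_dataset(
        tokenizer=tokenizer,
        source="Open-Orca/SlimOrca-Dedup",
        conversation_style="sharegpt",
        chat_format="torchtune.data.Llama2ChatFormat",
        max_seq_len=2048,
    )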

Generic dataset classes

Underlying class implementations of the dataset builders above, along with utility classes for combining and packing datasets.

InstructDataset

Class that supports any custom dataset with instruction-based prompts and a configurable template.

ChatDataset

Class that supports any custom dataset with multiturn conversations.

ConcatDataset

A dataset class for concatenating multiple sub-datasets into a single dataset.
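
A sketch of combining two of the example datasets above into a single training set; ConcatDataset takes a list of already-built datasets.

    from torchtune.datasets import ConcatDataset, alpaca_dataset, grammar_dataset

    combined = ConcatDataset([
        alpaca_dataset(tokenizer=tokenizer),
        grammar_dataset(tokenizer=tokenizer),
    ])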

PackedDataset

Performs greedy sample packing on a provided dataset.
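
A sketch of packing a dataset up front so that each element is a fixed-length block of concatenated samples; many of the builders above also expose a packed flag that wraps the result for you. The max_seq_len keyword here is an assumption.

    from torchtune.datasets import PackedDataset, alpaca_dataset

    # Greedily concatenate tokenized samples into blocks of at most max_seq_len tokens.
    packed = PackedDataset(alpaca_dataset(tokenizer=tokenizer), max_seq_len=2048)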
