Datasets Overview¶
torchtune lets you fine-tune LLMs and VLMs using any dataset found on the Hugging Face Hub, downloaded locally, or hosted at a remote URL. We provide built-in dataset builders to help you quickly bootstrap your fine-tuning project for workflows including instruct tuning, preference alignment, continued pretraining, and more. Beyond those, torchtune gives you full control over your dataset pipeline, letting you train on any data format or schema.
The following tasks are supported:
- Text supervised fine-tuning
- Multimodal supervised fine-tuning
- RLHF
- Continued pre-training
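As a sketch of how a built-in builder is typically wired up, the config fragment below points a recipe at a builder via torchtune's `_component_` convention. The exact fields vary by builder and recipe; treat this as an illustrative example rather than a complete config.

```yaml
# Illustrative recipe config fragment: select a built-in dataset builder.
# Additional builder arguments (e.g. packing, splits) depend on the builder.
dataset:
  _component_: torchtune.datasets.alpaca_dataset
```

Swapping datasets is then a one-line config change, since every builder produces samples that flow through the same pipeline described below.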
Data pipeline¶
From raw data samples to the model inputs in the training recipe, all torchtune datasets follow the same pipeline:
1. Raw data is queried one sample at a time from a Hugging Face dataset, local file, or remote file.
2. Message transforms convert the raw sample, which can take any format, into a list of torchtune Messages. Images are contained in the message object they are associated with.
3. Model transforms apply model-specific processing to the messages, including tokenization (see Tokenizers), prompt templating (see Prompt Templates), image transforms, and anything else required for that particular model.
4. The collator packages the processed samples into a batch, and the batch is passed into the model during training.
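The steps above can be sketched end to end with plain Python. The `Message` class, transforms, and collator below are hypothetical stand-ins for illustration (a toy whitespace tokenizer replaces a real model tokenizer); they are not torchtune's actual implementations.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical minimal stand-in for torchtune's Message (illustration only).
@dataclass
class Message:
    role: str      # e.g. "user" or "assistant"
    content: str

def message_transform(raw_sample: dict[str, Any]) -> list[Message]:
    # Step 2: map an arbitrary raw schema onto a list of Messages.
    return [
        Message(role="user", content=raw_sample["instruction"]),
        Message(role="assistant", content=raw_sample["output"]),
    ]

def model_transform(messages: list[Message], vocab: dict[str, int]) -> list[int]:
    # Step 3: model-specific processing; here, a toy whitespace tokenizer
    # that grows the vocabulary on the fly.
    tokens: list[int] = []
    for msg in messages:
        tokens.extend(vocab.setdefault(w, len(vocab)) for w in msg.content.split())
    return tokens

def collate(batch: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    # Step 4: pad every sample to the longest sequence in the batch.
    max_len = max(len(toks) for toks in batch)
    return [toks + [pad_id] * (max_len - len(toks)) for toks in batch]

# Step 1: raw samples, as they might arrive from a Hugging Face dataset.
raw = [
    {"instruction": "Say hi", "output": "hi there"},
    {"instruction": "Count to three", "output": "one two three"},
]
vocab: dict[str, int] = {"<pad>": 0}
batch = collate([model_transform(message_transform(s), vocab) for s in raw])
```

In real recipes the collator also builds attention masks and shifts labels, and the batch tensors are what actually reach the model's forward pass.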