
Datasets Overview

torchtune lets you fine-tune LLMs and VLMs on any dataset hosted on the Hugging Face Hub, stored locally, or available at a remote URL. We provide built-in dataset builders to help you quickly bootstrap your fine-tuning project for workflows including instruct tuning, preference alignment, continued pretraining, and more. Beyond those, torchtune lets you fully customize your dataset pipeline, so you can train on any data format or schema.

The following tasks are supported:

Data pipeline

[Figure: torchtune dataset pipeline diagram (torchtune_datasets.svg)]

From raw data samples to the model inputs in the training recipe, all torchtune datasets follow the same pipeline:

  1. Raw data is queried one sample at a time from a Hugging Face dataset, local file, or remote file.

  2. Message Transforms convert the raw sample, which can take any format, into a list of torchtune Messages. Images are contained in the message object they are associated with.

  3. Multimodal Transforms apply model-specific transforms to the messages, including tokenization (see Tokenizers), prompt templating (see Prompt Templates), image transforms, and anything else required for that particular model.

  4. The collater packages the processed samples into a batch, which is then passed to the model during training.
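
The four steps above can be sketched in plain Python to show how data flows from a raw sample to a padded batch. This is a minimal illustration, not torchtune's API: the `Message` dataclass, `message_transform`, `ToyModelTransform`, and `collate` names, the `question`/`answer` keys, and the whitespace tokenizer with role tags are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Message:
    """Hypothetical stand-in for a message: a role plus text content."""
    role: str
    content: str


def message_transform(sample: Dict[str, Any]) -> List[Message]:
    """Step 2: map a raw sample (assumed 'question'/'answer' keys) to messages."""
    return [
        Message(role="user", content=sample["question"]),
        Message(role="assistant", content=sample["answer"]),
    ]


class ToyModelTransform:
    """Step 3: toy whitespace tokenizer that grows its vocabulary on the fly.

    Each message is prefixed with a role tag, standing in for prompt templating.
    """

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {"<pad>": 0}

    def __call__(self, messages: List[Message]) -> Dict[str, List[int]]:
        tokens: List[int] = []
        for msg in messages:
            for word in [f"<{msg.role}>"] + msg.content.split():
                # Reuse the id of a known word, otherwise assign the next free id.
                tokens.append(self.vocab.setdefault(word, len(self.vocab)))
        return {"tokens": tokens}


def collate(batch: List[Dict[str, List[int]]], pad_id: int = 0) -> List[List[int]]:
    """Step 4: right-pad every sample to the longest length in the batch."""
    max_len = max(len(sample["tokens"]) for sample in batch)
    return [
        sample["tokens"] + [pad_id] * (max_len - len(sample["tokens"]))
        for sample in batch
    ]


# Step 1: raw samples, queried one at a time (an in-memory list in this sketch).
raw_data = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

model_transform = ToyModelTransform()
processed = [model_transform(message_transform(sample)) for sample in raw_data]
batch = collate(processed)
```

Keeping the message transform separate from the model transform means the same raw-data adapter can be reused across models, while everything model-specific stays in one place. In a real training recipe the collated batch would be converted to tensors before being passed to the model.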
