torchtune.datasets

Example datasets
torchtune supports several widely used datasets to help you quickly bootstrap your fine-tuning.
alpaca_dataset: Support for the family of Alpaca-style datasets from Hugging Face Datasets, using the data input format and prompt template from the original alpaca codebase, where instruction, input, and output are fields from the dataset.

alpaca_cleaned_dataset: Same as above, but loads the cleaned version of the original Alpaca dataset.

grammar_dataset: Support for grammar correction datasets and their variants from Hugging Face Datasets.

samsum_dataset: Support for summarization datasets and their variants from Hugging Face Datasets.

slimorca_dataset: Support for the SlimOrca-style family of conversational datasets.
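Any of the example builders can be selected from a YAML config. A minimal sketch, assuming torchtune's `_component_` instantiation syntax and the `train_on_input` parameter of `alpaca_dataset`:

```yaml
# Select the Alpaca example dataset from config (sketch; assumes the
# _component_ syntax used by torchtune configs)
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: True
```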
Generic dataset builders
torchtune also supports generic dataset builders for common formats like chat models and instruct models. These are especially useful for specifying a custom dataset from a YAML config.
instruct_dataset: Build a configurable dataset with instruction prompts.

chat_dataset: Build a configurable dataset with conversations.
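Because the generic builders take the dataset source and formatting as arguments, a custom dataset can be described entirely in config. A sketch, assuming `instruct_dataset` accepts `source`, `template`, `train_on_input`, and `max_seq_len` parameters (verify names against the current API reference):

```yaml
# Point instruct_dataset at a Hugging Face dataset (sketch; parameter
# names here are assumptions about the builder's signature)
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: yahma/alpaca-cleaned
  template: AlpacaInstructTemplate
  train_on_input: True
  max_seq_len: 512
```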
Generic dataset classes
Class representations for the above dataset builders.
InstructDataset: Class that supports any custom dataset with instruction-based prompts and a configurable template.

ChatDataset: Class that supports any custom dataset with multiturn conversations.

ConcatDataset: A dataset class for concatenating multiple sub-datasets into a single dataset.

PackedDataset: Performs greedy sample packing on a provided dataset.
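Greedy sample packing concatenates tokenized samples into fixed-length buffers to cut down on padding waste. The idea can be sketched in plain Python; this illustrates the technique only, not torchtune's actual PackedDataset implementation, and greedy_pack is a hypothetical helper:

```python
def greedy_pack(samples, max_seq_len):
    """Greedily pack token sequences into buffers of at most max_seq_len.

    A new pack is started whenever the next sample would overflow the
    current buffer; samples longer than max_seq_len are truncated.
    (Sketch of the greedy packing idea, not torchtune's implementation.)
    """
    packs, current = [], []
    for tokens in samples:
        tokens = tokens[:max_seq_len]  # truncate overlong samples
        if len(current) + len(tokens) > max_seq_len:
            packs.append(current)
            current = []
        current.extend(tokens)
    if current:
        packs.append(current)
    return packs
```

In a real dataset wrapper, each pack would also carry attention masks or sequence-boundary labels so that packed samples do not attend to one another.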