torchtune.datasets¶
For a detailed general usage guide, please see our datasets tutorial.
Example datasets¶
torchtune supports several widely used datasets to help you quickly bootstrap your fine-tuning; a short usage sketch follows the list below.
alpaca_dataset: Support for a family of Alpaca-style datasets from Hugging Face Datasets, using the data input format and prompt template from the original alpaca codebase, where instruction, input, and output are fields in the dataset.
alpaca_cleaned_dataset: Builder for a variant of Alpaca-style datasets using the cleaned version of the original Alpaca dataset, yahma/alpaca-cleaned.
grammar_dataset: Support for grammar correction datasets and their variants from Hugging Face Datasets.
samsum_dataset: Support for summarization datasets and their variants from Hugging Face Datasets.
slimorca_dataset: Support for the SlimOrca family of conversational datasets.
stack_exchanged_paired_dataset: Family of preference datasets similar to StackExchangePaired data.
cnn_dailymail_articles_dataset: Support for a family of datasets similar to CNN / DailyMail, a corpus of news articles.
wikitext_dataset: Support for a family of datasets similar to wikitext, an unstructured text corpus consisting of articles from Wikipedia.
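Each builder takes a model tokenizer and returns a ready-to-use map-style dataset. The sketch below is illustrative rather than canonical: the tokenizer path is a placeholder, and the exact sample format returned by indexing may differ across torchtune versions.

```python
# A minimal sketch of using a bundled dataset builder.
# Assumptions: "/tmp/tokenizer.model" is a placeholder for your own
# tokenizer checkpoint; alpaca_dataset is called with its defaults.
from torchtune.models.llama2 import llama2_tokenizer
from torchtune.datasets import alpaca_dataset

tokenizer = llama2_tokenizer("/tmp/tokenizer.model")  # placeholder path
ds = alpaca_dataset(tokenizer=tokenizer)

sample = ds[0]  # tokenized prompt + response, with labels for training
```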
Generic dataset builders¶
torchtune also supports generic dataset builders for common formats like instruct and chat datasets. These are especially useful for specifying a dataset directly from a YAML config; an example follows the list below.
instruct_dataset: Build a configurable dataset with instruction prompts.
chat_dataset: Build a configurable dataset with conversations.
text_completion_dataset: Build a configurable dataset from a freeform, unstructured text corpus, similar to datasets used in pre-training.
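As an illustration, the call below builds an instruct dataset from a Hugging Face source. This is a sketch, not a canonical recipe: the source and template values are examples, and the keyword arguments shown follow the builder's documented signature but may vary slightly across versions.

```python
# A minimal sketch of the generic instruct builder.
# Assumptions: "yahma/alpaca-cleaned" stands in for any Hugging Face
# dataset whose columns match the chosen template; tokenizer is the
# model tokenizer from the previous sketch.
from torchtune.datasets import instruct_dataset

ds = instruct_dataset(
    tokenizer=tokenizer,
    source="yahma/alpaca-cleaned",
    template="AlpacaInstructTemplate",
    train_on_input=True,
    max_seq_len=512,
)
```

The same arguments map one-to-one onto keys in a YAML config, with the builder referenced via the _component_ key (torchtune.datasets.instruct_dataset), which is what makes these builders convenient for config-driven runs.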
Generic dataset classes¶
Class representations for the above dataset builders. A sketch showing how these classes compose follows the list.
InstructDataset: Class that supports any custom dataset with instruction-based prompts and a configurable template.
ChatDataset: Class that supports any custom dataset with multiturn conversations.
TextCompletionDataset: Freeform dataset for any unstructured text corpus.
ConcatDataset: A dataset class for concatenating multiple sub-datasets into a single dataset.
PackedDataset: Performs greedy sample packing on a provided dataset.
PreferenceDataset: Class that supports any custom dataset of preference pairs (e.g., chosen and rejected responses) with a configurable template.
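These classes compose with one another and with the builders above. The rough sketch below concatenates two sub-datasets and then greedily packs the result into fixed-length blocks; the max_seq_len value is illustrative, and parameter names beyond the dataset argument may differ across torchtune versions.

```python
# A rough sketch composing the generic dataset classes.
# Assumptions: tokenizer as in the earlier sketches; max_seq_len=4096
# is an illustrative value, not a recommendation.
from torchtune.datasets import (
    ConcatDataset,
    PackedDataset,
    alpaca_dataset,
    grammar_dataset,
)

# Concatenate two sub-datasets into a single map-style dataset...
combined = ConcatDataset([
    alpaca_dataset(tokenizer=tokenizer),
    grammar_dataset(tokenizer=tokenizer),
])

# ...then greedily pack samples into fixed-length blocks for training.
packed = PackedDataset(combined, max_seq_len=4096)
```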