torchtune.datasets.InstructDataset

torchtune.datasets.InstructDataset = <function InstructDataset>

Note

This class is deprecated and will be removed in a future release. Please use SFTDataset or instruct_dataset() for custom instruct data.

Class that supports any custom dataset with instruction-based prompts and a configurable template.

The general flow from loading a sample to tokenized prompt is: load sample -> apply transform -> format into template -> tokenize
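The flow above can be sketched in plain Python. The stand-in tokenizer and template here are illustrative, not torchtune's actual implementations:

```python
# Minimal sketch of: load sample -> apply transform -> format into template -> tokenize.
# mock_tokenize is a hypothetical stand-in for ModelTokenizer.

def mock_tokenize(text):
    # Stand-in tokenizer: one pseudo token id per whitespace-separated word.
    return [hash(w) % 1000 for w in text.split()]

def prepare_sample(sample, transform, template):
    sample = transform(sample) if transform else sample  # apply transform (optional)
    prompt = template.format(**sample)                   # format into template
    return mock_tokenize(prompt)                         # tokenize

template = "Instruction: {instruction}\nResponse: "
sample = {"instruction": "Summarize the article."}
tokens = prepare_sample(sample, None, template)
print(len(tokens))
```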

If the column/key names differ from the expected names in the InstructTemplate, then the column_map argument can be used to provide this mapping.

Masking of the prompt during training is controlled by the train_on_input flag, which is set to False by default.
  • If train_on_input is True, the prompt is used during training and contributes to the loss.
  • If train_on_input is False, the prompt is masked out (its label tokens are replaced with -100).
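The masking behavior can be sketched as follows, assuming the common convention that label positions set to -100 are ignored by the loss (as with PyTorch's CrossEntropyLoss ignore_index); build_labels is a hypothetical helper:

```python
# Sketch of prompt masking controlled by train_on_input.
CROSS_ENTROPY_IGNORE_IDX = -100

def build_labels(prompt_tokens, response_tokens, train_on_input):
    input_ids = prompt_tokens + response_tokens
    if train_on_input:
        # Prompt tokens contribute to the loss.
        labels = list(input_ids)
    else:
        # Prompt tokens are masked out with -100 and ignored by the loss.
        labels = [CROSS_ENTROPY_IGNORE_IDX] * len(prompt_tokens) + list(response_tokens)
    return input_ids, labels

inp, lab = build_labels([11, 12, 13], [21, 22], train_on_input=False)
print(lab)  # [-100, -100, -100, 21, 22]
```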

Parameters:
  • tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.

  • source (str) – path to dataset repository on Hugging Face. For local datasets, define source as the data file type (e.g. “json”, “csv”, “text”) and pass in the filepath in data_files. See Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path) for more details.

  • template (InstructTemplate) – template used to format the prompt. If the placeholder variable names in the template do not match the column/key names in the dataset, use column_map to map them.

  • transform (Optional[Callable]) – transform to apply to the sample before formatting to the template. Default is None.

  • column_map (Optional[Dict[str, str]]) – a mapping from the expected placeholder names in the template to the column/key names in the sample. If None, assume these are identical. The output column can be indicated using the output key mapping. If no placeholder for the output column is provided in column_map it is assumed to be output.

  • train_on_input (bool) – Whether the model is trained on the prompt or not. Default is False.

  • max_seq_len (Optional[int]) – Maximum number of tokens in the returned input and label token id lists. Default is None, which disables truncation. We recommend setting this to the highest value that fits in memory and is supported by the model. For example, Llama2-7B supports sequence lengths up to 4096.

  • **load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset, such as data_files or split.
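The column_map remapping described above can be sketched as a simple key rename before template formatting. The dataset column names ("question", "answer"), placeholder names ("instruction", "output"), and the apply_column_map helper are all hypothetical:

```python
# Sketch of how a column_map remaps dataset columns to template placeholders.

def apply_column_map(sample, column_map):
    # column_map: expected placeholder name -> actual column name in the sample.
    return {placeholder: sample[column] for placeholder, column in column_map.items()}

raw = {"question": "What is 2 + 2?", "answer": "4"}
column_map = {"instruction": "question", "output": "answer"}
mapped = apply_column_map(raw, column_map)
print(mapped["instruction"])  # What is 2 + 2?
```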

Raises:

ValueError – If template is not an instance of torchtune.data.InstructTemplate
