torchtune.datasets.InstructDataset

torchtune.datasets.InstructDataset = <function InstructDataset>[source]
Note
This class is deprecated and will be removed in a future release. Please use SFTDataset or instruct_dataset() for custom instruct data.

Class that supports any custom dataset with instruction-based prompts and a configurable template.

The general flow from loading a sample to the tokenized prompt is: load sample -> apply transform -> format into template -> tokenize
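The flow above can be sketched in plain Python. This is an illustrative stand-in, not the torchtune implementation: the helper names and the toy whitespace tokenizer are hypothetical.

```python
# Sketch of load sample -> apply transform -> format into template -> tokenize.
# All helpers here are hypothetical stand-ins, not torchtune APIs.
def prepare(sample, transform, template, tokenize):
    sample = transform(sample)          # apply transform to the raw sample
    prompt = template.format(**sample)  # format into the prompt template
    return tokenize(prompt)             # tokenize the formatted prompt

tokens = prepare(
    {"instruction": "Say hi", "output": "hi"},
    transform=lambda s: {**s, "instruction": s["instruction"].strip()},
    template="Instruction: {instruction}\nResponse: {output}",
    tokenize=str.split,  # toy whitespace "tokenizer" for illustration
)
print(tokens)
# -> ['Instruction:', 'Say', 'hi', 'Response:', 'hi']
```

In the real class, the template is an InstructTemplate and the tokenizer is the model's ModelTokenizer.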
If the column/key names differ from the expected names in the InstructTemplate, the column_map argument can be used to provide this mapping.

Masking of the prompt during training is controlled by the train_on_input flag, which is set to False by default.

- If train_on_input is True, the prompt is used during training and contributes to the loss.
- If train_on_input is False, the prompt is masked out (tokens replaced with -100).

Parameters:
tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.

source (str) – path to dataset repository on Hugging Face. For local datasets, define source as the data file type (e.g. "json", "csv", "text") and pass in the filepath in data_files. See Hugging Face's load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path) for more details.

template (InstructTemplate) – template used to format the prompt. If the placeholder variable names in the template do not match the column/key names in the dataset, use column_map to map them.

transform (Optional[Callable]) – transform to apply to the sample before formatting it into the template. Default is None.

column_map (Optional[Dict[str, str]]) – a mapping from the expected placeholder names in the template to the column/key names in the sample. If None, these are assumed to be identical. The output column can be indicated using the output key mapping. If no placeholder for the output column is provided in column_map, it is assumed to be output.

train_on_input (bool) – Whether the model is trained on the prompt or not. Default is False.

max_seq_len (Optional[int]) – Maximum number of tokens in the returned input and label token id lists. Default is None, which disables truncation. We recommend setting this to the highest value that fits in memory and is supported by the model. For example, Llama2-7B supports sequence lengths up to 4096.

**load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset, such as data_files or split.
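The effect of train_on_input can be sketched without torchtune: when it is False, the label positions that correspond to prompt tokens are set to -100 so the cross-entropy loss ignores them. The token ids below are illustrative.

```python
# Illustrative sketch of prompt masking, not the torchtune implementation.
# -100 is the conventional ignore index for cross-entropy loss.
CROSS_ENTROPY_IGNORE_IDX = -100

def build_labels(prompt_tokens, response_tokens, train_on_input):
    if train_on_input:
        # Prompt contributes to the loss: labels mirror the input tokens.
        return list(prompt_tokens) + list(response_tokens)
    # Prompt is masked out of the loss.
    return [CROSS_ENTROPY_IGNORE_IDX] * len(prompt_tokens) + list(response_tokens)

prompt = [1, 15, 27]    # e.g. tokenized instruction (hypothetical ids)
response = [42, 99, 2]  # e.g. tokenized answer (hypothetical ids)
print(build_labels(prompt, response, train_on_input=False))
# -> [-100, -100, -100, 42, 99, 2]
```

With train_on_input=True the same call would return [1, 15, 27, 42, 99, 2], i.e. the prompt tokens also contribute to the loss.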
Raises:
ValueError – If template is not an instance of torchtune.data.InstructTemplate.
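The role of column_map can also be sketched in plain Python: it maps each placeholder name the template expects to the column name actually present in the sample. The field names below ("question", "answer") are hypothetical.

```python
# Sketch of column_map remapping, not the torchtune implementation.
def remap_sample(sample, column_map=None):
    if column_map is None:
        # No mapping given: assume placeholder and column names are identical.
        return dict(sample)
    # column_map maps placeholder name -> dataset column name.
    return {placeholder: sample[column] for placeholder, column in column_map.items()}

raw = {"question": "What is 2+2?", "answer": "4"}
mapped = remap_sample(raw, column_map={"instruction": "question", "output": "answer"})
print(mapped)
# -> {'instruction': 'What is 2+2?', 'output': '4'}
```

Note the direction: keys are the template's placeholder names, values are the dataset's column names, matching the column_map parameter described above.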