PreferenceDataset¶
- class torchtune.datasets.PreferenceDataset(tokenizer: ModelTokenizer, source: str, template: InstructTemplate, transform: Optional[Callable] = None, column_map: Optional[Dict[str, str]] = None, max_seq_len: Optional[int] = None, **load_dataset_kwargs: Dict[str, Any])[source]¶
Class that supports any custom dataset with instruction-based prompts and a configurable template.
The general flow from loading a sample to tokenized prompt is: load sample -> apply transform -> format into template -> tokenize
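The flow above can be sketched in plain Python. This is an illustrative stand-in, not torchtune's actual implementation: `SAMPLE`, `simple_transform`, `TEMPLATE`, and `fake_tokenize` are all hypothetical placeholders for a real dataset row, transform callable, InstructTemplate, and ModelTokenizer.

```python
# Illustrative sketch of the flow: load sample -> apply transform ->
# format into template -> tokenize. All names here are hypothetical
# stand-ins, not torchtune APIs.

# "Loaded" sample, as a dataset row might look.
SAMPLE = {"instruction": "Summarize the text.", "input": "torchtune is a PyTorch library."}

def simple_transform(sample):
    # Example transform: strip surrounding whitespace from every field.
    return {k: v.strip() for k, v in sample.items()}

# Stand-in for an InstructTemplate's format string.
TEMPLATE = "Instruction: {instruction}\nInput: {input}\nResponse:"

def fake_tokenize(prompt):
    # Stand-in for ModelTokenizer.tokenize_messages: naive whitespace split.
    return prompt.split()

transformed = simple_transform(SAMPLE)   # apply transform
prompt = TEMPLATE.format(**transformed)  # format into template
tokens = fake_tokenize(prompt)           # tokenize
```

A real ModelTokenizer returns token ids (and labels) rather than strings, but the ordering of the steps is the same.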
If the column/key names differ from the expected names in the InstructTemplate, then the column_map argument can be used to provide this mapping.
- Parameters:
tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
source (str) – path string of dataset, anything supported by Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path)
template (InstructTemplate) – template used to format the prompt. If the placeholder variable names in the template do not match the column/key names in the dataset, use column_map to map them.
transform (Optional[Callable]) – transform to apply to the sample before formatting it with the template. Default is None.
column_map (Optional[Dict[str, str]]) – a mapping from the expected placeholder names in the template to the column/key names in the sample. If None, assume these are identical.
max_seq_len (Optional[int]) – Maximum number of tokens in the returned input and label token id lists. Default is None, which disables truncation. We recommend setting this to the highest value that fits in memory and is supported by the model. For example, llama2-7B supports sequence lengths of up to 4096.
**load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset.
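To make the role of column_map concrete, here is a minimal sketch of the remapping it describes. The dataset column names ("question", "context") and the template string are invented for illustration; only the mapping direction (template placeholder name to dataset column name) reflects the documented behavior.

```python
# Hypothetical dataset row whose column names do not match the template.
sample = {"question": "What is torchtune?", "context": "A PyTorch finetuning library."}

# Stand-in template expecting "instruction" and "input" placeholders.
template = "Instruction: {instruction}\nInput: {input}\nResponse:"

# column_map: expected placeholder name -> actual column name in the sample.
column_map = {"instruction": "question", "input": "context"}

# Remap the sample's keys so they match the template's placeholders.
remapped = {placeholder: sample[column] for placeholder, column in column_map.items()}
prompt = template.format(**remapped)
```

If column_map were None, the dataset columns would need to be named "instruction" and "input" already, and the format call would fail on this sample.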