PreferenceDataset¶
- class torchtune.datasets.PreferenceDataset(tokenizer: ModelTokenizer, source: str, template: InstructTemplate, transform: Optional[Callable] = None, column_map: Optional[Dict[str, str]] = None, max_seq_len: Optional[int] = None, **load_dataset_kwargs: Dict[str, Any])[source]¶
Class that supports any custom dataset with instruction-based prompts and a configurable template.
The general flow from loading a sample to tokenized prompt is: load sample -> apply transform -> format into template -> tokenize
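The flow above can be sketched in plain Python. This is an illustrative stand-in, not torchtune's actual implementation: `SAMPLE`, `simple_transform`, `TEMPLATE`, and `fake_tokenize` are all hypothetical placeholders for a real dataset row, transform callable, InstructTemplate, and ModelTokenizer.

```python
# Illustrative sketch of the flow: load sample -> apply transform ->
# format into template -> tokenize. All names here are hypothetical
# stand-ins, not torchtune APIs.

# "Loaded" sample, as a dataset row might look.
SAMPLE = {"instruction": "Summarize the text.", "input": "torchtune is a PyTorch library."}

def simple_transform(sample):
    # Example transform: strip surrounding whitespace from every field.
    return {k: v.strip() for k, v in sample.items()}

# Stand-in for an InstructTemplate's format string.
TEMPLATE = "Instruction: {instruction}\nInput: {input}\nResponse:"

def fake_tokenize(prompt):
    # Stand-in for ModelTokenizer.tokenize_messages: naive whitespace split.
    return prompt.split()

transformed = simple_transform(SAMPLE)   # apply transform
prompt = TEMPLATE.format(**transformed)  # format into template
tokens = fake_tokenize(prompt)           # tokenize
```

A real ModelTokenizer returns token ids (and labels) rather than strings, but the ordering of the steps is the same.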
If the column/key names differ from the expected names in the InstructTemplate, then the column_map argument can be used to provide this mapping.
- Parameters:
tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
source (str) – path string of dataset, anything supported by Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path)
template (InstructTemplate) – template used to format the prompt. If the placeholder variable names in the template do not match the column/key names in the dataset, use column_map to map them.
transform (Optional[Callable]) – transform to apply to the sample before formatting it with the template. Default is None.
column_map (Optional[Dict[str, str]]) – a mapping from the expected placeholder names in the template to the column/key names in the sample. If None, assume these are identical.
max_seq_len (Optional[int]) – Maximum number of tokens in the returned input and label token id lists. Default is None, which disables truncation. We recommend setting this to the highest value that fits in memory and is supported by the model. For example, llama2-7B supports sequence lengths of up to 4096.
**load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset.
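To make the role of column_map concrete, here is a minimal sketch of the remapping it describes. The dataset column names ("question", "context") and the template string are invented for illustration; only the mapping direction (template placeholder name to dataset column name) reflects the documented behavior.

```python
# Hypothetical dataset row whose column names do not match the template.
sample = {"question": "What is torchtune?", "context": "A PyTorch finetuning library."}

# Stand-in template expecting "instruction" and "input" placeholders.
template = "Instruction: {instruction}\nInput: {input}\nResponse:"

# column_map: expected placeholder name -> actual column name in the sample.
column_map = {"instruction": "question", "input": "context"}

# Remap the sample's keys so they match the template's placeholders.
remapped = {placeholder: sample[column] for placeholder, column in column_map.items()}
prompt = template.format(**remapped)
```

If column_map were None, the dataset columns would need to be named "instruction" and "input" already, and the format call would fail on this sample.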