

torchtune.datasets.samsum_dataset(tokenizer: ModelTokenizer, *, source: str = 'samsum', train_on_input: bool = False, packed: bool = False) → InstructDataset

Support for summarization datasets and their variants from Hugging Face Datasets. An example is the SAMsum dataset.

The prompt template mirrors what is used in the llama_recipes codebase, where dialogue and summary are fields from the dataset.
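To make the role of the dialogue and summary fields concrete, here is a minimal sketch of how a summarization prompt could be assembled from one dataset row. The template wording, the build_example helper, and the sample row are all hypothetical and chosen for illustration; they are not copied from torchtune or llama_recipes.

```python
# Hypothetical template for illustration only; the exact wording is an
# assumption and is not taken from torchtune or llama_recipes.
SUMMARIZE_TEMPLATE = "Summarize this dialogue:\n{dialogue}\n---\nSummary:\n"


def build_example(sample: dict) -> tuple[str, str]:
    """Return (prompt, target) built from one dataset row."""
    # The dialogue field fills the prompt; the summary field is the target.
    prompt = SUMMARIZE_TEMPLATE.format(dialogue=sample["dialogue"])
    return prompt, sample["summary"]


# A made-up row with the two fields the template relies on.
row = {
    "dialogue": "Amanda: I baked cookies. Do you want some?\nJerry: Sure!",
    "summary": "Amanda baked cookies and will bring Jerry some.",
}
prompt, target = build_example(row)
print(prompt + target)
```

The split into prompt and target matters because only the target is guaranteed to contribute to the loss; whether the prompt does depends on train_on_input.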

Masking of the prompt during training is controlled by the train_on_input flag, which is set to False by default:

  • If train_on_input is True, the prompt is used during training and contributes to the loss.

  • If train_on_input is False, the prompt is masked out (tokens replaced with -100).
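The effect of train_on_input can be sketched in a few lines of plain Python. The build_labels helper and the token IDs below are hypothetical and only illustrate the idea; this is not torchtune's implementation. The value -100 is the default ignore_index of torch.nn.CrossEntropyLoss, so positions carrying it are skipped when the loss is computed.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_tokens, response_tokens, train_on_input=False):
    """Hypothetical helper: return (tokens, labels) for one sample."""
    tokens = list(prompt_tokens) + list(response_tokens)
    if train_on_input:
        # The prompt contributes to the loss: labels equal the tokens.
        labels = list(tokens)
    else:
        # The prompt is masked out: its label positions become -100.
        labels = [IGNORE_INDEX] * len(prompt_tokens) + list(response_tokens)
    return tokens, labels


# Made-up token IDs: three prompt tokens followed by two response tokens.
tokens, labels = build_labels([11, 12, 13], [21, 22])
print(labels)  # [-100, -100, -100, 21, 22]

_, unmasked = build_labels([11, 12, 13], [21, 22], train_on_input=True)
print(unmasked)  # [11, 12, 13, 21, 22]
```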

Parameters:

  • tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.

  • source (str) – path string of dataset, anything supported by Hugging Face’s load_dataset.

  • train_on_input (bool) – Whether the model is trained on the prompt or not. Default is False.

  • packed (bool) – Whether or not to pack the dataset to max_seq_len prior to training. Default is False.


Returns:

Dataset configured with the source data and template.

Return type:

InstructDataset



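>>> from torch.utils.data import DataLoader
>>> from torchtune.datasets import samsum_dataset
>>> # tokenizer is assumed to be an already-constructed ModelTokenizer
>>> samsum_ds = samsum_dataset(tokenizer=tokenizer)
>>> for batch in DataLoader(samsum_ds, batch_size=8):
...     print(f"Batch size: {len(batch)}")
Batch size: 8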
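The packed option described above concatenates several tokenized samples into sequences of at most max_seq_len tokens, so that less of each batch is spent on padding. The following is a minimal greedy sketch of that idea under stated assumptions: pack_greedy is a hypothetical helper, truncating oversize samples is an assumption, and this is not torchtune's packing code. A real implementation would typically also need to track sample boundaries inside each pack, which this sketch omits.

```python
def pack_greedy(samples, max_seq_len):
    """Concatenate token lists into packs of at most max_seq_len tokens."""
    packs, current = [], []
    for tokens in samples:
        # Assumption: samples longer than max_seq_len are truncated.
        tokens = tokens[:max_seq_len]
        # Start a new pack when the next sample would overflow this one.
        if current and len(current) + len(tokens) > max_seq_len:
            packs.append(current)
            current = []
        current = current + tokens
    if current:
        packs.append(current)
    return packs


# Made-up tokenized samples of different lengths.
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packs = pack_greedy(samples, max_seq_len=5)
print(packs)  # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```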

