samsum_dataset
- torchtune.datasets.samsum_dataset(tokenizer: Tokenizer, source: str = 'samsum', train_on_input: bool = False) -> InstructDataset
Support for summarization datasets and their variants from Hugging Face Datasets. An example is the SAMSum dataset.
The prompt template mirrors the one used in the llama_recipes codebase,
where dialogue and summary are fields from the dataset.
Masking of the prompt during training is controlled by the train_on_input flag, which is set to False by default.
- If train_on_input is True, the prompt is used during training and contributes to the loss.
- If train_on_input is False, the prompt is masked out (tokens replaced with -100).
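The masking rule above can be illustrated with a minimal sketch. This is not torchtune's implementation: build_labels and IGNORE_INDEX are hypothetical names introduced here, and the token IDs are made up. It only shows how labels could be derived from a tokenized prompt and summary under each setting of train_on_input.

```python
from typing import List, Tuple

# Label value for masked positions, as described above. It matches the
# default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


def build_labels(
    prompt_tokens: List[int],
    response_tokens: List[int],
    train_on_input: bool = False,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: return (tokens, labels) for one sample."""
    tokens = list(prompt_tokens) + list(response_tokens)
    if train_on_input:
        # The prompt contributes to the loss, so labels mirror the tokens.
        labels = list(tokens)
    else:
        # The prompt is masked out: its label positions are replaced with -100.
        labels = [IGNORE_INDEX] * len(prompt_tokens) + list(response_tokens)
    return tokens, labels


# Made-up token IDs: three prompt tokens followed by two summary tokens.
tokens, labels = build_labels([11, 12, 13], [21, 22])
print(tokens)
print(labels)
```

With train_on_input=False only the summary positions keep real labels, so a loss that skips -100 is computed on the summary alone.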
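- Parameters:
tokenizer (Tokenizer) – tokenizer applied to the prompt and summary text.
source (str) – dataset to load from Hugging Face Datasets. Default is 'samsum'.
train_on_input (bool) – whether the prompt contributes to the loss. Default is False.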
- Returns:
dataset configured with source data and template
- Return type:
InstructDataset
Example
>>> from torch.utils.data import DataLoader
>>> samsum_ds = samsum_dataset(tokenizer=tokenizer)
>>> for batch in DataLoader(samsum_ds, batch_size=8):
...     print(f"Batch size: {len(batch)}")
Batch size: 8