cnn_dailymail_articles_dataset

torchtune.datasets.cnn_dailymail_articles_dataset(tokenizer: ModelTokenizer, source: str = 'ccdv/cnn_dailymail', max_seq_len: Optional[int] = None, filter_fn: Optional[Callable] = None, split: str = 'train', **load_dataset_kwargs: Dict[str, Any]) → TextCompletionDataset

Support for the family of datasets similar to CNN / DailyMail, a corpus of news articles. This builder extracts only the articles, not the highlights, for general text completion tasks.

Parameters:

tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
source (str) – path string of dataset, anything supported by Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path)
max_seq_len (Optional[int]) – Maximum number of tokens in the returned input and label token id lists. Default is None, which disables truncation. We recommend setting this to the largest value that fits in memory and is supported by the model. For example, llama2-7B supports sequence lengths up to 4096.
filter_fn (Optional[Callable]) – callable used to filter the dataset prior to any pre-processing. See the Hugging Face docs for more details.
split (str) – split argument for datasets.load_dataset. You can use this argument to load a subset of a given split, e.g. split="train[:10%]". Default is “train”.
**load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset.
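
To make the roles of filter_fn and max_seq_len concrete, here is a minimal, library-free sketch of the behaviour the parameters describe: rows are filtered before any pre-processing, only the article text is tokenized, and the token list is cut to max_seq_len. It is not torchtune's implementation. toy_encode and prepare_articles are hypothetical names, the "article" and "highlights" keys are assumed to match the dataset's columns, and treating the labels as a copy of the tokens is an assumption about text completion. A real tokenizer may also add special tokens that this sketch ignores.

```python
from typing import Callable, Dict, List, Optional


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one id (the word length) per word."""
    return [len(word) for word in text.split()]


def prepare_articles(
    rows: List[Dict[str, str]],
    max_seq_len: Optional[int] = None,
    filter_fn: Optional[Callable[[Dict[str, str]], bool]] = None,
) -> List[Dict[str, List[int]]]:
    # Filtering happens first, before any tokenization.
    if filter_fn is not None:
        rows = [row for row in rows if filter_fn(row)]

    samples = []
    for row in rows:
        # Only the article is used; the highlights column is ignored.
        tokens = toy_encode(row["article"])
        # max_seq_len=None disables truncation.
        if max_seq_len is not None:
            tokens = tokens[:max_seq_len]
        # Assumption: for text completion the labels mirror the tokens.
        samples.append({"tokens": tokens, "labels": list(tokens)})
    return samples


rows = [
    {"article": "the quick brown fox jumps", "highlights": "fox jumps"},
    {"article": "short", "highlights": "s"},
]

# Keep only articles longer than 10 characters, then truncate to 3 tokens.
samples = prepare_articles(
    rows,
    max_seq_len=3,
    filter_fn=lambda row: len(row["article"]) > 10,
)
```

With these toy rows, the second article is dropped by the filter and the first is cut from five tokens to three.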

Returns:

the configured TextCompletionDataset

Return type:

TextCompletionDataset
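
A call to the builder might look like the sketch below. It assumes torchtune is installed, that a Llama 2 tokenizer file exists locally at whatever path you pass in, and that the dataset can be downloaded. build_articles_dataset is a hypothetical wrapper name, and llama2_tokenizer is used only as one example of a ModelTokenizer.

```python
def build_articles_dataset(tokenizer_path: str, max_seq_len: int = 4096):
    """Hypothetical helper: build the articles dataset for a Llama 2 tokenizer."""
    # Imports live inside the function so that defining it does not
    # require torchtune to be installed.
    from torchtune.datasets import cnn_dailymail_articles_dataset
    from torchtune.models.llama2 import llama2_tokenizer

    # tokenizer_path is supplied by the caller; no default location is assumed.
    tokenizer = llama2_tokenizer(tokenizer_path)

    return cnn_dailymail_articles_dataset(
        tokenizer=tokenizer,
        max_seq_len=max_seq_len,  # 4096 matches the llama2-7B limit noted above
        split="train[:10%]",      # load only the first 10% of the train split
    )
```

Loading a small slice such as train[:10%] keeps the first run short while you check that the tokenizer and sequence length are set as intended.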
