cnn_dailymail_articles_dataset

torchtune.datasets.cnn_dailymail_articles_dataset(tokenizer: ModelTokenizer, source: str = 'ccdv/cnn_dailymail', max_seq_len: Optional[int] = None, filter_fn: Optional[Callable] = None, split: str = 'train', **load_dataset_kwargs: Dict[str, Any]) -> TextCompletionDataset

Builder for the family of datasets similar to CNN / DailyMail, a corpus of news articles. It extracts only the articles, not the highlights, so the result is suited to general text completion tasks.

Parameters:
  • tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.

  • source (str) – Path string of the dataset. Accepts anything supported by the path argument of Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path). Default is “ccdv/cnn_dailymail”.

  • max_seq_len (Optional[int]) – Maximum number of tokens in the returned input and label token id lists. Default is None, which disables truncation. We recommend setting this to the largest value that both fits in memory and is supported by the model. For example, llama2-7B supports sequence lengths up to 4096.

  • filter_fn (Optional[Callable]) – Callable used to filter the dataset prior to any pre-processing. See the Hugging Face docs for more details. Default is None.

  • split (str) – split argument for datasets.load_dataset. You can use this argument to load a subset of a given split, e.g. split="train[:10%]". Default is “train”.

  • **load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset.

Returns:

the configured TextCompletionDataset

Return type:

TextCompletionDataset
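
The parameters above can be combined as in the following minimal sketch. It is an illustration under stated assumptions, not a canonical recipe: the MIN_WORDS threshold and the keep_long_articles and build_dataset helpers are hypothetical names introduced here, the tokenizer path is a placeholder you must supply, and the filter assumes each row exposes an "article" text column, as the default source appears to. The torchtune imports are deferred into build_dataset so that the filter can be tried on its own.

```python
from typing import Any, Mapping

MIN_WORDS = 50  # hypothetical threshold, chosen only for illustration


def keep_long_articles(example: Mapping[str, Any]) -> bool:
    """Return True for rows whose article has at least MIN_WORDS words."""
    # Assumption: each row exposes an "article" text column.
    return len(example["article"].split()) >= MIN_WORDS


def build_dataset(tokenizer_path: str):
    """Build the dataset; needs torchtune installed and access to the dataset."""
    # Imported lazily so the filter above works without torchtune installed.
    from torchtune.datasets import cnn_dailymail_articles_dataset
    from torchtune.models.llama2 import llama2_tokenizer

    # tokenizer_path is a placeholder for a local Llama 2 tokenizer model file.
    tokenizer = llama2_tokenizer(tokenizer_path)
    return cnn_dailymail_articles_dataset(
        tokenizer=tokenizer,
        max_seq_len=4096,              # llama2-7B limit mentioned above
        filter_fn=keep_long_articles,  # applied before any pre-processing
        split="train[:1%]",            # load only a slice of the train split
    )
```

Calling build_dataset with a valid tokenizer path should return a TextCompletionDataset. Using a small slice such as "train[:1%]" keeps a first run short while you check that the filter and max_seq_len behave as intended.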
