TextCompletionDataset

class torchtune.datasets.TextCompletionDataset(tokenizer: ModelTokenizer, source: str, column: str = 'text', add_eos: bool = True, filter_fn: Optional[Callable] = None, **load_dataset_kwargs: Dict[str, Any])[source]

Freeform dataset for any unstructured text corpus. Quickly load any dataset from Hugging Face or local disk and tokenize it for your model.

Parameters:
  • tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.

  • source (str) – Path to the dataset repository on the Hugging Face Hub. For local datasets, define source as the data file type (e.g. “json”, “csv”, “text”) and pass the file path in data_files. See Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path) for more details.

  • column (str) – Name of the column in the sample that contains the text data. This is typically required for Hugging Face datasets or tabular data. For local datasets with a single column (e.g. unstructured txt files), use the default “text”, which is the column name Hugging Face datasets assigns when loading the data into memory. Default is “text”.

  • add_eos (bool) – Whether to add an EOS token to the end of the sequence. Default is True.

  • filter_fn (Optional[Callable]) – Callable used to filter the dataset prior to any pre-processing. See the Hugging Face docs for more details.

  • **load_dataset_kwargs (Dict[str, Any]) – Additional keyword arguments to pass to load_dataset, such as data_files or split.
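
Example:

A minimal construction sketch (not taken from the official docs): the tokenizer path, dataset names, and file names below are placeholders, and the llama3_tokenizer builder is assumed to be available in your torchtune install.

    from torchtune.datasets import TextCompletionDataset
    from torchtune.models.llama3 import llama3_tokenizer

    # Placeholder path to a local tokenizer file.
    tokenizer = llama3_tokenizer(path="/tmp/Meta-Llama-3-8B/original/tokenizer.model")

    # Hugging Face Hub dataset: pass the repo name as `source`; extra keyword
    # arguments such as `name` and `split` are forwarded to load_dataset.
    ds_hub = TextCompletionDataset(
        tokenizer=tokenizer,
        source="wikitext",
        column="text",
        name="wikitext-2-raw-v1",
        split="train",
    )

    # Local file: set `source` to the file type and point `data_files` at the path.
    ds_local = TextCompletionDataset(
        tokenizer=tokenizer,
        source="text",
        column="text",
        data_files="my_corpus.txt",
        split="train",
    )

    sample = ds_local[0]
    print(sample.keys())  # expect tokenized fields, e.g. "tokens" and "labels"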
