TextCompletionDataset

class torchtune.datasets.TextCompletionDataset(tokenizer: ModelTokenizer, source: str, column: str = 'text', add_eos: bool = True, filter_fn: Optional[Callable] = None, **load_dataset_kwargs: Dict[str, Any])

Freeform dataset for any unstructured text corpus. Quickly load any dataset from Hugging Face or local disk and tokenize it for your model.

Parameters:

tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
source (str) – Path to the dataset repository on Hugging Face. For local datasets, set source to the data file type (e.g. “json”, “csv”, “text”) and pass the file path in data_files. See Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path) for more details.
column (str) – Name of the column in each sample that contains the text data. This is typically required for Hugging Face datasets or tabular data. For local datasets with a single column (e.g. unstructured txt files), keep the default “text”, which is the column name Hugging Face datasets uses when such files are loaded into memory. Default is “text”.
add_eos (bool) – Whether to add an EOS token to the end of the sequence. Default is True.
filter_fn (Optional[Callable]) – Callable used to filter the dataset prior to any pre-processing. See the Hugging Face docs for more details.
**load_dataset_kwargs (Dict[str, Any]) – Additional keyword arguments to pass to load_dataset. See Hugging Face’s API ref for more details.
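
To make the roles of column, add_eos, and filter_fn concrete, here is a minimal, self-contained sketch of the kind of processing these parameters describe. It is not torchtune's implementation: ToyTokenizer, its encode signature, its token IDs, and the prepare_samples helper are all hypothetical stand-ins invented for illustration, and the rows are in-memory dicts rather than a dataset loaded with load_dataset.

```python
from typing import Any, Callable, Dict, List, Optional


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for a real model tokenizer."""

    bos_id = 1  # assumed BOS id, for illustration only
    eos_id = 2  # assumed EOS id, for illustration only

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str, add_bos: bool = True, add_eos: bool = True) -> List[int]:
        # Assign each new word the next free id, starting at 3.
        ids = [self.vocab.setdefault(word, len(self.vocab) + 3) for word in text.split()]
        if add_bos:
            ids = [self.bos_id] + ids
        if add_eos:
            ids = ids + [self.eos_id]
        return ids


def prepare_samples(
    rows: List[Dict[str, Any]],
    tokenizer: ToyTokenizer,
    column: str = "text",
    add_eos: bool = True,
    filter_fn: Optional[Callable[[Dict[str, Any]], bool]] = None,
) -> List[List[int]]:
    """Hypothetical helper: filter rows first, then tokenize the chosen column."""
    if filter_fn is not None:
        # Filtering happens before any tokenization, as the parameter docs describe.
        rows = [row for row in rows if filter_fn(row)]
    return [tokenizer.encode(row[column], add_bos=True, add_eos=add_eos) for row in rows]


rows = [{"text": "hello world"}, {"text": ""}, {"text": "hello again"}]
tokenizer = ToyTokenizer()

# Drop empty rows, then tokenize with an EOS token appended.
with_eos = prepare_samples(rows, tokenizer, filter_fn=lambda row: len(row["text"]) > 0)

# Same rows, but without the trailing EOS token.
without_eos = prepare_samples(
    rows, tokenizer, add_eos=False, filter_fn=lambda row: len(row["text"]) > 0
)
```

In this sketch the empty row is removed by filter_fn before tokenization, and add_eos only controls whether the final token of each sequence is the (assumed) EOS id. A real ModelTokenizer will produce different IDs and may handle special tokens differently.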
