text_completion_dataset
- torchtune.datasets.text_completion_dataset(tokenizer: ModelTokenizer, source: str, column: str = 'text', add_eos: bool = True, packed: bool = False, split_across_pack: bool = True, split: str = 'train', filter_fn: Optional[Callable] = None, **load_dataset_kwargs: Dict[str, Any]) → Union[TextCompletionDataset, PackedDataset]
Build a configurable dataset from a freeform, unstructured text corpus similar to datasets used in pre-training. This method should be used to configure a custom text dataset from the yaml config instead of using TextCompletionDataset directly, as it is made to be config friendly.
- Parameters:
- tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
- source (str) – path to dataset repository on Hugging Face. For local datasets, define source as the data file type (e.g. “json”, “csv”, “text”) and pass in the filepath in data_files; a local-file sketch is shown under Examples below. See Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path) for more details.
- column (str) – name of the column in the sample that contains the text data. This is typically required for Hugging Face datasets or tabular data. For local datasets with a single column (e.g. unstructured txt files), use the default “text”, which is what Hugging Face datasets use when the data is loaded into memory. Default is “text”.
- add_eos (bool) – Whether to add an EOS token to the end of the sequence. Default is True.
- packed (bool) – Whether or not to pack the dataset to max_seq_len prior to training; a packed sketch is shown under Examples below. Default is False.
- split_across_pack (bool) – if the last sample in a pack does not fit in max_seq_len, split the sample into the next pack, or move it entirely to the beginning of the next pack. For pre-training, this is typically set to True for general text completion. For fine-tuning, it is typically set to False to avoid truncating sentences in instruct tuning. This argument is ignored if packed=False. Default is True.
- split (str) – split argument for datasets.load_dataset. You can use this argument to load a subset of a given split, e.g. split="train[:10%]". Default is “train”.
- filter_fn (Optional[Callable]) – callable used to filter the dataset prior to any pre-processing. See the Hugging Face docs for more details.
- **load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset.
Examples
>>> from torchtune.datasets import text_completion_dataset
>>> dataset = text_completion_dataset(
...     tokenizer=tokenizer,
...     source="allenai/c4",
...     column="text",
...     data_dir="realnewslike",
...     packed=False,
...     split="train",
... )
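The same pattern works for a local corpus by passing the file type as source and the filepath in data_files, as described in the Parameters above. A minimal sketch, assuming an unstructured plain-text file at the hypothetical path my_data.txt:

>>> from torchtune.datasets import text_completion_dataset
>>> # "text" tells Hugging Face's load_dataset to read the file as raw text;
>>> # my_data.txt is a placeholder path for your own corpus.
>>> local_dataset = text_completion_dataset(
...     tokenizer=tokenizer,
...     source="text",
...     data_files="my_data.txt",
...     split="train",
... )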
This can also be accomplished via the yaml config:
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: allenai/c4
  column: text
  data_dir: realnewslike
  packed: False
  split: train
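Packing requires the tokenizer’s max_seq_len to be set (see Raises below). A minimal sketch of a packed variant, assuming a tokenizer built with max_seq_len; the llama3_tokenizer builder and tokenizer path are illustrative and should be swapped for your own setup:

>>> from torchtune.models.llama3 import llama3_tokenizer
>>> from torchtune.datasets import text_completion_dataset
>>> # max_seq_len must be set on the tokenizer for packing to work;
>>> # /tmp/tokenizer.model is a placeholder path.
>>> tokenizer = llama3_tokenizer("/tmp/tokenizer.model", max_seq_len=2048)
>>> dataset = text_completion_dataset(
...     tokenizer=tokenizer,
...     source="allenai/c4",
...     column="text",
...     data_dir="realnewslike",
...     packed=True,
...     split="train",
... )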
- Returns:
  the configured TextCompletionDataset, or PackedDataset if packed=True
- Return type:
  Union[TextCompletionDataset, PackedDataset]
- Raises:
  ValueError – If packed=True and tokenizer.max_seq_len is not set.
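In a yaml config, this requirement is typically satisfied by setting max_seq_len on the tokenizer component. A hedged sketch; the llama3_tokenizer component and path below are placeholders for your own setup, not part of this API:

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/tokenizer.model
  max_seq_len: 2048

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: allenai/c4
  column: text
  packed: True
  split: train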