text_completion_dataset

torchtune.datasets.text_completion_dataset(tokenizer: ModelTokenizer, source: str, column: Optional[str] = None, max_seq_len: Optional[int] = None, packed: bool = False, **load_dataset_kwargs: Dict[str, Any]) → TextCompletionDataset

Build a configurable dataset from a freeform, unstructured text corpus, similar to the datasets used in pre-training. Use this builder, rather than TextCompletionDataset directly, to configure a custom text dataset from a yaml config; it is designed to be config friendly.

Parameters:

tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
source (str) – path string of dataset, anything supported by Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path)
column (Optional[str]) – name of column in the sample that contains the text data. This is typically required for Hugging Face datasets or tabular data, but can be omitted for local datasets. Default is None.
max_seq_len (Optional[int]) – Maximum number of tokens in the returned input and label token id lists. Default is None, which disables truncation. We recommend setting this to the highest value that fits in memory and is supported by the model. For example, llama2-7B supports a sequence length of up to 4096.
packed (bool) – Whether or not to pack the dataset to max_seq_len prior to training. Default is False.
**load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset.
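
To make the roles of max_seq_len and packed more concrete, the following is a minimal conceptual sketch in plain Python. It is not torchtune's implementation: the helper names truncate and pack are hypothetical, the token ids are made up, and the real PackedDataset may handle padding, sample boundaries, and masks differently. It only illustrates the idea of capping each sample at max_seq_len tokens and, when packing, concatenating samples so that each pack holds at most max_seq_len tokens.

```python
from typing import List, Optional


def truncate(tokens: List[int], max_seq_len: Optional[int]) -> List[int]:
    """Hypothetical helper: keep at most max_seq_len tokens (None disables truncation)."""
    if max_seq_len is None:
        return list(tokens)
    return list(tokens[:max_seq_len])


def pack(samples: List[List[int]], max_seq_len: int) -> List[List[int]]:
    """Hypothetical helper: greedily concatenate samples into packs of at most max_seq_len tokens."""
    packs: List[List[int]] = []
    current: List[int] = []
    for sample in samples:
        sample = truncate(sample, max_seq_len)
        # Start a new pack when the next sample would overflow the current one.
        if current and len(current) + len(sample) > max_seq_len:
            packs.append(current)
            current = []
        current = current + sample
    if current:
        packs.append(current)
    return packs


# Made-up token id lists standing in for tokenized text samples.
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11], [12]]

truncated = [truncate(s, 5) for s in samples]
packed = pack(samples, 5)

print(truncated)
print(packed)
```

In this sketch the two short samples share one pack, while the long sample is truncated first and then fills a pack on its own.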

Examples

>>> from torchtune.datasets import text_completion_dataset
>>> dataset = text_completion_dataset(
...   tokenizer=tokenizer,
...   source="allenai/c4",
...   column="text",
...   max_seq_len=2096,
...   data_dir="realnewslike",
...   packed=False,
... )

This can also be accomplished via the yaml config:

dataset:
    _component_: torchtune.datasets.text_completion_dataset
    source: allenai/c4
    column: text
    max_seq_len: 2096
    data_dir: realnewslike
    packed: False
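
The config keys above correspond one-to-one to the keyword arguments of the Python call. The sketch below illustrates that mapping in plain Python under stated assumptions: stand_in_builder is a hypothetical stand-in with the same parameter names as the signature (minus the tokenizer, which does not appear in this yaml fragment), and the dict mimics the parsed config. It does not call torchtune or load any data; it only shows that a key that is not a named parameter, such as data_dir, ends up in **load_dataset_kwargs.

```python
from typing import Any, Dict, Optional


def stand_in_builder(
    source: str,
    column: Optional[str] = None,
    max_seq_len: Optional[int] = None,
    packed: bool = False,
    **load_dataset_kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical stand-in: report how the arguments were bound."""
    return {
        "source": source,
        "column": column,
        "max_seq_len": max_seq_len,
        "packed": packed,
        "load_dataset_kwargs": load_dataset_kwargs,
    }


# The yaml fragment, written out as the dict a yaml parser would produce.
config = {
    "_component_": "torchtune.datasets.text_completion_dataset",
    "source": "allenai/c4",
    "column": "text",
    "max_seq_len": 2096,
    "data_dir": "realnewslike",
    "packed": False,
}

# Every key except _component_ is passed through as a keyword argument.
kwargs = {key: value for key, value in config.items() if key != "_component_"}
bound = stand_in_builder(**kwargs)

print(bound["load_dataset_kwargs"])
```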

Returns:

The configured TextCompletionDataset, or PackedDataset if packed=True.

Return type:

TextCompletionDataset or PackedDataset
