text_completion_dataset
- torchtune.datasets.text_completion_dataset(tokenizer: ModelTokenizer, source: str, column: str = 'text', add_eos: bool = True, packed: bool = False, split_across_pack: bool = True, split: str = 'train', filter_fn: Optional[Callable] = None, **load_dataset_kwargs: Dict[str, Any]) → Union[TextCompletionDataset, PackedDataset]
Build a configurable dataset from a freeform, unstructured text corpus similar to datasets used in pre-training. This method should be used to configure a custom text dataset from the yaml config instead of using TextCompletionDataset directly, as it is made to be config friendly.
- Parameters:
- tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
- source (str) – path to dataset repository on Hugging Face. For local datasets, define source as the data file type (e.g. “json”, “csv”, “text”) and pass in the filepath in data_files; a local-file sketch is shown under Examples below. See Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path) for more details.
- column (str) – name of the column in the sample that contains the text data. This is typically required for Hugging Face datasets or tabular data. For local datasets with a single column (e.g. unstructured txt files), use the default “text”, which is what Hugging Face datasets use when the data is loaded into memory. Default is “text”.
- add_eos (bool) – Whether to add an EOS token to the end of the sequence. Default is True.
- packed (bool) – Whether or not to pack the dataset to max_seq_len prior to training; a packed sketch is shown under Examples below. Default is False.
- split_across_pack (bool) – if the last sample in a pack does not fit in max_seq_len, split the sample into the next pack, or move it entirely to the beginning of the next pack. For pre-training, this is typically set to True for general text completion. For fine-tuning, it is typically set to False to avoid truncating sentences in instruct tuning. This argument is ignored if packed=False. Default is True.
- split (str) – split argument for datasets.load_dataset. You can use this argument to load a subset of a given split, e.g. split="train[:10%]". Default is “train”.
- filter_fn (Optional[Callable]) – callable used to filter the dataset prior to any pre-processing. See the Hugging Face docs for more details.
- **load_dataset_kwargs (Dict[str, Any]) – additional keyword arguments to pass to load_dataset.
Examples
>>> from torchtune.datasets import text_completion_dataset
>>> dataset = text_completion_dataset(
...     tokenizer=tokenizer,
...     source="allenai/c4",
...     column="text",
...     data_dir="realnewslike",
...     packed=False,
...     split="train",
... )
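The same pattern works for a local corpus by passing the file type as source and the filepath in data_files, as described in the Parameters above. A minimal sketch, assuming an unstructured plain-text file at the hypothetical path my_data.txt:

>>> from torchtune.datasets import text_completion_dataset
>>> # "text" tells Hugging Face's load_dataset to read the file as raw text;
>>> # my_data.txt is a placeholder path for your own corpus.
>>> local_dataset = text_completion_dataset(
...     tokenizer=tokenizer,
...     source="text",
...     data_files="my_data.txt",
...     split="train",
... )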
This can also be accomplished via the yaml config:
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: allenai/c4
  column: text
  data_dir: realnewslike
  packed: False
  split: train
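Packing requires the tokenizer’s max_seq_len to be set (see Raises below). A minimal sketch of a packed variant, assuming a tokenizer built with max_seq_len; the llama3_tokenizer builder and tokenizer path are illustrative and should be swapped for your own setup:

>>> from torchtune.models.llama3 import llama3_tokenizer
>>> from torchtune.datasets import text_completion_dataset
>>> # max_seq_len must be set on the tokenizer for packing to work;
>>> # /tmp/tokenizer.model is a placeholder path.
>>> tokenizer = llama3_tokenizer("/tmp/tokenizer.model", max_seq_len=2048)
>>> dataset = text_completion_dataset(
...     tokenizer=tokenizer,
...     source="allenai/c4",
...     column="text",
...     data_dir="realnewslike",
...     packed=True,
...     split="train",
... )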
- Returns:
  the configured TextCompletionDataset, or PackedDataset if packed=True
- Return type:
  Union[TextCompletionDataset, PackedDataset]
- Raises:
  ValueError – If packed=True and tokenizer.max_seq_len is not set.
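In a yaml config, this requirement is typically satisfied by setting max_seq_len on the tokenizer component. A hedged sketch; the llama3_tokenizer component and path below are placeholders for your own setup, not part of this API:

tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/tokenizer.model
  max_seq_len: 2048

dataset:
  _component_: torchtune.datasets.text_completion_dataset
  source: allenai/c4
  column: text
  packed: True
  split: train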