TextCompletionDataset
- class torchtune.datasets.TextCompletionDataset(tokenizer: ModelTokenizer, source: str, column: str = 'text', max_seq_len: Optional[int] = None, **load_dataset_kwargs: Dict[str, Any])
Freeform dataset for any unstructured text corpus. Quickly load any dataset from Hugging Face or local disk and tokenize it for your model.
- Parameters:
tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
source (str) – Path string of the dataset; anything supported by Hugging Face’s load_dataset (https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset.path).
column (str) – Name of the column in the sample that contains the text data. This is typically required for Hugging Face datasets or tabular data. For local datasets with a single column, use the default “text”, which is the name Hugging Face datasets assigns when the data is loaded into memory. Default is “text”.
max_seq_len (Optional[int]) – Maximum number of tokens in the returned input and label token id lists. Default is None, which disables truncation. We recommend setting this to the largest value that fits in memory and is supported by the model. For example, llama2-7B supports a sequence length of up to 4096.
**load_dataset_kwargs (Dict[str, Any]) – Additional keyword arguments to pass to load_dataset.
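To make the roles of column and max_seq_len concrete, here is a minimal, library-free sketch of the behavior the parameter descriptions imply. It is not torchtune's implementation: ToyTokenizer and prepare_sample are hypothetical names introduced only for illustration, and the assumption that labels are a plain copy of the input tokens is ours.

```python
from typing import Any, Dict, List, Mapping, Optional


class ToyTokenizer:
    """Hypothetical stand-in for a model tokenizer: one id per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        # Assign ids in order of first appearance; repeated words reuse their id.
        return [self.vocab.setdefault(word, len(self.vocab)) for word in text.split()]


def prepare_sample(
    sample: Mapping[str, Any],
    tokenizer: ToyTokenizer,
    column: str = "text",
    max_seq_len: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Hypothetical helper: pick the text column, tokenize it, optionally truncate."""
    tokens = tokenizer.encode(sample[column])
    if max_seq_len is not None:
        # Keep at most max_seq_len tokens; None leaves the sequence untouched.
        tokens = tokens[:max_seq_len]
    # Assumption: for text completion, labels mirror the input tokens.
    return {"tokens": tokens, "labels": list(tokens)}


row = {"text": "the quick brown fox jumps", "id": 7}
tok = ToyTokenizer()
full = prepare_sample(row, tok)                  # no truncation
short = prepare_sample(row, tok, max_seq_len=3)  # truncated to 3 tokens
```

With max_seq_len left at None the whole row is tokenized; with max_seq_len=3 both returned lists are cut to three ids. Passing a different column name selects another field of the row in the same way.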