grammar_dataset¶
- torchtune.datasets.grammar_dataset(tokenizer: ModelTokenizer, *, source: str = 'liweili/c4_200m', train_on_input: bool = False, packed: bool = False) InstructDataset [source]¶
Support for grammar correction datasets and their variants from Hugging Face Datasets. Here is an example of a grammar correction dataset.
The prompt template mirrors what is used in the llama_recipes codebase, where input and output are fields from the dataset. Masking of the prompt during training is controlled by the train_on_input flag, which is set to False by default.
- If train_on_input is True, the prompt is used during training and contributes to the loss.
- If train_on_input is False, the prompt is masked out (tokens replaced with -100).
- Parameters:
tokenizer (ModelTokenizer) – Tokenizer used by the model that implements the tokenize_messages method.
source (str) – path string of dataset, anything supported by Hugging Face’s load_dataset. Default is liweili/c4_200m.
train_on_input (bool) – Whether the model is trained on the prompt or not. Default is False.
packed (bool) – Whether or not to pack the dataset to max_seq_len prior to training. Default is False.
- Returns:
dataset configured with source data and template
- Return type:
InstructDataset
Example
>>> grammar_ds = grammar_dataset(tokenizer=tokenizer)
>>> for batch in DataLoader(grammar_ds, batch_size=8):
>>>     print(f"Batch size: {len(batch)}")
Batch size: 8
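To illustrate the train_on_input behavior described above, here is a minimal standalone sketch of prompt masking. It does not use torchtune itself; the function name and token values are illustrative, but the -100 ignore index matches the masking convention stated in this docstring.

```python
IGNORE_INDEX = -100  # loss-ignore index used for masked prompt tokens


def mask_prompt(prompt_tokens, response_tokens, train_on_input=False):
    """Build (input_ids, labels); mask the prompt when train_on_input is False.

    Illustrative helper, not a torchtune API.
    """
    input_ids = list(prompt_tokens) + list(response_tokens)
    if train_on_input:
        # Prompt contributes to the loss: labels are the full sequence.
        labels = list(input_ids)
    else:
        # Prompt is masked out: its label positions are set to -100,
        # which cross-entropy losses ignore.
        labels = [IGNORE_INDEX] * len(prompt_tokens) + list(response_tokens)
    return input_ids, labels


ids, labels = mask_prompt([1, 2, 3], [4, 5])
print(labels)  # prompt positions masked: [-100, -100, -100, 4, 5]
```

With train_on_input=True the labels would instead equal the full input sequence, so the prompt tokens also contribute to the loss.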
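The packed flag concatenates tokenized samples up to max_seq_len before training. A minimal sketch of that idea, assuming a simple greedy packing strategy (the function name is illustrative, not torchtune's actual implementation):

```python
def pack_sequences(sequences, max_seq_len):
    """Greedily concatenate tokenized samples into packs of at most max_seq_len tokens.

    Illustrative helper, not a torchtune API.
    """
    packs, current = [], []
    for seq in sequences:
        seq = seq[:max_seq_len]  # truncate any single oversized sample
        if len(current) + len(seq) > max_seq_len:
            # Current pack is full; start a new one.
            packs.append(current)
            current = []
        current = current + seq
    if current:
        packs.append(current)
    return packs


# Three samples packed into sequences of at most 4 tokens.
print(pack_sequences([[1, 2], [3, 4, 5], [6]], max_seq_len=4))
```

Packing reduces padding waste when training on many short samples, at the cost of mixing unrelated samples in one sequence.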