Instruct Datasets¶
Instruction tuning involves training an LLM to perform specific task(s). This typically takes the form of a user command or prompt plus the assistant's response, along with an optional system prompt that describes the task at hand. This is more structured than the freeform text that models are typically pre-trained on, where they simply learn to predict the next token rather than complete a given task.
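For instance, a single sample in an instruct dataset might look like the following (an illustrative structure; exact field names vary by dataset and are not a schema required by torchtune):

# An illustrative instruct-style sample; field names vary by dataset
sample = {
    "system": "You are an AI assistant.",  # optional system prompt describing the task
    "user": "Correct this to standard English: This are a cat",
    "assistant": "This is a cat.",
}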
The primary entry point for fine-tuning with instruct datasets in torchtune is the instruct_dataset() builder. This lets you specify a local or Hugging Face dataset that follows the instruct data format directly from the config and train your LLM on it.
Example instruct dataset¶
Here is an example of an instruct dataset to fine-tune for a grammar correction task.
head data/my_data.csv
# incorrect,correct
# This are a cat,This is a cat.
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import instruct_dataset
g_tokenizer = gemma_tokenizer(
    path="/tmp/gemma-7b/tokenizer.model",
    prompt_template="torchtune.data.GrammarErrorCorrectionTemplate",
    max_seq_len=8192,
)
ds = instruct_dataset(
    tokenizer=g_tokenizer,
    source="csv",
    data_files="data/my_data.csv",
    split="train",
    # By default, the user prompt is ignored in the loss. Set to True to include it.
    train_on_input=True,
    # Prepend a system message to every sample
    new_system_prompt="You are an AI assistant. ",
    # Use the column names in our dataset instead of the defaults
    column_map={"input": "incorrect", "output": "correct"},
)
tokenized_dict = ds[0]
tokens, labels = tokenized_dict["tokens"], tokenized_dict["labels"]
print(g_tokenizer.decode(tokens))
# You are an AI assistant. Correct this to standard English: This are a cat\n---\nCorrected: This is a cat.
print(labels) # System message is masked out, but not user message
# [-100, -100, -100, -100, -100, -100, 27957, 736, 577, ...]
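To see exactly which tokens contribute to the loss, you can filter out the masked positions. A minimal sketch, assuming the standard -100 ignore index shown above:

# Keep only the positions that are not masked out of the loss
trainable_tokens = [t for t, l in zip(tokens, labels) if l != -100]
# Should decode to roughly everything except the masked system message
print(g_tokenizer.decode(trainable_tokens))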
# In config
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/gemma-7b/tokenizer.model
  prompt_template: torchtune.data.GrammarErrorCorrectionTemplate
  max_seq_len: 8192

dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: csv
  data_files: data/my_data.csv
  split: train
  train_on_input: True
  new_system_prompt: You are an AI assistant.
  column_map:
    input: incorrect
    output: correct
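For reference, a recipe typically builds these components from the config with torchtune's config utilities. A minimal sketch, assuming the config above is saved as my_config.yaml (a hypothetical path); actual recipe code may differ:

# A minimal sketch of how a recipe instantiates the config; actual recipes differ
from omegaconf import OmegaConf
from torchtune import config

cfg = OmegaConf.load("my_config.yaml")  # hypothetical config path
g_tokenizer = config.instantiate(cfg.tokenizer)
ds = config.instantiate(cfg.dataset, tokenizer=g_tokenizer)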
Instruct dataset format¶
Instruct datasets are expected to follow an input-output format, where the user prompt is in one column and the model's response is in another.
| input | output |
|-----------------|------------------|
| "user prompt" | "model response" |
As an example, you can see the schema of the C4 200M dataset.
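If you are preparing your own data, a file that already uses the default column names needs no remapping. A minimal sketch that writes a CSV in this schema (the rows are made-up examples):

# Write a local CSV that already uses the default "input"/"output" columns
import csv

rows = [
    {"input": "This are a cat", "output": "This is a cat."},
    {"input": "me and him goes to store", "output": "He and I go to the store."},
]
with open("data/my_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output"])
    writer.writeheader()
    writer.writerows(rows)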
Loading instruct datasets from Hugging Face¶
You simply need to pass the dataset repo name to source, which is then passed into Hugging Face's load_dataset. For most datasets, you will also need to specify the split.
# In code
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import instruct_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = instruct_dataset(
    tokenizer=g_tokenizer,
    source="liweili/c4_200m",
    split="train",
)
# In config
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/gemma-7b/tokenizer.model

# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: liweili/c4_200m
  split: train
This will use the default column names “input” and “output”. To change the column names, use the column_map argument (see Renaming columns).
Loading local and remote instruct datasets¶
To load a local dataset, or a remote one via HTTPS, that follows the instruct format, you need to specify the source, data_files, and split arguments. See Hugging Face's load_dataset documentation for more details on loading local or remote files.
# In code
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import instruct_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = instruct_dataset(
    tokenizer=g_tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
)
# In config
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/gemma-7b/tokenizer.model

# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: data/my_data.json
  split: train
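data_files can also point at a file served over HTTPS. A minimal sketch (the URL below is a placeholder, not a real dataset):

# data_files may also be an HTTPS URL; this one is a placeholder
ds = instruct_dataset(
    tokenizer=g_tokenizer,
    source="json",
    data_files="https://example.com/my_data.json",
    split="train",
)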
Renaming columns¶
You can remap the default column names to the column names in your dataset by specifying column_map as {"<default column>": "<column in your dataset>"}. The default column names are detailed in each of the dataset builders (see instruct_dataset() and chat_dataset() as examples).
For example, if the default column names are “input” and “output” and you need to change them to something else, such as “prompt” and “response”, then column_map = {"input": "prompt", "output": "response"}.
# data/my_data.json
[
    {"prompt": "hello world", "response": "bye world"},
    {"prompt": "are you a robot", "response": "no, I am an AI assistant"},
    ...
]
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import instruct_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = instruct_dataset(
    tokenizer=g_tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
    column_map={"input": "prompt", "output": "response"},
)
# In config
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: data/my_data.json
  split: train
  column_map:
    input: prompt
    output: response
Instruct templates¶
Typically for instruct datasets, you will want to add a PromptTemplate to provide task-relevant information. For example, for a grammar correction task, we may want to use a prompt template like GrammarErrorCorrectionTemplate to structure each of our samples. Prompt templates are passed into the tokenizer and are automatically applied to the dataset you are fine-tuning on. See Using prompt templates for more details.
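As in the grammar correction example at the top of this page, the template is passed to the tokenizer rather than to the dataset builder:

# The prompt template is configured on the tokenizer, not the dataset
from torchtune.models.gemma import gemma_tokenizer

g_tokenizer = gemma_tokenizer(
    path="/tmp/gemma-7b/tokenizer.model",
    prompt_template="torchtune.data.GrammarErrorCorrectionTemplate",
)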