Instruct Datasets¶
Instruction tuning involves training an LLM to perform specific task(s). This typically takes the form of a user command or prompt plus the assistant's response, along with an optional system prompt that describes the task at hand. This is more structured than the freeform text that models are typically pre-trained on, where they simply learn to predict the next token rather than complete a given task.
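For instance, a single sample in an instruct dataset might look like the following (an illustrative structure; exact field names vary by dataset and are not a schema required by torchtune):

# An illustrative instruct-style sample; field names vary by dataset
sample = {
    "system": "You are an AI assistant.",  # optional system prompt describing the task
    "user": "Correct this to standard English: This are a cat",
    "assistant": "This is a cat.",
}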
The primary entry point for fine-tuning with instruct datasets in torchtune is the instruct_dataset() builder. This lets you specify a local or Hugging Face dataset that follows the instruct data format directly from the config and train your LLM on it.
Example instruct dataset¶
Here is an example of an instruct dataset to fine-tune for a grammar correction task.
head data/my_data.csv
# incorrect,correct
# This are a cat,This is a cat.
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import instruct_dataset
g_tokenizer = gemma_tokenizer(
    path="/tmp/gemma-7b/tokenizer.model",
    prompt_template="torchtune.data.GrammarErrorCorrectionTemplate",
    max_seq_len=8192,
)
ds = instruct_dataset(
    tokenizer=g_tokenizer,
    source="csv",
    data_files="data/my_data.csv",
    split="train",
    # By default, the user prompt is ignored in the loss. Set to True to include it.
    train_on_input=True,
    # Prepend a system message to every sample
    new_system_prompt="You are an AI assistant. ",
    # Use the column names in our dataset instead of the defaults
    column_map={"input": "incorrect", "output": "correct"},
)
tokenized_dict = ds[0]
tokens, labels = tokenized_dict["tokens"], tokenized_dict["labels"]
print(g_tokenizer.decode(tokens))
# You are an AI assistant. Correct this to standard English: This are a cat\n---\nCorrected: This is a cat.
print(labels) # System message is masked out, but not user message
# [-100, -100, -100, -100, -100, -100, 27957, 736, 577, ...]
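To see exactly which tokens contribute to the loss, you can filter out the masked positions. A minimal sketch, assuming the standard -100 ignore index shown above:

# Keep only the positions that are not masked out of the loss
trainable_tokens = [t for t, l in zip(tokens, labels) if l != -100]
# Should decode to roughly everything except the masked system message
print(g_tokenizer.decode(trainable_tokens))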
# In config
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/gemma-7b/tokenizer.model
  prompt_template: torchtune.data.GrammarErrorCorrectionTemplate
  max_seq_len: 8192

dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: csv
  data_files: data/my_data.csv
  split: train
  train_on_input: True
  new_system_prompt: You are an AI assistant.
  column_map:
    input: incorrect
    output: correct
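For reference, a recipe typically builds these components from the config with torchtune's config utilities. A minimal sketch, assuming the config above is saved as my_config.yaml (a hypothetical path); actual recipe code may differ:

# A minimal sketch of how a recipe instantiates the config; actual recipes differ
from omegaconf import OmegaConf
from torchtune import config

cfg = OmegaConf.load("my_config.yaml")  # hypothetical config path
g_tokenizer = config.instantiate(cfg.tokenizer)
ds = config.instantiate(cfg.dataset, tokenizer=g_tokenizer)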
Instruct dataset format¶
Instruct datasets are expected to follow an input-output format, where the user prompt is in one column and the model's response is in another.
| input | output |
|-----------------|------------------|
| "user prompt" | "model response" |
As an example, you can see the schema of the C4 200M dataset.
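If you are preparing your own data, a file that already uses the default column names needs no remapping. A minimal sketch that writes a CSV in this schema (the rows are made-up examples):

# Write a local CSV that already uses the default "input"/"output" columns
import csv

rows = [
    {"input": "This are a cat", "output": "This is a cat."},
    {"input": "me and him goes to store", "output": "He and I go to the store."},
]
with open("data/my_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output"])
    writer.writeheader()
    writer.writerows(rows)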
Loading instruct datasets from Hugging Face¶
You simply need to pass the dataset repo name to source, which is then passed into Hugging Face's load_dataset. For most datasets, you will also need to specify the split.
# In code
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import instruct_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = instruct_dataset(
    tokenizer=g_tokenizer,
    source="liweili/c4_200m",
    split="train",
)
# In config
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/gemma-7b/tokenizer.model

# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: liweili/c4_200m
  split: train
This will use the default column names “input” and “output”. To change the column names, use the column_map argument (see Renaming columns).
Loading local and remote instruct datasets¶
To load a local dataset, or a remote one via HTTPS, that follows the instruct format, you need to specify the source, data_files, and split arguments. See Hugging Face's load_dataset documentation for more details on loading local or remote files.
# In code
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import instruct_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = instruct_dataset(
    tokenizer=g_tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
)
# In config
tokenizer:
  _component_: torchtune.models.gemma.gemma_tokenizer
  path: /tmp/gemma-7b/tokenizer.model

# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: data/my_data.json
  split: train
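data_files can also point at a file served over HTTPS. A minimal sketch (the URL below is a placeholder, not a real dataset):

# data_files may also be an HTTPS URL; this one is a placeholder
ds = instruct_dataset(
    tokenizer=g_tokenizer,
    source="json",
    data_files="https://example.com/my_data.json",
    split="train",
)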
Renaming columns¶
You can remap the default column names to the column names in your dataset by specifying column_map as {"<default column>": "<column in your dataset>"}. The default column names are detailed in each of the dataset builders (see instruct_dataset() and chat_dataset() as examples).
For example, if the default column names are “input” and “output” and you need to change them to something else, such as “prompt” and “response”, then column_map = {"input": "prompt", "output": "response"}.
# data/my_data.json
[
    {"prompt": "hello world", "response": "bye world"},
    {"prompt": "are you a robot", "response": "no, I am an AI assistant"},
    ...
]
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import instruct_dataset
g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = instruct_dataset(
    tokenizer=g_tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
    column_map={"input": "prompt", "output": "response"},
)
# In config
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.instruct_dataset
  source: json
  data_files: data/my_data.json
  split: train
  column_map:
    input: prompt
    output: response
Instruct templates¶
Typically for instruct datasets, you will want to add a PromptTemplate to provide task-relevant information. For example, for a grammar correction task, we may want to use a prompt template like GrammarErrorCorrectionTemplate to structure each of our samples. Prompt templates are passed into the tokenizer and are automatically applied to the dataset you are fine-tuning on. See Using prompt templates for more details.
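As in the grammar correction example at the top of this page, the template is passed to the tokenizer rather than to the dataset builder:

# The prompt template is configured on the tokenizer, not the dataset
from torchtune.models.gemma import gemma_tokenizer

g_tokenizer = gemma_tokenizer(
    path="/tmp/gemma-7b/tokenizer.model",
    prompt_template="torchtune.data.GrammarErrorCorrectionTemplate",
)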