Chat Datasets¶
Chat datasets involve multi-turn conversations (multiple back-and-forths) between user and assistant.
[
    {"role": "user", "content": "What is the answer to the ultimate question of life?"},
    {"role": "assistant", "content": "The answer is 42."},
    {"role": "user", "content": "That's ridiculous"},
    {"role": "assistant", "content": "Oh I know."},
]
This is more structured than the freeform text that models are typically pre-trained on, where they learn to simply predict the next token rather than respond accurately to the user.
The primary entry point for fine-tuning with chat datasets in torchtune is the chat_dataset()
builder. This lets you specify a local or Hugging Face dataset that follows the chat data format
directly from the config and train your LLM on it.
Example chat dataset¶
# data/my_data.json
[
    {
        "conversations": [
            {
                "from": "human",
                "value": "What is the answer to life?"
            },
            {
                "from": "gpt",
                "value": "The answer is 42."
            },
            {
                "from": "human",
                "value": "That's ridiculous"
            },
            {
                "from": "gpt",
                "value": "Oh I know."
            }
        ]
    }
]
from torchtune.models.mistral import mistral_tokenizer
from torchtune.datasets import chat_dataset

m_tokenizer = mistral_tokenizer(
    path="/tmp/Mistral-7B-v0.1/tokenizer.model",
    prompt_template="torchtune.models.mistral.MistralChatTemplate",
    max_seq_len=8192,
)
ds = chat_dataset(
    tokenizer=m_tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
    conversation_column="conversations",
    conversation_style="sharegpt",
    # By default, user prompts are ignored in the loss. Set to True to include them.
    train_on_input=True,
    new_system_prompt=None,
)
tokenized_dict = ds[0]
tokens, labels = tokenized_dict["tokens"], tokenized_dict["labels"]
print(m_tokenizer.decode(tokens))
# [INST] What is the answer to life? [/INST] The answer is 42. [INST] That's ridiculous [/INST] Oh I know.
print(labels)
# [1, 733, 16289, 28793, 1824, 349, 272, 4372, ...]
# In config
tokenizer:
  _component_: torchtune.models.mistral.mistral_tokenizer
  path: /tmp/Mistral-7B-v0.1/tokenizer.model
  prompt_template: torchtune.models.mistral.MistralChatTemplate
  max_seq_len: 8192

dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: data/my_data.json
  split: train
  conversation_column: conversations
  conversation_style: sharegpt
  train_on_input: True
  new_system_prompt: null
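The example above sets train_on_input=True, so every token receives a label. As a minimal sketch of the default behavior (train_on_input=False), the snippet below rebuilds the same dataset and inspects the labels: positions belonging to user prompts are masked with the cross-entropy ignore index (-100 in torchtune), so only assistant tokens contribute to the loss.

ds_masked = chat_dataset(
    tokenizer=m_tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
    conversation_column="conversations",
    conversation_style="sharegpt",
    train_on_input=False,  # default: exclude user-prompt tokens from the loss
)
sample = ds_masked[0]
# User-prompt positions appear as -100; assistant positions keep their token ids.
print(sample["labels"][:10])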
Chat dataset format¶
Chat datasets typically have a single column named “conversations” or “messages” that contains a list of messages on a single topic per sample. The list of messages could include a system prompt, multiple turns between user and assistant, and tool calls/returns.
| conversations                                                                                                      |
|---------------------------------------------------------------------------------------------------------------------|
| [{"role": "user", "content": "What day is today?"}, {"role": "assistant", "content": "It is Tuesday."}]              |
| [{"role": "user", "content": "What about tomorrow?"}, {"role": "assistant", "content": "Tomorrow is Wednesday."}]    |
As an example, you can see the schema of the SlimOrca dataset.
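For instance, here is a minimal sketch (assuming the Hugging Face datasets package is installed) of peeking at the SlimOrca-Dedup schema before pointing chat_dataset() at it:

from datasets import load_dataset

# Load the raw (untokenized) dataset just to inspect its columns
raw = load_dataset("Open-Orca/SlimOrca-Dedup", split="train")
print(raw.column_names)            # expected to include 'conversations'
print(raw[0]["conversations"][0])  # first message of the first sample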
Loading chat datasets from Hugging Face¶
You need to pass in the dataset repo name to source, select one of the conversation styles in conversation_style, and specify the conversation_column. For most Hugging Face datasets, you will also need to specify the split.
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="Open-Orca/SlimOrca-Dedup",
    conversation_column="conversations",
    conversation_style="sharegpt",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: Open-Orca/SlimOrca-Dedup
  conversation_column: conversations
  conversation_style: sharegpt
  split: train
Loading local and remote chat datasets¶
To load a local or remote dataset with conversational data via HTTPS, you additionally need to specify the data_files and split arguments. See Hugging Face's load_dataset documentation for more details on loading local or remote files.
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="conversations",
    conversation_style="sharegpt",
    data_files="data/my_data.json",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: conversations
  conversation_style: sharegpt
  data_files: data/my_data.json
  split: train
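The data_files argument also accepts a URL, so the same builder works for files hosted remotely. A hedged sketch (the URL below is a placeholder, not a real dataset location):

ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="conversations",
    conversation_style="sharegpt",
    # Placeholder URL for illustration only
    data_files="https://example.com/my_data.json",
    split="train",
)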
Specifying conversation style¶
The structure of conversations in raw datasets can vary widely, with different role names and different field names for the message content. However, a few standardized formats are common across many datasets, and we have built-in converters that transform these standardized formats into a list of torchtune Messages following this format:
[
    {
        "role": "system" | "user" | "assistant" | "ipython",
        "content": <message>,
    },
    ...
]
"openai"
¶
The associated message transform is OpenAIToMessages. The expected format is:
{
    "messages": [
        {
            "role": "system" | "user" | "assistant",
            "content": <message>,
        },
        ...
    ]
}
You can specify conversation_style=openai in code or config:
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="conversations",
    conversation_style="openai",
    data_files="data/my_data.json",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: conversations
  conversation_style: openai
  data_files: data/my_data.json
  split: train
If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform.
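As a rough sketch only (the raw keys "chat", "speaker", and "text" below are hypothetical, not from any real dataset), a custom message transform is a callable that maps one raw sample to a list of torchtune Message objects:

from torchtune.data import Message

class MyMessageTransform:
    # Hypothetical transform for samples shaped like
    # {"chat": [{"speaker": "human", "text": "..."}, ...]}
    def __call__(self, sample):
        role_map = {"human": "user", "bot": "assistant"}
        messages = []
        for turn in sample["chat"]:
            role = role_map[turn["speaker"]]
            messages.append(
                Message(
                    role=role,
                    content=turn["text"],
                    # Mask non-assistant turns so they are excluded from the loss
                    masked=(role != "assistant"),
                )
            )
        return {"messages": messages}

A transform like this can then stand in for the built-in ShareGPT/OpenAI converters when your dataset's schema does not match any supported style.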
Renaming columns¶
To specify the column that contains your conversation data, use conversation_column.
# data/my_data.json
[
    {
        "dialogue": [
            {
                "from": "human",
                "value": "What is the answer to life?"
            },
            {
                "from": "gpt",
                "value": "The answer is 42."
            },
            {
                "from": "human",
                "value": "That's ridiculous"
            },
            {
                "from": "gpt",
                "value": "Oh I know."
            }
        ]
    }
]
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="dialogue",
    conversation_style="sharegpt",
    data_files="data/my_data.json",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: dialogue
  conversation_style: sharegpt
  data_files: data/my_data.json
  split: train
Chat templates¶
Chat templates are defined the same way as instruct templates in instruct_dataset(). See Instruct templates for more info.