Chat Datasets¶
Chat datasets involve multi-turn conversations (multiple back-and-forths) between user and assistant.
[
    {"role": "user", "content": "What is the answer to the ultimate question of life?"},
    {"role": "assistant", "content": "The answer is 42."},
    {"role": "user", "content": "That's ridiculous"},
    {"role": "assistant", "content": "Oh I know."},
]
This is more structured than the freeform text that models are typically pre-trained on, where they learn to simply predict the next token rather than respond accurately to the user.
The primary entry point for fine-tuning with chat datasets in torchtune is the chat_dataset()
builder. This lets you specify a local or Hugging Face dataset that follows the chat data format
directly from the config and train your LLM on it.
Example chat dataset¶
# data/my_data.json
[
    {
        "conversations": [
            {
                "from": "human",
                "value": "What is the answer to life?"
            },
            {
                "from": "gpt",
                "value": "The answer is 42."
            },
            {
                "from": "human",
                "value": "That's ridiculous"
            },
            {
                "from": "gpt",
                "value": "Oh I know."
            }
        ]
    }
]
from torchtune.models.mistral import mistral_tokenizer
from torchtune.datasets import chat_dataset

m_tokenizer = mistral_tokenizer(
    path="/tmp/Mistral-7B-v0.1/tokenizer.model",
    prompt_template="torchtune.models.mistral.MistralChatTemplate",
    max_seq_len=8192,
)
ds = chat_dataset(
    tokenizer=m_tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
    conversation_column="conversations",
    conversation_style="sharegpt",
    # By default, user prompts are ignored in the loss. Set to True to include them.
    train_on_input=True,
    new_system_prompt=None,
)
tokenized_dict = ds[0]
tokens, labels = tokenized_dict["tokens"], tokenized_dict["labels"]
print(m_tokenizer.decode(tokens))
# [INST] What is the answer to life? [/INST] The answer is 42. [INST] That's ridiculous [/INST] Oh I know.
print(labels)
# [1, 733, 16289, 28793, 1824, 349, 272, 4372, ...]
# In config
tokenizer:
  _component_: torchtune.models.mistral.mistral_tokenizer
  path: /tmp/Mistral-7B-v0.1/tokenizer.model
  prompt_template: torchtune.models.mistral.MistralChatTemplate
  max_seq_len: 8192

dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  data_files: data/my_data.json
  split: train
  conversation_column: conversations
  conversation_style: sharegpt
  train_on_input: True
  new_system_prompt: null
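The example above sets train_on_input=True, so every token receives a label. As a minimal sketch of the default behavior (train_on_input=False), the snippet below rebuilds the same dataset and inspects the labels: positions belonging to user prompts are masked with the cross-entropy ignore index (-100 in torchtune), so only assistant tokens contribute to the loss.

ds_masked = chat_dataset(
    tokenizer=m_tokenizer,
    source="json",
    data_files="data/my_data.json",
    split="train",
    conversation_column="conversations",
    conversation_style="sharegpt",
    train_on_input=False,  # default: exclude user-prompt tokens from the loss
)
sample = ds_masked[0]
# User-prompt positions appear as -100; assistant positions keep their token ids.
print(sample["labels"][:10])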
Chat dataset format¶
Chat datasets typically have a single column named “conversations” or “messages” that contains a list of messages on a single topic per sample. The list of messages could include a system prompt, multiple turns between user and assistant, and tool calls/returns.
| conversations                                                                                                      |
|---------------------------------------------------------------------------------------------------------------------|
| [{"role": "user", "content": "What day is today?"}, {"role": "assistant", "content": "It is Tuesday."}]              |
| [{"role": "user", "content": "What about tomorrow?"}, {"role": "assistant", "content": "Tomorrow is Wednesday."}]    |
As an example, you can see the schema of the SlimOrca dataset.
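For instance, here is a minimal sketch (assuming the Hugging Face datasets package is installed) of peeking at the SlimOrca-Dedup schema before pointing chat_dataset() at it:

from datasets import load_dataset

# Load the raw (untokenized) dataset just to inspect its columns
raw = load_dataset("Open-Orca/SlimOrca-Dedup", split="train")
print(raw.column_names)            # expected to include 'conversations'
print(raw[0]["conversations"][0])  # first message of the first sample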
Loading chat datasets from Hugging Face¶
You need to pass in the dataset repo name to source, select one of the conversation styles in conversation_style, and specify the conversation_column. For most Hugging Face datasets, you will also need to specify the split.
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="Open-Orca/SlimOrca-Dedup",
    conversation_column="conversations",
    conversation_style="sharegpt",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: Open-Orca/SlimOrca-Dedup
  conversation_column: conversations
  conversation_style: sharegpt
  split: train
Loading local and remote chat datasets¶
To load a local or remote dataset with conversational data via HTTPS, you additionally need to specify the data_files and split arguments. See Hugging Face's load_dataset documentation for more details on loading local or remote files.
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="conversations",
    conversation_style="sharegpt",
    data_files="data/my_data.json",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: conversations
  conversation_style: sharegpt
  data_files: data/my_data.json
  split: train
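The data_files argument also accepts a URL, so the same builder works for files hosted remotely. A hedged sketch (the URL below is a placeholder, not a real dataset location):

ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="conversations",
    conversation_style="sharegpt",
    # Placeholder URL for illustration only
    data_files="https://example.com/my_data.json",
    split="train",
)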
Specifying conversation style¶
The structure of conversations in raw datasets can vary widely, with different role names and different field names for the message content. However, a few standardized formats are common across many datasets, and we have built-in converters that transform these standardized formats into a list of torchtune Messages following this format:
[
    {
        "role": "system" | "user" | "assistant" | "ipython",
        "content": <message>,
    },
    ...
]
"openai"
¶
The associated message transform is OpenAIToMessages. The expected format is:
{
    "messages": [
        {
            "role": "system" | "user" | "assistant",
            "content": <message>,
        },
        ...
    ]
}
You can specify conversation_style=openai in code or config:
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="conversations",
    conversation_style="openai",
    data_files="data/my_data.json",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: conversations
  conversation_style: openai
  data_files: data/my_data.json
  split: train
If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform.
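As a rough sketch only (the raw keys "chat", "speaker", and "text" below are hypothetical, not from any real dataset), a custom message transform is a callable that maps one raw sample to a list of torchtune Message objects:

from torchtune.data import Message

class MyMessageTransform:
    # Hypothetical transform for samples shaped like
    # {"chat": [{"speaker": "human", "text": "..."}, ...]}
    def __call__(self, sample):
        role_map = {"human": "user", "bot": "assistant"}
        messages = []
        for turn in sample["chat"]:
            role = role_map[turn["speaker"]]
            messages.append(
                Message(
                    role=role,
                    content=turn["text"],
                    # Mask non-assistant turns so they are excluded from the loss
                    masked=(role != "assistant"),
                )
            )
        return {"messages": messages}

A transform like this can then stand in for the built-in ShareGPT/OpenAI converters when your dataset's schema does not match any supported style.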
Renaming columns¶
To specify the column that contains your conversation data, use conversation_column.
# data/my_data.json
[
    {
        "dialogue": [
            {
                "from": "human",
                "value": "What is the answer to life?"
            },
            {
                "from": "gpt",
                "value": "The answer is 42."
            },
            {
                "from": "human",
                "value": "That's ridiculous"
            },
            {
                "from": "gpt",
                "value": "Oh I know."
            }
        ]
    }
]
from torchtune.models.gemma import gemma_tokenizer
from torchtune.datasets import chat_dataset

g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
ds = chat_dataset(
    tokenizer=g_tokenizer,
    source="json",
    conversation_column="dialogue",
    conversation_style="sharegpt",
    data_files="data/my_data.json",
    split="train",
)
# Tokenizer is passed into the dataset in the recipe
dataset:
  _component_: torchtune.datasets.chat_dataset
  source: json
  conversation_column: dialogue
  conversation_style: sharegpt
  data_files: data/my_data.json
  split: train
Chat templates¶
Chat templates are defined the same way as instruct templates in instruct_dataset(). See Instruct templates for more info.