.. _chat_dataset_usage_label:

=============
Chat Datasets
=============

Chat datasets involve multi-turn conversations (multiple back-and-forths) between a user and an assistant, for example:

.. code-block:: python

    [
        {"role": "user", "content": "What is the answer to the ultimate question of life?"},
        {"role": "assistant", "content": "The answer is 42."},
        {"role": "user", "content": "That's ridiculous"},
        {"role": "assistant", "content": "Oh I know."},
    ]

This is more structured than the freeform text that models are typically pre-trained on, where they
learn to simply predict the next token rather than respond accurately to the user.

The primary entry point for fine-tuning with chat datasets in torchtune is the :func:`~torchtune.datasets.chat_dataset`
builder. This lets you specify a local or Hugging Face dataset that follows the chat data format
directly from the config and train your LLM on it.

.. _example_chat:

Example chat dataset
--------------------

.. code-block:: python

    # data/my_data.json
    [
        {
            "conversations": [
                {
                    "from": "human",
                    "value": "What is the answer to life?"
                },
                {
                    "from": "gpt",
                    "value": "The answer is 42."
                },
                {
                    "from": "human",
                    "value": "That's ridiculous"
                },
                {
                    "from": "gpt",
                    "value": "Oh I know."
                }
            ]
        }
    ]

.. code-block:: python

    from torchtune.models.mistral import mistral_tokenizer
    from torchtune.datasets import chat_dataset

    m_tokenizer = mistral_tokenizer(
        path="/tmp/Mistral-7B-v0.1/tokenizer.model",
        prompt_template="torchtune.models.mistral.MistralChatTemplate",
        max_seq_len=8192,
    )
    ds = chat_dataset(
        tokenizer=m_tokenizer,
        source="json",
        data_files="data/my_data.json",
        split="train",
        conversation_column="conversations",
        conversation_style="sharegpt",
        # By default, user prompt is ignored in loss. Set to True to include it
        train_on_input=True,
        new_system_prompt=None,
    )
    tokenized_dict = ds[0]
    tokens, labels = tokenized_dict["tokens"], tokenized_dict["labels"]
    print(m_tokenizer.decode(tokens))
    # [INST] What is the answer to life?  [/INST] The answer is 42. [INST] That's ridiculous  [/INST] Oh I know.
    print(labels)
    # [1, 733, 16289, 28793, 1824, 349, 272, 4372, ...]

.. code-block:: yaml

    # In config
    tokenizer:
      _component_: torchtune.models.mistral.mistral_tokenizer
      path: /tmp/Mistral-7B-v0.1/tokenizer.model
      prompt_template: torchtune.models.mistral.MistralChatTemplate
      max_seq_len: 8192

    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: json
      data_files: data/my_data.json
      split: train
      conversation_column: conversations
      conversation_style: sharegpt
      train_on_input: True
      new_system_prompt: null

Chat dataset format
-------------------

Chat datasets typically have a single column, named ``"conversations"`` or ``"messages"``, that contains a list of messages on a single topic
per sample. The list of messages could include a system prompt, multiple turns between user and assistant, and tool calls/returns.

.. code-block:: text

    |  conversations                                               |
    |--------------------------------------------------------------|
    | [{"role": "user", "content": "What day is today?"},          |
    |  {"role": "assistant", "content": "It is Tuesday."}]         |
    | [{"role": "user", "content": "What about tomorrow?"},        |
    |  {"role": "assistant", "content": "Tomorrow is Wednesday."}] |

As an example, you can see the schema of the `SlimOrca dataset <https://huggingface.co/datasets/Open-Orca/SlimOrca-Dedup>`_.
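
To see the schema for yourself, you can load a single sample with Hugging Face's ``datasets`` library
before configuring torchtune. A minimal sketch, assuming the ``datasets`` package is installed:

.. code-block:: python

    from datasets import load_dataset

    # Stream one sample to inspect the column layout without downloading the full dataset
    ds = load_dataset("Open-Orca/SlimOrca-Dedup", split="train", streaming=True)
    print(next(iter(ds))["conversations"])
    # [{'from': 'system', 'value': '...'}, {'from': 'human', 'value': '...'}, ...]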

Loading chat datasets from Hugging Face
---------------------------------------

You need to pass in the dataset repo name to ``source``, select one of the conversation styles in ``conversation_style``, and specify the ``conversation_column``.
For most HF datasets, you will also need to specify the ``split``.

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import chat_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = chat_dataset(
        tokenizer=g_tokenizer,
        source="Open-Orca/SlimOrca-Dedup",
        conversation_column="conversations",
        conversation_style="sharegpt",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: Open-Orca/SlimOrca-Dedup
      conversation_column: conversations
      conversation_style: sharegpt
      split: train


Loading local and remote chat datasets
--------------------------------------

To load a local dataset, or a remote dataset via HTTPS, that contains conversational data, you need to additionally specify the ``data_files`` and ``split``
arguments. See Hugging Face's ``load_dataset`` `documentation <https://huggingface.co/docs/datasets/main/en/loading#local-and-remote-files>`_
for more details on loading local or remote files.

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import chat_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = chat_dataset(
        tokenizer=g_tokenizer,
        source="json",
        conversation_column="conversations",
        conversation_style="sharegpt",
        data_files="data/my_data.json",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: json
      conversation_column: conversations
      conversation_style: sharegpt
      data_files: data/my_data.json
      split: train

Specifying conversation style
-----------------------------

The structure of conversations in raw datasets can vary widely, with different role names and different
field names for the message content. A few standardized formats are common across many datasets.
We provide built-in converters that map these standardized formats into a list of torchtune :class:`~torchtune.data.Message`
objects with the following structure:

.. code-block:: python

    [
        {
            "role": "system" | "user" | "assistant" | "ipython",
            "content": <message>,
        },
        ...
    ]
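
Each entry in this list corresponds to a :class:`~torchtune.data.Message` object. If you need to construct
messages manually, for example when writing a custom transform, a minimal sketch:

.. code-block:: python

    from torchtune.data import Message

    messages = [
        # masked=True excludes a message from the loss (the default is False)
        Message(role="user", content="What is the answer to life?", masked=True),
        Message(role="assistant", content="The answer is 42."),
    ]
    print(messages[0].role)  # user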

.. _sharegpt:

``"sharegpt"``
^^^^^^^^^^^^^^
The associated message transform is :class:`~torchtune.data.ShareGPTToMessages`. The expected format is:

.. code-block:: python

    {
        "conversations": [
            {
                "from": "system" | "human" | "gpt",
                "value": <message>,
            },
            ...
        ]
    }
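
You can also apply the transform directly, outside of the ``chat_dataset`` builder. A minimal sketch
(the exact output structure may vary between versions):

.. code-block:: python

    from torchtune.data import ShareGPTToMessages

    sample = {
        "conversations": [
            {"from": "human", "value": "What is the answer to life?"},
            {"from": "gpt", "value": "The answer is 42."},
        ]
    }
    # Convert "from"/"value" entries into a list of torchtune Messages
    transform = ShareGPTToMessages()
    messages = transform(sample)["messages"]
    print([m.role for m in messages])
    # ['user', 'assistant']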

You can specify ``conversation_style=sharegpt`` in code or config:

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import chat_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = chat_dataset(
        tokenizer=g_tokenizer,
        source="json",
        conversation_column="conversations",
        conversation_style="sharegpt",
        data_files="data/my_data.json",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: json
      conversation_column: conversations
      conversation_style: sharegpt
      data_files: data/my_data.json
      split: train

``"openai"``
^^^^^^^^^^^^
The associated message transform is :class:`~torchtune.data.OpenAIToMessages`. The expected format is:

.. code-block:: python

    {
        "messages": [
            {
                "role": "system" | "user" | "assistant",
                "content": <message>,
            },
            ...
        ]
    }
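
The transform can likewise be applied directly, mirroring the ShareGPT sketch above:

.. code-block:: python

    from torchtune.data import OpenAIToMessages

    sample = {
        "messages": [
            {"role": "user", "content": "What is the answer to life?"},
            {"role": "assistant", "content": "The answer is 42."},
        ]
    }
    # Convert "role"/"content" entries into a list of torchtune Messages
    messages = OpenAIToMessages()(sample)["messages"]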

You can specify ``conversation_style=openai`` in code or config:

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import chat_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = chat_dataset(
        tokenizer=g_tokenizer,
        source="json",
        conversation_column="messages",
        conversation_style="openai",
        data_files="data/my_data.json",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: json
      conversation_column: messages
      conversation_style: openai
      data_files: data/my_data.json
      split: train

If your dataset does not fit one of the above conversation styles, then you will need to create a custom message transform.
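
As a rough sketch, a message transform is a callable that takes a raw sample and returns it with a
``"messages"`` key holding a list of :class:`~torchtune.data.Message` objects; the ``"q"`` and ``"a"``
column names below are hypothetical:

.. code-block:: python

    from typing import Any, Mapping

    from torchtune.data import Message

    class QAToMessages:
        """Hypothetical transform for a dataset with "q" and "a" columns."""

        def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
            return {
                "messages": [
                    # masked=True keeps the user prompt out of the loss
                    Message(role="user", content=sample["q"], masked=True),
                    Message(role="assistant", content=sample["a"]),
                ]
            }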


Renaming columns
----------------

Use ``conversation_column`` to indicate which column contains your conversation data. For example, if your data stores it under ``"dialogue"``:

.. code-block:: python

    # data/my_data.json
    [
        {
            "dialogue": [
                {
                    "from": "human",
                    "value": "What is the answer to life?"
                },
                {
                    "from": "gpt",
                    "value": "The answer is 42."
                },
                {
                    "from": "human",
                    "value": "That's ridiculous"
                },
                {
                    "from": "gpt",
                    "value": "Oh I know."
                }
            ]
        }
    ]

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import chat_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = chat_dataset(
        tokenizer=g_tokenizer,
        source="json",
        conversation_column="dialogue",
        conversation_style="sharegpt",
        data_files="data/my_data.json",
        split="train",
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.chat_dataset
      source: json
      conversation_column: dialogue
      conversation_style: sharegpt
      data_files: data/my_data.json
      split: train


Chat templates
--------------

Chat templates are defined the same way as instruct templates in :func:`~torchtune.datasets.instruct_dataset`. See :ref:`instruct_template` for more info.


Built-in chat datasets
----------------------
- :func:`~torchtune.datasets.slimorca_dataset`
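
A minimal sketch of using the SlimOrca builder directly (tokenizer setup as in the earlier examples):

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import slimorca_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = slimorca_dataset(tokenizer=g_tokenizer)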