.. _instruct_dataset_usage_label:

=================
Instruct Datasets
=================

Instruction tuning involves training an LLM to perform specific task(s). This typically takes the form
of a user command or prompt and the assistant's response, along with an optional system prompt that
describes the task at hand. This is more structured than the freeform text that models are typically
pre-trained on, where they learn to predict the next token rather than to complete a task.

The primary entry point for fine-tuning with instruct datasets in torchtune is the
:func:`~torchtune.datasets.instruct_dataset` builder. This lets you specify a local or Hugging Face
dataset that follows the instruct data format directly from the config and train your LLM on it.

.. _example_instruct:

Example instruct dataset
------------------------

Here is an example of an instruct dataset to fine-tune for a grammar correction task.

.. code-block:: bash

    head data/my_data.csv
    # incorrect,correct
    # This are a cat,This is a cat.

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import instruct_dataset

    g_tokenizer = gemma_tokenizer(
        path="/tmp/gemma-7b/tokenizer.model",
        prompt_template="torchtune.data.GrammarErrorCorrectionTemplate",
        max_seq_len=8192,
    )
    ds = instruct_dataset(
        tokenizer=g_tokenizer,
        source="csv",
        data_files="data/my_data.csv",
        split="train",
        # By default, user prompt is ignored in loss. Set to True to include it
        train_on_input=True,
        # Prepend a system message to every sample
        new_system_prompt="You are an AI assistant. ",
        # Use columns in our dataset instead of default
        column_map={"input": "incorrect", "output": "correct"},
    )
    tokenized_dict = ds[0]
    tokens, labels = tokenized_dict["tokens"], tokenized_dict["labels"]
    print(g_tokenizer.decode(tokens))
    # You are an AI assistant. Correct this to standard English:This are a cat---\nCorrected:This is a cat.
    print(labels)  # System message is masked out, but not user message
    # [-100, -100, -100, -100, -100, -100, 27957, 736, 577, ...]

.. code-block:: yaml

    # In config
    tokenizer:
      _component_: torchtune.models.gemma.gemma_tokenizer
      path: /tmp/gemma-7b/tokenizer.model
      prompt_template: torchtune.data.GrammarErrorCorrectionTemplate
      max_seq_len: 8192

    dataset:
      _component_: torchtune.datasets.instruct_dataset
      source: csv
      data_files: data/my_data.csv
      split: train
      train_on_input: True
      new_system_prompt: You are an AI assistant.
      column_map:
        input: incorrect
        output: correct

Instruct dataset format
-----------------------

Instruct datasets are expected to follow an input-output format, where the user prompt is in one
column and the assistant response is in another column.

.. code-block:: text

    |  input          |  output          |
    |-----------------|------------------|
    | "user prompt"   | "model response" |

As an example, you can see the schema of the `C4 200M dataset <https://huggingface.co/datasets/liweili/c4_200m>`_.

Loading instruct datasets from Hugging Face
-------------------------------------------

Pass the dataset repo name to ``source``, which is then passed into Hugging Face's ``load_dataset``.
For most datasets, you will also need to specify the ``split``.

.. code-block:: python

    # In code
    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import instruct_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = instruct_dataset(
        tokenizer=g_tokenizer,
        source="liweili/c4_200m",
        split="train"
    )

.. code-block:: yaml

    # In config
    tokenizer:
      _component_: torchtune.models.gemma.gemma_tokenizer
      path: /tmp/gemma-7b/tokenizer.model

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.instruct_dataset
      source: liweili/c4_200m
      split: train

This will use the default column names "input" and "output". To change the column names, use the
``column_map`` argument (see :ref:`column_map`).

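
Before pointing ``source`` at a dataset, it can help to confirm that its rows expose the columns the
builder will look for. The following is a minimal, standard-library sketch of that check. The helper
``resolve_columns`` and the constant ``DEFAULT_COLUMNS`` are hypothetical names introduced for
illustration; they are not part of torchtune and do not reflect how the library implements the lookup
internally.

.. code-block:: python

    from typing import Dict, Mapping, Optional, Tuple

    # Default column names of the instruct format described above.
    DEFAULT_COLUMNS = ("input", "output")


    def resolve_columns(
        row: Mapping[str, str],
        column_map: Optional[Dict[str, str]] = None,
    ) -> Tuple[str, str]:
        """Return (user prompt, model response) from one raw row.

        Hypothetical helper for illustration only; not a torchtune API.
        """
        column_map = column_map or {}
        values = []
        for default_name in DEFAULT_COLUMNS:
            # Fall back to the default name when no mapping is given for it.
            actual_name = column_map.get(default_name, default_name)
            if actual_name not in row:
                raise KeyError(
                    f"expected column '{actual_name}' for '{default_name}', "
                    f"found {sorted(row)}"
                )
            values.append(row[actual_name])
        return values[0], values[1]


    # One row shaped like the grammar correction CSV from the first example.
    row = {"incorrect": "This are a cat", "correct": "This is a cat."}
    pair = resolve_columns(row, {"input": "incorrect", "output": "correct"})
    print(pair)
    # ('This are a cat', 'This is a cat.')

Running a check like this over a handful of rows surfaces a missing or misspelled column name early,
before tokenization starts.
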
Loading local and remote instruct datasets
------------------------------------------

To load a local dataset, or a remote one over HTTPS, that follows the instruct format, you need to
specify the ``source``, ``data_files`` and ``split`` arguments. See Hugging Face's ``load_dataset``
`documentation <https://huggingface.co/docs/datasets/main/en/loading#local-and-remote-files>`_
for more details on loading local or remote files.

.. code-block:: python

    # In code
    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import instruct_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = instruct_dataset(
        tokenizer=g_tokenizer,
        source="json",
        data_files="data/my_data.json",
        split="train",
    )

.. code-block:: yaml

    # In config
    tokenizer:
      _component_: torchtune.models.gemma.gemma_tokenizer
      path: /tmp/gemma-7b/tokenizer.model

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.instruct_dataset
      source: json
      data_files: data/my_data.json
      split: train

.. _column_map:

Renaming columns
----------------

You can remap the default column names to the column names in your dataset by specifying
``column_map`` as ``{"<default column>": "<column in your dataset>"}``. The default column names
are detailed in each of the dataset builders (see :func:`~torchtune.datasets.instruct_dataset` and
:func:`~torchtune.datasets.chat_dataset` as examples).

For example, if the default column names are "input" and "output" and your dataset uses "prompt" and
"response" instead, then set ``column_map = {"input": "prompt", "output": "response"}``.

.. code-block:: python

    # data/my_data.json
    [
        {"prompt": "hello world", "response": "bye world"},
        {"prompt": "are you a robot", "response": "no, I am an AI assistant"},
        ...
    ]

.. code-block:: python

    from torchtune.models.gemma import gemma_tokenizer
    from torchtune.datasets import instruct_dataset

    g_tokenizer = gemma_tokenizer("/tmp/gemma-7b/tokenizer.model")
    ds = instruct_dataset(
        tokenizer=g_tokenizer,
        source="json",
        data_files="data/my_data.json",
        split="train",
        column_map={"input": "prompt", "output": "response"},
    )

.. code-block:: yaml

    # Tokenizer is passed into the dataset in the recipe
    dataset:
      _component_: torchtune.datasets.instruct_dataset
      source: json
      data_files: data/my_data.json
      split: train
      column_map:
        input: prompt
        output: response

.. _instruct_template:

Instruct templates
------------------

Typically for instruct datasets, you will want to add a :class:`~torchtune.data.PromptTemplate` to
provide task-relevant information. For example, for a grammar correction task, we may want to use a
prompt template like :class:`~torchtune.data.GrammarErrorCorrectionTemplate` to structure each of our
samples. Prompt templates are passed into the tokenizer and automatically applied to the dataset you
are fine-tuning on. See :ref:`using_prompt_templates` for more details.

Built-in instruct datasets
--------------------------

- :func:`~torchtune.datasets.alpaca_dataset`
- :func:`~torchtune.datasets.grammar_dataset`
- :func:`~torchtune.datasets.samsum_dataset`