.. _config_tutorial_label:

=================
All About Configs
=================

This deep-dive will guide you through writing configs for running recipes.

.. grid:: 2

    .. grid-item-card:: :octicon:`mortar-board;1em;` What this deep-dive will cover

        * How to write a YAML config and run a recipe with it
        * How to use :code:`instantiate` and :code:`parse` APIs
        * How to effectively use configs and CLI overrides for running recipes

    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

        * Be familiar with the :ref:`overview of torchtune<overview_label>`
        * Make sure to :ref:`install torchtune<install_label>`
        * Understand the :ref:`fundamentals of recipes<recipe_deepdive>`

Where do parameters live?
-------------------------

There are two primary entry points for you to configure parameters: **configs** and **CLI overrides**.
Configs are YAML files that define all the parameters needed to run a recipe within a single location.
They are the single source of truth for reproducing a run. The config parameters can be overridden on the
command-line using :code:`tune` for quick changes and experimentation without modifying the config.

Writing configs
---------------

Configs serve as the primary entry point for running recipes in torchtune. They are expected to be
YAML files and they simply list out values for parameters you want to define for a particular run.

.. code-block:: yaml

    seed: null
    shuffle: True
    device: cuda
    dtype: fp32
    enable_fsdp: True
    ...

Configuring components using :func:`instantiate<torchtune.config.instantiate>`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Many fields will require specifying torchtune objects with associated keyword arguments as parameters.
Models, datasets, optimizers, and loss functions are common examples of this. You can easily do this using
the :code:`_component_` subfield. In :code:`_component_`, you need to specify the dotpath of the object you
wish to instantiate in the recipe. The dotpath is the exact path you would use to import the object
normally in a Python file. For example, to specify the :func:`~torchtune.datasets.alpaca_dataset` in your
config with custom arguments:

.. code-block:: yaml

    dataset:
      _component_: torchtune.datasets.alpaca_dataset
      train_on_input: False

Here, we are changing the default value for :code:`train_on_input` from :code:`True` to :code:`False`.

Once you've specified the :code:`_component_` in your config, you can create an instance of the specified
object in your recipe's setup like so:

.. code-block:: python

    from torchtune import config

    # Access the dataset field and create the object instance
    dataset = config.instantiate(cfg.dataset)

This will automatically use any keyword arguments specified in the fields under :code:`dataset`.

As written, the preceding example will actually throw an error. If you look at the signature of
:func:`~torchtune.datasets.alpaca_dataset`, you'll notice that we're missing a required positional
argument, the tokenizer. Since this is another configurable torchtune object, let's understand how to
handle this by taking a look at the :func:`~torchtune.config.instantiate` API.

.. code-block:: python

    def instantiate(
        config: DictConfig,
        *args: Any,
        **kwargs: Any,
    )

:func:`~torchtune.config.instantiate` also accepts positional and keyword arguments and automatically uses
them along with the config when creating the object. This means we can not only pass in the tokenizer, but
also add additional keyword arguments not specified in the config if we'd like:

.. code-block:: yaml

    # Tokenizer is needed for the dataset, configure it first
    tokenizer:
      _component_: torchtune.models.llama2.llama2_tokenizer
      path: /tmp/tokenizer.model

    dataset:
      _component_: torchtune.datasets.alpaca_dataset

.. code-block:: python

    # Note the API of the tokenizer we specified - we need to pass in a path
    def llama2_tokenizer(path: str) -> Llama2Tokenizer:

    # Note the API of the dataset we specified - we need to pass in a model tokenizer
    # and any optional keyword arguments
    def alpaca_dataset(
        tokenizer: ModelTokenizer,
        train_on_input: bool = True,
        max_seq_len: int = 512,
    ) -> SFTDataset:

    from torchtune import config

    # Since we've already specified the path in the config, we don't need to pass
    # it in
    tokenizer = config.instantiate(cfg.tokenizer)
    # We pass in the instantiated tokenizer as the first required argument, then
    # we change an optional keyword argument
    dataset = config.instantiate(
        cfg.dataset,
        tokenizer,
        train_on_input=False,
    )

Note that additional keyword arguments will overwrite any duplicated keys in the config.
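If you want to build intuition for :func:`~torchtune.config.instantiate` outside of a full recipe, here is
a minimal, self-contained sketch. It assumes the config is an OmegaConf :code:`DictConfig` (as the
signature above suggests) and builds the same structure as the YAML directly in Python; the tokenizer path
is a placeholder you would point at a real file:

.. code-block:: python

    from omegaconf import OmegaConf
    from torchtune import config

    # Build the same structure as the YAML above directly in Python.
    # The tokenizer path is a placeholder - swap in a real tokenizer file to run this.
    cfg = OmegaConf.create({
        "tokenizer": {
            "_component_": "torchtune.models.llama2.llama2_tokenizer",
            "path": "/tmp/tokenizer.model",
        },
        "dataset": {
            "_component_": "torchtune.datasets.alpaca_dataset",
        },
    })

    # Mirrors what a recipe's setup would do
    tokenizer = config.instantiate(cfg.tokenizer)
    dataset = config.instantiate(cfg.dataset, tokenizer, train_on_input=False)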
Referencing other config fields with interpolations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sometimes you need to use the same value more than once for multiple fields. You can use *interpolations*
to reference another field, and :func:`~torchtune.config.instantiate` will automatically resolve it for
you.

.. code-block:: yaml

    output_dir: /tmp/alpaca-llama2-finetune
    metric_logger:
      _component_: torchtune.training.metric_logging.DiskLogger
      log_dir: ${output_dir}
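Since configs are OmegaConf :code:`DictConfig` objects, ``${output_dir}`` behaves like an OmegaConf-style
interpolation that is looked up on the root of the config. A small sketch of how such an interpolation
resolves when using OmegaConf directly, outside of any recipe:

.. code-block:: python

    from omegaconf import OmegaConf

    cfg = OmegaConf.create({
        "output_dir": "/tmp/alpaca-llama2-finetune",
        "metric_logger": {
            "_component_": "torchtune.training.metric_logging.DiskLogger",
            "log_dir": "${output_dir}",
        },
    })

    # The interpolation is resolved against the root config when accessed
    print(cfg.metric_logger.log_dir)  # /tmp/alpaca-llama2-finetune

    # Updating the referenced field updates every field that points at it
    cfg.output_dir = "/tmp/new-run"
    print(cfg.metric_logger.log_dir)  # /tmp/new-run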
Validating your config
^^^^^^^^^^^^^^^^^^^^^^

We provide a convenient CLI utility, :ref:`tune validate<validate_cli_label>`, to quickly verify that your
config is well-formed and all components can be instantiated properly. You can also pass in overrides if
you want to test out the exact commands you will run your experiments with. If any parameters are not
well-formed, :ref:`tune validate<validate_cli_label>` will list out all the locations where an error was
found.

.. code-block:: bash

    tune cp llama2/7B_lora_single_device ./my_config.yaml
    tune validate ./my_config.yaml

Best practices for writing configs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's discuss some guidelines for writing configs to get the most out of them.

Airtight configs
""""""""""""""""

While it may be tempting to put as much as you can in the config to give you maximum flexibility in
switching parameters for your experiments, we encourage you to include only fields in the config that will
be used or instantiated in the recipe. This ensures full clarity on the options a recipe was run with and
will make it significantly easier to debug.

.. code-block:: yaml

    # don't do this
    alpaca_dataset:
      _component_: torchtune.datasets.alpaca_dataset
    slimorca_dataset:
      ...

    # do this
    dataset:
      # change this in config or override when needed
      _component_: torchtune.datasets.alpaca_dataset

Use public APIs only
""""""""""""""""""""

If a component you wish to specify in a config is located in a private file, use the public dotpath in
your config. These components are typically exposed in their parent module's :code:`__init__.py` file.
This way, you can guarantee the stability of the API you are using in your config. There should be no
underscores in your component dotpath.

.. code-block:: yaml

    # don't do this
    dataset:
      _component_: torchtune.datasets._alpaca.alpaca_dataset

    # do this
    dataset:
      _component_: torchtune.datasets.alpaca_dataset

.. _cli_override:

Command-line overrides
----------------------

Configs are the primary location to collect all your parameters to run a recipe, but sometimes you may
want to quickly try different values without having to update the config itself. To enable quick
experimentation, you can specify override values to parameters in your config via the :code:`tune`
command. These should be specified as key-value pairs, :code:`k1=v1 k2=v2 ...`.

For example, to run the :ref:`LoRA single-device finetuning <lora_finetune_recipe_label>` recipe with
custom model and tokenizer directories, you can provide overrides:

.. code-block:: bash

    tune run lora_finetune_single_device \
        --config llama2/7B_lora_single_device \
        checkpointer.checkpoint_dir=/home/my_model_checkpoint \
        checkpointer.checkpoint_files=['file_1','file_2'] \
        tokenizer.path=/home/my_tokenizer_path

Overriding components
^^^^^^^^^^^^^^^^^^^^^

If you would like to override a class or function in the config that is instantiated via the
:code:`_component_` field, you can do so by assigning to the parameter name directly. Any nested fields in
the components can be overridden with dot notation.

.. code-block:: yaml

    dataset:
      _component_: torchtune.datasets.alpaca_dataset

.. code-block:: bash

    # Change to slimorca_dataset and set train_on_input to True
    tune run lora_finetune_single_device --config my_config.yaml \
        dataset=torchtune.datasets.slimorca_dataset dataset.train_on_input=True

Removing config fields
^^^^^^^^^^^^^^^^^^^^^^

You may need to remove certain parameters from the config when changing components through overrides that
require different keyword arguments. You can do so by using the ``~`` flag and specifying the dotpath of
the config field you would like to remove. For example, if you want to override a built-in config and use
the `bitsandbytes.optim.PagedAdamW8bit <https://huggingface.co/docs/bitsandbytes/main/en/reference/optim/adamw#bitsandbytes.optim.PagedAdamW8bit>`_
optimizer, you may need to delete parameters like ``foreach`` which are specific to PyTorch optimizers.
Note that this example requires that you have
`bitsandbytes <https://github.com/bitsandbytes-foundation/bitsandbytes>`_ installed.

.. code-block:: yaml

    # In configs/llama3/8B_full.yaml
    optimizer:
      _component_: torch.optim.AdamW
      lr: 2e-5
      foreach: False

.. code-block:: bash

    # Change to PagedAdamW8bit and remove foreach
    tune run --nproc_per_node 4 full_finetune_distributed --config llama3/8B_full \
        optimizer=bitsandbytes.optim.PagedAdamW8bit ~optimizer.foreach
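Putting the two overrides together, the optimizer section the recipe would actually see looks roughly like
the following sketch, assuming the built-in config defines no other optimizer fields: the component is
swapped, ``foreach`` is removed, and the remaining keyword arguments such as ``lr`` are kept.

.. code-block:: yaml

    # Effective optimizer config after the overrides above
    optimizer:
      _component_: bitsandbytes.optim.PagedAdamW8bit
      lr: 2e-5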