.. _packing_usage_label:

==============
Sample packing
==============

Sample packing involves concatenating multiple samples from your dataset into a single sequence, up to a maximum
sequence length. This requires some pre-processing of the dataset, which may
slow down time-to-first-batch, but can yield significant training speedups
depending on the dataset. In torchtune, sample packing is done by iterating through your dataset and performing
greedy packing upon dataset initialization. You can use sample packing with any of the single dataset builders by passing in
:code:`packed=True`.
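
Conceptually, greedy packing appends samples to the current pack until adding the next sample would exceed the
maximum sequence length, at which point a new pack is started. The snippet below is a simplified sketch of that idea,
not torchtune's actual ``PackedDataset`` implementation, which additionally tracks the per-sample boundaries needed
for the document masks and position IDs described later on this page:

.. code-block:: python

    def greedy_pack(tokenized_samples, max_seq_len):
        # Simplified illustration: assumes each individual sample already
        # fits within max_seq_len.
        packs, current = [], []
        for tokens in tokenized_samples:
            if current and len(current) + len(tokens) > max_seq_len:
                packs.append(current)
                current = []
            current.extend(tokens)
        if current:
            packs.append(current)
        return packs

    print(greedy_pack([[1, 2, 3], [4, 5, 6, 7], [8, 9]], max_seq_len=8))
    # [[1, 2, 3, 4, 5, 6, 7], [8, 9]]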

To set the maximum sequence length to pack to, define ``max_seq_len`` on your tokenizer.

.. code-block:: python

    from torchtune.datasets import alpaca_dataset, PackedDataset
    from torchtune.models.llama3 import llama3_tokenizer

    # Load in tokenizer
    tokenizer = llama3_tokenizer(
        path="/tmp/Llama-3.2-1B-Instruct/original/tokenizer.model",
        max_seq_len=8192,
    )
    dataset = alpaca_dataset(
        tokenizer=tokenizer,
        packed=True,
    )
    print(isinstance(dataset, PackedDataset))  # True

.. code-block:: yaml

    # YAML config
    tokenizer:
      _component_: torchtune.models.llama3.llama3_tokenizer
      path: /tmp/Llama-3.2-1B-Instruct/original/tokenizer.model
      max_seq_len: 8192

    dataset:
      _component_: torchtune.datasets.alpaca_dataset
      packed: True

.. code-block:: bash

    # Command line
    tune run full_finetune_single_device --config llama3_2/1B_full_single_device \
    dataset.packed=True tokenizer.max_seq_len=8192

When sample packing is enabled, torchtune will automatically handle document masking and relative position IDs
to prevent tokens from different samples from attending to each other. This is done via PyTorch's `Flex Attention <https://pytorch.org/blog/flexattention/#document-maskingjagged-sequences>`_,
which enables the use of flash attention with non-causal masks. If your hardware does not support Flex Attention
(for CUDA devices, it must be Turing or above), standard SDPA with memory-efficient attention will be used as a fallback,
while still retaining the document masking and relative position IDs.
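
For reference, the document-masking pattern from the Flex Attention blog post looks roughly like the sketch below.
The ``document_ids`` tensor, sequence length, and attention shapes are illustrative assumptions rather than
torchtune internals, and running it requires a CUDA device with PyTorch 2.5 or later:

.. code-block:: python

    import torch
    from torch.nn.attention.flex_attention import create_block_mask, flex_attention

    seq_len = 256
    # Hypothetical per-token document IDs for a packed sequence: the first 100
    # tokens come from one sample, the remaining 156 from another.
    document_ids = torch.cat([
        torch.zeros(100, dtype=torch.int, device="cuda"),
        torch.ones(156, dtype=torch.int, device="cuda"),
    ])

    def document_causal_mask(b, h, q_idx, kv_idx):
        # Attend only to earlier tokens within the same packed sample.
        return (document_ids[q_idx] == document_ids[kv_idx]) & (q_idx >= kv_idx)

    block_mask = create_block_mask(
        document_causal_mask, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len
    )

    # Toy query/key/value tensors of shape (batch, num_heads, seq_len, head_dim)
    q = k = v = torch.randn(1, 8, seq_len, 64, device="cuda")
    out = flex_attention(q, k, v, block_mask=block_mask)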