.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "intermediate/optimizer_step_in_backward_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_intermediate_optimizer_step_in_backward_tutorial.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_intermediate_optimizer_step_in_backward_tutorial.py:


How to save memory by fusing the optimizer step into the backward pass
========================================================================

Hello there! This tutorial aims to showcase one way of reducing the memory
footprint of a training loop by reducing the memory taken by the *gradients*.
Say you have a model and you're interested in ways to optimize memory to
avoid ``Out of Memory`` (OOM) errors or simply to squeeze more out of your
GPU. Well, you *might* be in luck (if gradients take up a sizable portion of
your memory and you do not need to do gradient accumulation). We will explore
the following:

1. What takes up memory during your training or finetuning loop,
2. How to capture and visualize memory snapshots to determine the bottleneck,
3. The new ``Tensor.register_post_accumulate_grad_hook(hook)`` API, and finally,
4. How everything fits together in 10 lines to achieve memory savings.

To run this tutorial, you will need:

*  PyTorch 2.1.0 or newer with ``torchvision``
*  1 CUDA GPU if you'd like to run the memory visualizations locally.
   Otherwise, this technique would benefit similarly on any device.

Let us start by importing the required modules and models. We will use a
vision transformer model from torchvision, but feel free to substitute
with your own model. We will also use ``torch.optim.Adam`` as our optimizer,
but, again, feel free to substitute with your own optimizer.

.. GENERATED FROM PYTHON SOURCE LINES 31-39

.. code-block:: default

    import torch
    from torchvision import models
    from pickle import dump

    model = models.vit_l_16(weights='DEFAULT').cuda()
    optimizer = torch.optim.Adam(model.parameters())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Downloading: "https://download.pytorch.org/models/vit_l_16-852ce7e3.pth" to /var/lib/ci-user/.cache/torch/hub/checkpoints/vit_l_16-852ce7e3.pth

The technique itself fits in roughly 10 lines. PyTorch 2.1 introduced
``Tensor.register_post_accumulate_grad_hook(hook)``, which runs a hook on a
parameter as soon as its gradient has finished accumulating during the
backward pass. By giving every parameter its own small optimizer and stepping
that optimizer from the hook, each gradient can be applied and freed
immediately instead of lingering until the end of the training step.

.. code-block:: default

    # Instead of one optimizer over all parameters, create one optimizer per
    # parameter (``foreach=False``, since each optimizer now only sees a single
    # tensor) and step it from a post-accumulate-grad hook.
    IMAGE_SIZE = 224  # vit_l_16 expects 224x224 inputs

    optimizer_dict = {p: torch.optim.Adam([p], foreach=False) for p in model.parameters()}

    def optimizer_hook(parameter) -> None:
        optimizer_dict[parameter].step()
        optimizer_dict[parameter].zero_grad()

    # Register the hook onto every parameter
    for p in model.parameters():
        p.register_post_accumulate_grad_hook(optimizer_hook)

    # Now remember our previous ``train()`` function? Since the optimizer has been
    # fused into the backward, we can remove the optimizer step and zero_grad calls.
    def train(model):
        # create our fake image input: tensor shape is batch_size, channels, height, width
        fake_image = torch.rand(1, 3, IMAGE_SIZE, IMAGE_SIZE).cuda()

        # call our forward and backward
        loss = model.forward(fake_image)
        loss.sum().backward()

        # optimizer update --> no longer needed!
        # optimizer.step()
        # optimizer.zero_grad()

.. GENERATED FROM PYTHON SOURCE LINES 199-211

That took about 10 lines of changes in our sample model, which is neat.
However, for real models, it could be a fairly intrusive change to switch
out the optimizer for an optimizer dictionary, especially for those who use
``LRScheduler``\ s or manipulate optimizer configuration throughout the
training epochs. Working out this API with those changes will be more
involved and will likely require moving more configuration into global
state, but it should not be impossible.
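For example, if your loop uses an ``LRScheduler``, one workable (if manual)
pattern is to keep a parallel dictionary of per-parameter schedulers alongside
``optimizer_dict``. The sketch below is not part of the original recipe; the
name ``scheduler_dict`` and the ``StepLR`` hyperparameters are illustrative
assumptions.

.. code-block:: default

    # Hypothetical sketch: per-parameter LR schedulers to pair with the
    # per-parameter optimizers above. ``scheduler_dict`` and the StepLR
    # hyperparameters are illustrative choices, not part of this tutorial.
    from torch.optim.lr_scheduler import StepLR

    scheduler_dict = {
        p: StepLR(optimizer_dict[p], step_size=30, gamma=0.1)
        for p in model.parameters()
    }

    # The optimizer steps still happen inside the backward pass via the hooks;
    # the schedulers are stepped wherever you previously called
    # scheduler.step(), for example once per epoch.
    def step_schedulers():
        for scheduler in scheduler_dict.values():
            scheduler.step()

The same dictionary-of-objects pattern extends to other per-optimizer
configuration, such as adjusting learning rates or weight decay mid-training.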
That said, a next step for PyTorch is to make this API easier to adopt with
LRSchedulers and other features you are already used to. But let me get back
to convincing you that this technique is worth it. We will consult our friend,
the memory snapshot.

.. GENERATED FROM PYTHON SOURCE LINES 211-231

.. code-block:: default

    # delete optimizer memory from before to get a clean slate for the next
    # memory snapshot
    del optimizer

    # tell CUDA to start recording memory allocations
    torch.cuda.memory._record_memory_history(enabled='all')

    # train 3 steps. note that we no longer pass the optimizer into train()
    for _ in range(3):
        train(model)

    # save a snapshot of the memory allocations
    s = torch.cuda.memory._snapshot()
    with open("snapshot-opt-in-bwd.pickle", "wb") as f:
        dump(s, f)

    # tell CUDA to stop recording memory allocations now
    torch.cuda.memory._record_memory_history(enabled=None)

.. GENERATED FROM PYTHON SOURCE LINES 232-269

Yes, take some time to drag your snapshot into the CUDA Memory Visualizer
(https://pytorch.org/memory_viz).

.. figure:: /_static/img/optim_step_in_bwd/snapshot_opt_in_bwd.jpg
   :alt: snapshot.png loaded into CUDA Memory Visualizer

Several major observations:

1. There is no more optimizer step! Right...we fused that into the backward.
2. Likewise, the backward drags longer and there are more random allocations
   for intermediates. This is expected, as the optimizer step requires
   intermediates.
3. Most importantly! The peak memory is lower! It is now ~4GB (which I hope
   maps closely to your earlier expectation).

Note that there is no longer any big chunk of memory allocated for the
gradients compared to before, accounting for ~1.2GB of memory savings.
Instead, we free each gradient very quickly after it has been computed by
moving the optimizer step as far ahead as we can. Woohoo! By the way, the
other ~1.2GB of memory savings comes from breaking apart the optimizer into
per-parameter optimizers, so the optimizer intermediates have proportionally
shrunk. This detail is *less important* than the gradient memory savings, as
you can get the optimizer intermediates savings from just turning
``foreach=False``, even without this technique.

You may be correctly wondering: if we saved 2.4GB of memory, why is the peak
memory NOT 6GB - 2.4GB = 3.6GB? Well, the peak has moved! The peak is now near
the start of the backward step, when we still have activations in memory,
whereas before, the peak was during the optimizer step, when the activations
had already been freed. The ~0.4GB difference between the ~4.0GB we observe
and the ~3.6GB naive estimate is thus due to that activations memory. One can
then imagine that this technique can be coupled with activations checkpointing
for more memory wins.
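If you would like a quick numerical sanity check to complement the snapshots,
PyTorch's CUDA memory statistics can report the peak allocation directly. The
following is a small sketch and not part of the original tutorial; it assumes
the hook-registered ``model`` and the fused ``train()`` defined above.

.. code-block:: default

    # Sketch (not from the original tutorial): confirm the lower peak memory
    # numerically instead of, or in addition to, the memory visualizer.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    for _ in range(3):
        train(model)

    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"peak allocated CUDA memory: {peak_gib:.2f} GiB")

Running the same check with the original optimizer-in-the-training-loop setup
should report a correspondingly higher peak.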
Conclusion
""""""""""

In this tutorial, we learned about the memory saving technique of fusing the
optimizer into the backward step through the new
``Tensor.register_post_accumulate_grad_hook()`` API and *when* to apply this
technique (when gradients memory is significant). Along the way, we also
learned about memory snapshots, which are generally useful in memory
optimization.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 28.538 seconds)


.. _sphx_glr_download_intermediate_optimizer_step_in_backward_tutorial.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: optimizer_step_in_backward_tutorial.py <optimizer_step_in_backward_tutorial.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: optimizer_step_in_backward_tutorial.ipynb <optimizer_step_in_backward_tutorial.ipynb>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_