.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "beginner/blitz/autograd_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_beginner_blitz_autograd_tutorial.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_beginner_blitz_autograd_tutorial.py:


A Gentle Introduction to ``torch.autograd``
===========================================

``torch.autograd`` is PyTorch’s automatic differentiation engine that powers
neural network training. In this section, you will get a conceptual
understanding of how autograd helps a neural network train.

Background
~~~~~~~~~~
Neural networks (NNs) are a collection of nested functions that are
executed on some input data. These functions are defined by *parameters*
(consisting of weights and biases), which in PyTorch are stored in
tensors.

Training a NN happens in two steps:

**Forward Propagation**: In forward prop, the NN makes its best guess
about the correct output. It runs the input data through each of its
functions to make this guess.

**Backward Propagation**: In backprop, the NN adjusts its parameters
proportionate to the error in its guess. It does this by traversing
backwards from the output, collecting the derivatives of the error with
respect to the parameters of the functions (*gradients*), and optimizing
the parameters using gradient descent. For a more detailed walkthrough
of backprop, check out this `video from
3Blue1Brown <https://www.youtube.com/watch?v=tIeHLnjs5U8>`__.


Usage in PyTorch
~~~~~~~~~~~~~~~~
Let's take a look at a single training step.
For this example, we load a pretrained resnet18 model from ``torchvision``.
We create a random data tensor to represent a single image with 3 channels, and height & width of 64,
and its corresponding ``label`` initialized to some random values. Label in pretrained models has
shape (1,1000).

.. note::
    This tutorial works only on the CPU and will not work on GPU devices (even if tensors are moved to CUDA).

.. GENERATED FROM PYTHON SOURCE LINES 46-52

.. code-block:: default

    import torch
    from torchvision.models import resnet18, ResNet18_Weights
    model = resnet18(weights=ResNet18_Weights.DEFAULT)
    data = torch.rand(1, 3, 64, 64)
    labels = torch.rand(1, 1000)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /var/lib/ci-user/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth

      0%|          | 0.00/44.7M [00:00<?, ?B/s]
     30%|##9       | 13.2M/44.7M [00:00<00:00, 138MB/s]
     60%|######    | 27.0M/44.7M [00:00<00:00, 142MB/s]
     92%|#########1| 41.0M/44.7M [00:00<00:00, 144MB/s]
    100%|##########| 44.7M/44.7M [00:00<00:00, 143MB/s]


.. GENERATED FROM PYTHON SOURCE LINES 53-56

Next, we run the input data through the model through each of its layers to make a prediction.
This is the **forward pass**.


.. GENERATED FROM PYTHON SOURCE LINES 56-59

.. code-block:: default


    prediction = model(data) # forward pass


.. GENERATED FROM PYTHON SOURCE LINES 60-65

We use the model's prediction and the corresponding label to calculate the error (``loss``).
The next step is to backpropagate this error through the network.
Backward propagation is kicked off when we call ``.backward()`` on the error tensor.
Autograd then calculates and stores the gradients for each model parameter in the parameter's ``.grad`` attribute.


.. GENERATED FROM PYTHON SOURCE LINES 65-69

.. code-block:: default


    loss = (prediction - labels).sum()
    loss.backward() # backward pass


.. GENERATED FROM PYTHON SOURCE LINES 70-73

Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and `momentum <https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d>`__ of 0.9.
We register all the parameters of the model in the optimizer.


.. GENERATED FROM PYTHON SOURCE LINES 73-76

.. code-block:: default


    optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)


.. GENERATED FROM PYTHON SOURCE LINES 77-79

Finally, we call ``.step()`` to initiate gradient descent. The optimizer adjusts each parameter by its gradient stored in ``.grad``.


.. GENERATED FROM PYTHON SOURCE LINES 79-82

.. code-block:: default


    optim.step() #gradient descent


.. GENERATED FROM PYTHON SOURCE LINES 83-86

At this point, you have everything you need to train your neural network.
The below sections detail the workings of autograd - feel free to skip them.


.. GENERATED FROM PYTHON SOURCE LINES 89-91

--------------


.. GENERATED FROM PYTHON SOURCE LINES 94-99

Differentiation in Autograd
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's take a look at how ``autograd`` collects gradients. We create two tensors ``a`` and ``b`` with
``requires_grad=True``. This signals to ``autograd`` that every operation on them should be tracked.


.. GENERATED FROM PYTHON SOURCE LINES 99-105

.. code-block:: default


    import torch

    a = torch.tensor([2., 3.], requires_grad=True)
    b = torch.tensor([6., 4.], requires_grad=True)


.. GENERATED FROM PYTHON SOURCE LINES 106-110

We create another tensor ``Q`` from ``a`` and ``b``.

.. math::
   Q = 3a^3 - b^2

.. GENERATED FROM PYTHON SOURCE LINES 110-114

.. code-block:: default


    Q = 3*a**3 - b**2


.. GENERATED FROM PYTHON SOURCE LINES 115-138

Let's assume ``a`` and ``b`` to be parameters of an NN, and ``Q``
to be the error. In NN training, we want gradients of the error
w.r.t. parameters, i.e.

.. math::
   \frac{\partial Q}{\partial a} = 9a^2

.. math::
   \frac{\partial Q}{\partial b} = -2b


When we call ``.backward()`` on ``Q``, autograd calculates these gradients
and stores them in the respective tensors' ``.grad`` attribute.

We need to explicitly pass a ``gradient`` argument in ``Q.backward()`` because it is a vector.
``gradient`` is a tensor of the same shape as ``Q``, and it represents the
gradient of Q w.r.t. itself, i.e.

.. math::
   \frac{dQ}{dQ} = 1

Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like ``Q.sum().backward()``.


.. GENERATED FROM PYTHON SOURCE LINES 138-142

.. code-block:: default

    external_grad = torch.tensor([1., 1.])
    Q.backward(gradient=external_grad)


.. GENERATED FROM PYTHON SOURCE LINES 143-144

Gradients are now deposited in ``a.grad`` and ``b.grad``

.. GENERATED FROM PYTHON SOURCE LINES 144-150

.. code-block:: default


    # check if collected gradients are correct
    print(9*a**2 == a.grad)
    print(-2*b == b.grad)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    tensor([True, True])
    tensor([True, True])


.. GENERATED FROM PYTHON SOURCE LINES 151-211

Optional Reading - Vector Calculus using ``autograd``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Mathematically, if you have a vector valued function
:math:`\vec{y}=f(\vec{x})`, then the gradient of :math:`\vec{y}` with
respect to :math:`\vec{x}` is a Jacobian matrix :math:`J`:

.. math::


     J
     =
      \left(\begin{array}{cc}
      \frac{\partial \bf{y}}{\partial x_{1}} &
      ... &
      \frac{\partial \bf{y}}{\partial x_{n}}
      \end{array}\right)
     =
     \left(\begin{array}{ccc}
      \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
      \vdots & \ddots & \vdots\\
      \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
      \end{array}\right)

Generally speaking, ``torch.autograd`` is an engine for computing
vector-Jacobian product. That is, given any vector :math:`\vec{v}`, compute the product
:math:`J^{T}\cdot \vec{v}`

If :math:`\vec{v}` happens to be the gradient of a scalar function :math:`l=g\left(\vec{y}\right)`:

.. math::


  \vec{v}
   =
   \left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}

then by the chain rule, the vector-Jacobian product would be the
gradient of :math:`l` with respect to :math:`\vec{x}`:

.. math::


     J^{T}\cdot \vec{v}=\left(\begin{array}{ccc}
      \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
      \vdots & \ddots & \vdots\\
      \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
      \end{array}\right)\left(\begin{array}{c}
      \frac{\partial l}{\partial y_{1}}\\
      \vdots\\
      \frac{\partial l}{\partial y_{m}}
      \end{array}\right)=\left(\begin{array}{c}
      \frac{\partial l}{\partial x_{1}}\\
      \vdots\\
      \frac{\partial l}{\partial x_{n}}
      \end{array}\right)

This characteristic of vector-Jacobian product is what we use in the above example;
``external_grad`` represents :math:`\vec{v}`.


.. GENERATED FROM PYTHON SOURCE LINES 215-263

Computational Graph
~~~~~~~~~~~~~~~~~~~

Conceptually, autograd keeps a record of data (tensors) & all executed
operations (along with the resulting new tensors) in a directed acyclic
graph (DAG) consisting of
`Function <https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function>`__
objects. In this DAG, leaves are the input tensors, roots are the output
tensors. By tracing this graph from roots to leaves, you can
automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

- run the requested operation to compute a resulting tensor, and
- maintain the operation’s *gradient function* in the DAG.

The backward pass kicks off when ``.backward()`` is called on the DAG
root. ``autograd`` then:

- computes the gradients from each ``.grad_fn``,
- accumulates them in the respective tensor’s ``.grad`` attribute, and
- using the chain rule, propagates all the way to the leaf tensors.

Below is a visual representation of the DAG in our example. In the graph,
the arrows are in the direction of the forward pass. The nodes represent the backward functions
of each operation in the forward pass. The leaf nodes in blue represent our leaf tensors ``a`` and ``b``.

.. figure:: /_static/img/dag_autograd.png

.. note::
  **DAGs are dynamic in PyTorch**
  An important thing to note is that the graph is recreated from scratch; after each
  ``.backward()`` call, autograd starts populating a new graph. This is
  exactly what allows you to use control flow statements in your model;
  you can change the shape, size and operations at every iteration if
  needed.

Exclusion from the DAG
^^^^^^^^^^^^^^^^^^^^^^

``torch.autograd`` tracks operations on all tensors which have their
``requires_grad`` flag set to ``True``. For tensors that don’t require
gradients, setting this attribute to ``False`` excludes it from the
gradient computation DAG.

The output tensor of an operation will require gradients even if only a
single input tensor has ``requires_grad=True``.


.. GENERATED FROM PYTHON SOURCE LINES 263-274

.. code-block:: default


    x = torch.rand(5, 5)
    y = torch.rand(5, 5)
    z = torch.rand((5, 5), requires_grad=True)

    a = x + y
    print(f"Does `a` require gradients? : {a.requires_grad}")
    b = x + z
    print(f"Does `b` require gradients?: {b.requires_grad}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Does `a` require gradients? : False
    Does `b` require gradients?: True


.. GENERATED FROM PYTHON SOURCE LINES 275-281

In a NN, parameters that don't compute gradients are usually called **frozen parameters**.
It is useful to "freeze" part of your model if you know in advance that you won't need the gradients of those parameters
(this offers some performance benefits by reducing autograd computations).

In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels.
Let's walk through a small example to demonstrate this. As before, we load a pretrained resnet18 model, and freeze all the parameters.

.. GENERATED FROM PYTHON SOURCE LINES 281-290

.. code-block:: default


    from torch import nn, optim

    model = resnet18(weights=ResNet18_Weights.DEFAULT)

    # Freeze all the parameters in the network
    for param in model.parameters():
        param.requires_grad = False


.. GENERATED FROM PYTHON SOURCE LINES 291-295

Let's say we want to finetune the model on a new dataset with 10 labels.
In resnet, the classifier is the last linear layer ``model.fc``.
We can simply replace it with a new linear layer (unfrozen by default)
that acts as our classifier.

.. GENERATED FROM PYTHON SOURCE LINES 295-298

.. code-block:: default


    model.fc = nn.Linear(512, 10)


.. GENERATED FROM PYTHON SOURCE LINES 299-301

Now all parameters in the model, except the parameters of ``model.fc``, are frozen.
The only parameters that compute gradients are the weights and bias of ``model.fc``.

.. GENERATED FROM PYTHON SOURCE LINES 301-305

.. code-block:: default


    # Optimize only the classifier
    optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)


.. GENERATED FROM PYTHON SOURCE LINES 306-313

Notice although we register all the parameters in the optimizer,
the only parameters that are computing gradients (and hence updated in gradient descent)
are the weights and bias of the classifier.

The same exclusionary functionality is available as a context manager in
`torch.no_grad() <https://pytorch.org/docs/stable/generated/torch.no_grad.html>`__


.. GENERATED FROM PYTHON SOURCE LINES 315-317

--------------


.. GENERATED FROM PYTHON SOURCE LINES 319-325

Further readings:
~~~~~~~~~~~~~~~~~~~

-  `In-place operations & Multithreaded Autograd <https://pytorch.org/docs/stable/notes/autograd.html>`__
-  `Example implementation of reverse-mode autodiff <https://colab.research.google.com/drive/1VpeE6UvEPRz9HmsHh1KS0XxXjYu533EC>`__
-  `Video: PyTorch Autograd Explained - In-depth Tutorial <https://www.youtube.com/watch?v=MswxJw-8PvE>`__


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.875 seconds)


.. _sphx_glr_download_beginner_blitz_autograd_tutorial.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: autograd_tutorial.py <autograd_tutorial.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: autograd_tutorial.ipynb <autograd_tutorial.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_