Automatic differentiation package - torch.autograd

torch.autograd provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions. It requires minimal changes to existing code: you only need to declare the Tensors for which gradients should be computed with the requires_grad=True keyword. As of now, autograd is supported only for floating point Tensor types (half, float, double and bfloat16) and complex Tensor types (cfloat, cdouble).

backward

Computes the sum of gradients of given tensors with respect to graph leaves.

grad

Computes and returns the sum of gradients of outputs with respect to the inputs.
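
As a minimal sketch of the two entry points above (the tensor values and variable names are illustrative assumptions, not part of the original reference), backward accumulates gradients into the .grad attribute of the leaves, while grad returns them directly:

```python
import torch

# Leaf tensor that autograd should track.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# backward(): gradients are accumulated into x.grad.
y = (x ** 2).sum()
y.backward()
grad_from_backward = x.grad.clone()  # d/dx sum(x^2) = 2x

# grad(): gradients are returned as a tuple; x.grad is left untouched.
z = (x ** 3).sum()
(grad_from_grad,) = torch.autograd.grad(z, x)  # d/dx sum(x^3) = 3x^2
```

torch.autograd.grad is usually the more convenient choice when the gradient is needed as a value rather than as state stored on the leaves.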

Functional higher level API

Warning

This API is in beta. Even though the function signatures are very unlikely to change, major performance improvements are planned before we consider this stable.

This section contains the higher level API for the autograd that builds on the basic API above and allows you to compute jacobians, hessians, etc.

This API works with user-provided functions that take only Tensors as input and return only Tensors. If your function takes other arguments that are not Tensors, or Tensors that don’t have requires_grad set, you can use a lambda to capture them. For example, suppose a function f takes three inputs: a Tensor for which we want the Jacobian, another Tensor that should be treated as constant, and a boolean flag, so that it is called as f(input, constant, flag=flag). You can then use it as functional.jacobian(lambda x: f(x, constant, flag=flag), input).
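
The lambda-capture pattern described above can be sketched as follows. The function f, its body, and the sample values are hypothetical and chosen only so that the resulting Jacobian is easy to verify by hand:

```python
import torch
from torch.autograd import functional

def f(x, constant, flag=False):
    # Hypothetical function: elementwise product, optionally exponentiated.
    out = x * constant
    return out.exp() if flag else out

inp = torch.tensor([1.0, 2.0])
constant = torch.tensor([3.0, 4.0])

# Only `x` is differentiated; `constant` and `flag` are captured by the lambda.
jac = functional.jacobian(lambda x: f(x, constant, flag=False), inp)
# With flag=False, f is x * constant, so the Jacobian is diag(constant).
```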

`functional.jacobian`	Function that computes the Jacobian of a given function.
`functional.hessian`	Function that computes the Hessian of a given scalar function.
`functional.vjp`	Function that computes the dot product between a vector `v` and the Jacobian of the given function at the point given by the inputs.
`functional.jvp`	Function that computes the dot product between the Jacobian of the given function at the point given by the inputs and a vector `v`.
`functional.vhp`	Function that computes the dot product between a vector `v` and the Hessian of a given scalar function at the point given by the inputs.
`functional.hvp`	Function that computes the dot product between the Hessian of a given scalar function and a vector `v` at the point given by the inputs.
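
To make the difference between these products concrete, here is a small sketch on functions whose derivatives can be written down by hand. The functions g and h and the vector v are illustrative assumptions; each of vjp, jvp and hvp returns a tuple whose first element is the function output and whose second element is the product:

```python
import torch
from torch.autograd import functional

x = torch.tensor([1.0, 2.0])
v = torch.tensor([1.0, 0.5])

def g(t):
    # Scalar function; its Hessian is diag(6 * t).
    return (t ** 3).sum()

def h(t):
    # Vector function with Jacobian [[t1, t0], [2 * t0, 0]].
    return torch.stack([t[0] * t[1], t[0] ** 2])

hess = functional.hessian(g, x)      # diag([6, 12])
_, hv = functional.hvp(g, x, v)      # Hessian @ v
_, vj = functional.vjp(h, x, v)      # v @ Jacobian
_, jv = functional.jvp(h, x, v)      # Jacobian @ v
```

Because the Jacobian of h is not symmetric, vjp and jvp give different results here, which is a quick way to check which side the vector is applied on.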

Locally disabling gradient computation

See Locally disabling gradient computation for more information on the differences between no-grad and inference mode as well as other related mechanisms that may be confused with the two.

`no_grad`	Context-manager that disables gradient calculation.
`enable_grad`	Context-manager that enables gradient calculation.
`set_grad_enabled`	Context-manager that sets gradient calculation to on or off.
`inference_mode`	Context-manager that enables or disables inference mode.
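
A short sketch of how the four context managers affect the tensors created inside them (variable names are illustrative):

```python
import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():
    y = x * 2                      # not tracked
    with torch.enable_grad():
        z = x * 2                  # tracking re-enabled inside no_grad

with torch.set_grad_enabled(False):
    w = x * 2                      # not tracked

with torch.inference_mode():
    u = x * 2                      # not tracked, and marked as an inference tensor
```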

Default gradient layouts

When a non-sparse param receives a non-sparse gradient during torch.autograd.backward() or torch.Tensor.backward(), param.grad is accumulated as follows.

If param.grad is initially None:

1. If param’s memory is non-overlapping and dense, .grad is created with strides matching param (thus matching param’s layout).
2. Otherwise, .grad is created with rowmajor-contiguous strides.

If param already has a non-sparse .grad attribute:

3. If create_graph=False, backward() accumulates into .grad in-place, which preserves its strides.
4. If create_graph=True, backward() replaces .grad with a new tensor .grad + new grad, which attempts (but does not guarantee) matching the preexisting .grad’s strides.

The default behavior (letting .grads be None before the first backward(), such that their layout is created according to 1 or 2, and retained over time according to 3 or 4) is recommended for best performance. Calls to model.zero_grad() or optimizer.zero_grad() will not affect .grad layouts.

In fact, resetting all .grads to None before each accumulation phase, e.g.:

for _ in range(num_iterations):  # num_iterations is a placeholder
    ...  # forward pass that computes loss
    for param in model.parameters():
        param.grad = None
    loss.backward()

such that they’re recreated according to 1 or 2 every time, is a valid alternative to model.zero_grad() or optimizer.zero_grad() that may improve performance for some networks.

Manual gradient layouts

If you need manual control over .grad’s strides, assign param.grad = a zeroed tensor with desired strides before the first backward(), and never reset it to None. 3 guarantees your layout is preserved as long as create_graph=False. 4 indicates your layout is likely preserved even if create_graph=True.
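
A minimal sketch of manual layout control, assuming a contiguous parameter whose gradient should instead be stored with column-major strides:

```python
import torch

param = torch.randn(4, 3, requires_grad=True)   # contiguous, strides (3, 1)

# Zeroed gradient with the desired strides (1, 4), assigned before the first backward().
param.grad = torch.zeros(3, 4).t()

# create_graph defaults to False, so backward() accumulates into .grad in-place.
(param * 2.0).sum().backward()
grad_strides = param.grad.stride()
```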

In-place operations on Tensors

Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.

In-place correctness checks

All Tensors keep track of in-place operations applied to them, and if the implementation detects that a tensor was saved for backward in one of the functions, but it was modified in-place afterwards, an error will be raised once the backward pass is started. This ensures that if you’re using in-place functions and not seeing any errors, you can be sure that the computed gradients are correct.
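
A small sketch of this check in action. It relies on the fact that exp saves its output for the backward pass; the variable names are illustrative:

```python
import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = a.exp()       # exp() saves its result to compute the gradient later
b.add_(1)         # in-place change invalidates the saved result

inplace_error = None
try:
    b.sum().backward()
except RuntimeError as err:
    # Autograd refuses to compute a gradient from the modified tensor.
    inplace_error = err
```

Replacing b.add_(1) with the out-of-place b = b + 1 avoids the error, because the tensor saved by exp is then left unmodified.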

Variable (deprecated)

Warning

The Variable API has been deprecated: Variables are no longer necessary to use autograd with tensors. Autograd automatically supports Tensors with requires_grad set to True. Below please find a quick guide on what has changed:

Variable(tensor) and Variable(tensor, requires_grad) still work as expected, but they return Tensors instead of Variables.
var.data is the same thing as tensor.data.
Methods such as var.backward(), var.detach(), var.register_hook() now work on tensors with the same method names.

In addition, one can now create tensors with requires_grad=True using factory methods such as torch.randn(), torch.zeros(), torch.ones(), and others like the following:

autograd_tensor = torch.randn((2, 3, 4), requires_grad=True)

Tensor autograd functions

`torch.Tensor.grad`	This attribute is `None` by default and becomes a Tensor the first time a call to `backward()` computes gradients for `self`.
`torch.Tensor.requires_grad`	Is `True` if gradients need to be computed for this Tensor, `False` otherwise.
`torch.Tensor.is_leaf`	All Tensors whose `requires_grad` is `False` are leaf Tensors by convention.
`torch.Tensor.backward`([gradient, …])	Computes the gradient of current tensor w.r.t. graph leaves.
`torch.Tensor.detach`	Returns a new Tensor, detached from the current graph.
`torch.Tensor.detach_`	Detaches the Tensor from the graph that created it, making it a leaf.
`torch.Tensor.register_hook`(hook)	Registers a backward hook.
`torch.Tensor.retain_grad`()	Enables .grad attribute for non-leaf Tensors.
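
The attributes and methods in the table above can be exercised together in a few lines. This is an illustrative sketch; the hook simply records the gradient it receives:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)   # leaf
y = x * 3                                          # non-leaf, created by an op
y.retain_grad()                                    # keep y.grad after backward()

hook_calls = []
handle = y.register_hook(lambda g: hook_calls.append(g.clone()))

y.sum().backward()
handle.remove()                                    # hooks can be removed again

d = y.detach()                                     # same data, cut off from the graph
```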

Function

class torch.autograd.Function(*args, **kwargs)

Records operation history and defines formulas for differentiating ops.

See the Note on extending the autograd engine for more details on how to use this class: https://pytorch.org/docs/stable/notes/extending.html#extending-torch-autograd

Every operation performed on Tensors creates a new function object that performs the computation and records that it happened. The history is retained in the form of a DAG of functions, with edges denoting data dependencies (input <- output). Then, when backward is called, the graph is processed in topological order by calling the backward() method of each Function object and passing the returned gradients on to the next Functions.

Normally, the only way users interact with functions is by creating subclasses and defining new operations. This is a recommended way of extending torch.autograd.

Examples:

>>> import torch
>>> from torch.autograd import Function
>>>
>>> class Exp(Function):
...     @staticmethod
...     def forward(ctx, i):
...         result = i.exp()
...         # d/dx exp(x) = exp(x), so the output is all backward needs.
...         ctx.save_for_backward(result)
...         return result
...
...     @staticmethod
...     def backward(ctx, grad_output):
...         result, = ctx.saved_tensors
...         return grad_output * result
...
>>> # Use it by calling the apply method:
>>> input = torch.randn(3, requires_grad=True)
>>> output = Exp.apply(input)

`Function.backward`	Defines a formula for differentiating the operation.
`Function.forward`	Performs the operation.

Context method mixins

When creating a new Function, the following methods are available to ctx.

`function._ContextMethodMixin.mark_dirty`	Marks given tensors as modified in an in-place operation.
`function._ContextMethodMixin.mark_non_differentiable`	Marks outputs as non-differentiable.
`function._ContextMethodMixin.save_for_backward`	Saves given tensors for a future call to `backward()`.
`function._ContextMethodMixin.set_materialize_grads`	Sets whether to materialize output grad tensors.
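
As a sketch of how two of these ctx methods fit together, the hypothetical Function below returns sorted values along with their integer indices, marks the indices as non-differentiable, and saves them for the backward pass. The class name and test values are assumptions made for illustration:

```python
import torch
from torch.autograd import Function

class SortWithIndices(Function):
    @staticmethod
    def forward(ctx, x):
        values, idx = x.sort()
        ctx.mark_non_differentiable(idx)   # integer indices carry no gradient
        ctx.save_for_backward(idx)
        return values, idx

    @staticmethod
    def backward(ctx, grad_values, grad_idx):
        # grad_idx still has to be accepted, even though it is unused.
        (idx,) = ctx.saved_tensors
        grad_input = torch.zeros_like(grad_values)
        # Route each sorted position's gradient back to its source element.
        grad_input.index_add_(0, idx, grad_values)
        return grad_input

x = torch.tensor([3.0, 1.0, 2.0], requires_grad=True)
values, idx = SortWithIndices.apply(x)
(values * torch.tensor([1.0, 10.0, 100.0])).sum().backward()
```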

Numerical gradient checking

gradcheck

Check gradients computed via small finite differences against analytical gradients w.r.t. tensors in inputs that are of floating point or complex type and with requires_grad=True.

gradgradcheck

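Check gradients of gradients computed via small finite differences against analytical gradients w.r.t. tensors in inputs and grad_outputs that are of floating point or complex type and with requires_grad=True.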
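
A minimal usage sketch for both checks, using torch.sin as the function under test. Double precision inputs are used here because finite differences are typically too noisy in single precision; the tolerance values are illustrative choices:

```python
import torch
from torch.autograd import gradcheck, gradgradcheck

torch.manual_seed(0)
inp = torch.randn(4, dtype=torch.double, requires_grad=True)

# Both functions return True on success and raise an error on a mismatch by default.
first_order_ok = gradcheck(torch.sin, (inp,), eps=1e-6, atol=1e-4)
second_order_ok = gradgradcheck(torch.sin, (inp,))
```

The same calls can be pointed at a custom Function by passing its apply method as the first argument.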

Profiler

Autograd includes a profiler that lets you inspect the cost of different operators inside your model, both on the CPU and GPU. Two modes are implemented at the moment: CPU-only, using profile, and nvprof-based (which registers both CPU and GPU activity), using emit_nvtx.

class torch.autograd.profiler.profile(enabled=True, *, use_cuda=False, record_shapes=False, with_flops=False, profile_memory=False, with_stack=False, use_kineto=False, use_cpu=True)

Context manager that manages autograd profiler state and holds a summary of results. Under the hood it just records events of functions being executed in C++ and exposes those events to Python. You can wrap any code in it, and it will report only the runtime of PyTorch functions. Note: the profiler is thread-local and is automatically propagated into async tasks.

Parameters

enabled (bool, optional) – Setting this to False makes this context manager a no-op.
use_cuda (bool, optional) – Enables timing of CUDA events as well using the cudaEvent API. Adds approximately 4us of overhead to each tensor operation.
record_shapes (bool, optional) – If shapes recording is set, information about input dimensions will be collected. This allows one to see which dimensions have been used under the hood and further group by them using prof.key_averages(group_by_input_shape=True). Please note that shape recording might skew your profiling data. It is recommended to use separate runs with and without shape recording to validate the timing. Most likely the skew will be negligible for bottommost events (in the case of nested function calls). But for higher level functions the total self CPU time might be artificially increased because of the shape collection.
with_flops (bool, optional) – If with_flops is set, the profiler will estimate the FLOPS (floating point operations per second) value using the operator’s input shape and total time. This allows one to estimate the hardware performance. Currently, this option only works for the matrix multiplication and 2D convolution operators.
profile_memory (bool, optional) – track tensor memory allocation/deallocation.
with_stack (bool, optional) – record source information (file and line number) for the ops.
use_kineto (bool, optional) – experimental, enable profiling with Kineto profiler.
use_cpu (bool, optional) – profile CPU events; setting to False requires use_kineto=True and can be used to lower the overhead for GPU-only profiling.

Example

>>> x = torch.randn((1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     for _ in range(100):  # any normal python code, really!
...         y = x ** 2
...         y.backward()
>>> # NOTE: some columns were removed for brevity
>>> print(prof.key_averages().table(sort_by="self_cpu_time_total"))
-----------------------------------  ---------------  ---------------  ---------------
Name                                 Self CPU total   CPU time avg     Number of Calls
-----------------------------------  ---------------  ---------------  ---------------
mul                                  32.048ms         32.048ms         200
pow                                  27.041ms         27.041ms         200
PowBackward0                         9.727ms          55.483ms         100
torch::autograd::AccumulateGrad      9.148ms          9.148ms          100
torch::autograd::GraphRoot           691.816us        691.816us        100
-----------------------------------  ---------------  ---------------  ---------------

`profiler.profile.export_chrome_trace`	Exports an EventList as a Chrome tracing tools file.
`profiler.profile.key_averages`	Averages all function events over their keys.
`profiler.profile.self_cpu_time_total`	Returns total time spent on CPU obtained as a sum of all self times across all the events.
`profiler.profile.total_average`	Averages all events.
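
A brief sketch of how the result methods above might be used after a CPU-only profiling run. The workload and the temporary output path are illustrative assumptions:

```python
import os
import tempfile

import torch
from torch.autograd import profiler

x = torch.randn(8, 8, requires_grad=True)
with profiler.profile() as prof:
    y = (x @ x).sum()
    y.backward()

# Aggregate events by name and render the five most expensive ones as text.
averages = prof.key_averages()
table = averages.table(sort_by="self_cpu_time_total", row_limit=5)

# Write a trace file that can be loaded in the Chrome tracing viewer.
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)
```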

class torch.autograd.profiler.emit_nvtx(enabled=True, record_shapes=False)

Context manager that makes every autograd operation emit an NVTX range.

It is useful when running the program under nvprof:

nvprof --profile-from-start off -o trace_name.prof -- <regular command here>

Unfortunately, there’s no way to force nvprof to flush the data it collected to disk, so for CUDA profiling one has to use this context manager to annotate nvprof traces and wait for the process to exit before inspecting them. Then, either the NVIDIA Visual Profiler (nvvp) can be used to visualize the timeline, or torch.autograd.profiler.load_nvprof() can load the results for inspection, e.g. in a Python REPL.

Parameters

enabled (bool, optional, default=True) – Setting enabled=False makes this context manager a no-op.
record_shapes (bool, optional, default=False) – If record_shapes=True, the nvtx range wrapping each autograd op will append information about the sizes of Tensor arguments received by that op, in the following format: [[arg0.size(0), arg0.size(1), ...], [arg1.size(0), arg1.size(1), ...], ...]. Non-tensor arguments will be represented by []. Arguments will be listed in the order they are received by the backend op. Please note that this order may not match the order in which those arguments were passed on the Python side. Also note that shape recording may increase the overhead of nvtx range creation.

Example

>>> with torch.cuda.profiler.profile():
...     model(x) # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)

Forward-backward correlation

When viewing a profile created using emit_nvtx in the Nvidia Visual Profiler, correlating each backward-pass op with the corresponding forward-pass op can be difficult. To ease this task, emit_nvtx appends sequence number information to the ranges it generates.

During the forward pass, each function range is decorated with seq=<N>. seq is a running counter, incremented each time a new backward Function object is created and stashed for backward. Thus, the seq=<N> annotation associated with each forward function range tells you that if a backward Function object is created by this forward function, the backward object will receive sequence number N. During the backward pass, the top-level range wrapping each C++ backward Function’s apply() call is decorated with stashed seq=<M>. M is the sequence number that the backward object was created with. By comparing stashed seq numbers in backward with seq numbers in forward, you can track down which forward op created each backward Function.

Any functions executed during the backward pass are also decorated with seq=<N>. During default backward (with create_graph=False) this information is irrelevant, and in fact, N may simply be 0 for all such functions. Only the top-level ranges associated with backward Function objects’ apply() methods are useful, as a way to correlate these Function objects with the earlier forward pass.

Double-backward

If, on the other hand, a backward pass with create_graph=True is underway (in other words, if you are setting up for a double-backward), each function’s execution during backward is given a nonzero, useful seq=<N>. Those functions may themselves create Function objects to be executed later during double-backward, just as the original functions in the forward pass did. The relationship between backward and double-backward is conceptually the same as the relationship between forward and backward: The functions still emit current-sequence-number-tagged ranges, the Function objects they create still stash those sequence numbers, and during the eventual double-backward, the Function objects’ apply() ranges are still tagged with stashed seq numbers, which can be compared to seq numbers from the backward pass.

profiler.load_nvprof

Opens an nvprof trace file and parses autograd annotations.

Anomaly detection

class torch.autograd.detect_anomaly

Context-manager that enables anomaly detection for the autograd engine.

This does two things:

Running the forward pass with detection enabled will allow the backward pass to print the traceback of the forward operation that created the failing backward function.
Any backward computation that generates a “nan” value will raise an error.

Warning

This mode should be enabled only for debugging as the different tests will slow down your program execution.

Example

>>> import torch
>>> from torch import autograd
>>> class MyFunc(autograd.Function):
...     @staticmethod
...     def forward(ctx, inp):
...         return inp.clone()
...     @staticmethod
...     def backward(ctx, gO):
...         # Error during the backward pass
...         raise RuntimeError("Some error in backward")
...         return gO.clone()
>>> def run_fn(a):
...     out = MyFunc.apply(a)
...     return out.sum()
>>> inp = torch.rand(10, 10, requires_grad=True)
>>> out = run_fn(inp)
>>> out.backward()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/your/pytorch/install/torch/_tensor.py", line 93, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/your/pytorch/install/torch/autograd/__init__.py", line 90, in backward
        allow_unreachable=True)  # allow_unreachable flag
      File "/your/pytorch/install/torch/autograd/function.py", line 76, in apply
        return self._forward_cls.backward(self, *args)
      File "<stdin>", line 8, in backward
    RuntimeError: Some error in backward
>>> with autograd.detect_anomaly():
...     inp = torch.rand(10, 10, requires_grad=True)
...     out = run_fn(inp)
...     out.backward()
    Traceback of forward call that caused the error:
      File "tmp.py", line 53, in <module>
        out = run_fn(inp)
      File "tmp.py", line 44, in run_fn
        out = MyFunc.apply(a)
    Traceback (most recent call last):
      File "<stdin>", line 4, in <module>
      File "/your/pytorch/install/torch/_tensor.py", line 93, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/your/pytorch/install/torch/autograd/__init__.py", line 90, in backward
        allow_unreachable=True)  # allow_unreachable flag
      File "/your/pytorch/install/torch/autograd/function.py", line 76, in apply
        return self._forward_cls.backward(self, *args)
      File "<stdin>", line 8, in backward
    RuntimeError: Some error in backward

class torch.autograd.set_detect_anomaly(mode)

Context-manager that sets the anomaly detection for the autograd engine on or off.

set_detect_anomaly will enable or disable the autograd anomaly detection based on its argument mode. It can be used as a context-manager or as a function.

See detect_anomaly above for details of the anomaly detection behaviour.

Parameters: mode (bool) – Flag whether to enable anomaly detection (True), or disable (False).
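
The sketch below illustrates the nan check with both usage forms of set_detect_anomaly. The helper nan_producing_loss is a hypothetical function built so that the backward pass of sqrt computes 0 / 0:

```python
import torch

def nan_producing_loss(x):
    # Multiplying by 0 sends a zero gradient into the backward of sqrt,
    # which divides it by 2 * sqrt(0) = 0 and so yields nan.
    return (torch.sqrt(x) * 0).sum()

# Without anomaly detection, the nan silently ends up in .grad.
x = torch.zeros(1, requires_grad=True)
nan_producing_loss(x).backward()
silent_nan = bool(torch.isnan(x.grad).all())

# Context-manager form: the nan raises an error during backward().
raised_in_context = False
with torch.autograd.set_detect_anomaly(True):
    try:
        nan_producing_loss(torch.zeros(1, requires_grad=True)).backward()
    except RuntimeError:
        raised_in_context = True

# Function form: stays in effect until it is switched off again.
raised_after_call = False
torch.autograd.set_detect_anomaly(True)
try:
    nan_producing_loss(torch.zeros(1, requires_grad=True)).backward()
except RuntimeError:
    raised_after_call = True
finally:
    torch.autograd.set_detect_anomaly(False)
```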
