.. role:: hidden
    :class: hidden-section

Automatic Mixed Precision package - torch.cuda.amp
==================================================

.. automodule:: torch.cuda.amp
.. currentmodule:: torch.cuda.amp

``torch.cuda.amp`` provides convenience methods for running networks with
mixed precision, where some operations use the ``torch.float32`` (``float``)
datatype and other operations use ``torch.float16`` (``half``).  Some
operations, like linear layers and convolutions, are much faster in
``float16``.  Other operations, like reductions, often require the dynamic
range of ``float32``.  Networks running in mixed precision try to match each
operation to its appropriate datatype.

.. warning::
    :class:`torch.cuda.amp.GradScaler` is not a complete implementation of
    automatic mixed precision.  :class:`GradScaler` is only useful if you
    manually run regions of your model in ``float16``.  If you aren't sure how
    to choose op precision manually, the master branch and nightly pip/conda
    builds include a context manager that chooses op precision automatically
    wherever it's enabled.  See the
    `master documentation <https://pytorch.org/docs/master/amp.html>`_
    for details.

.. contents:: :local:

.. _gradient-scaling:

Gradient Scaling
^^^^^^^^^^^^^^^^

When training a network with mixed precision, if the forward pass for a
particular op has ``torch.float16`` inputs, the backward pass for that op will
produce ``torch.float16`` gradients.  Gradient values with small magnitudes
may not be representable in ``torch.float16``.  These values will flush to
zero ("underflow"), so the update for the corresponding parameters will be
lost.

To prevent underflow, "gradient scaling" multiplies the network's loss(es) by
a scale factor and invokes a backward pass on the scaled loss(es).  Gradients
flowing backward through the network are then scaled by the same factor.  In
other words, gradient values have a larger magnitude, so they don't flush to
zero.

The parameters' gradients (``.grad`` attributes) should be unscaled before the
optimizer uses them to update the parameters, so the scale factor does not
interfere with the learning rate.

.. autoclass:: GradScaler
    :members:
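
A typical use of :class:`GradScaler` in a training loop follows the pattern
below.  This is a minimal sketch: the model, optimizer, loss function, and
data shown here are placeholders, and in a real mixed-precision script some
regions of the forward pass would be run manually in ``float16``::

    import torch
    from torch.cuda.amp import GradScaler

    # Placeholder model, optimizer, loss, and data for illustration only.
    model = torch.nn.Linear(16, 4).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    data = [(torch.randn(8, 16, device="cuda"),
             torch.randn(8, 4, device="cuda"))
            for _ in range(4)]

    scaler = GradScaler()

    for input, target in data:
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)

        # scale() multiplies the loss by the current scale factor, so
        # backward() produces scaled gradients.
        scaler.scale(loss).backward()

        # step() unscales the gradients of the optimizer's assigned params.
        # If none of them are inf/NaN, it calls optimizer.step(); otherwise,
        # optimizer.step() is skipped for this iteration.
        scaler.step(optimizer)

        # update() adjusts the scale factor for the next iteration.
        scaler.update()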