Quantization API Reference¶
torch.quantization¶
This module contains Eager mode quantization APIs.
Top level APIs¶
Quantize the input float model with post training static quantization. 

Converts a float model to a dynamically quantized model.

Do quantization aware training and output a quantized model 

Prepares a copy of the model for quantization calibration or quantization-aware training.

Prepares a copy of the model for quantization calibration or quantization-aware training and converts it to a quantized version.

Converts submodules in input module to a different module according to mapping by calling from_float method on the target module class. 
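The eager-mode static flow behind these entry points can be sketched as prepare, calibrate, convert. This is a minimal sketch, assuming a PyTorch build where the torch.quantization namespace and a quantized CPU engine are available; SmallNet and the random calibration batches are hypothetical stand-ins, not part of the API.

```python
import torch
import torch.nn as nn
import torch.nn.quantized as nnq


class SmallNet(nn.Module):
    """Hypothetical float model with explicit quant/dequant boundaries."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(8, 4)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        return self.dequant(x)


torch.manual_seed(0)
float_model = SmallNet().eval()

# Assumption: use the default qconfig for whichever engine this build selects.
backend = torch.backends.quantized.engine
float_model.qconfig = torch.quantization.get_default_qconfig(backend)

# prepare() returns a copy with observers attached.
prepared = torch.quantization.prepare(float_model)

# Calibration: run representative data so the observers record ranges.
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(16, 8))

# convert() swaps observed modules for their quantized counterparts.
quantized_model = torch.quantization.convert(prepared)
out = quantized_model(torch.randn(2, 8))
```

Because prepare and convert both work on copies by default, float_model is left untouched and can still be used as a floating point baseline.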
Preparing model for quantization¶
Fuses a list of modules into a single module 

Quantize stub module. Before calibration it behaves like an observer; convert swaps it for nnq.Quantize.

Dequantize stub module. Before calibration it behaves like an identity; convert swaps it for nnq.DeQuantize.

A wrapper class that wraps the input module, adds QuantStub and DeQuantStub, and surrounds the call to the module with calls to the quant and dequant modules.

Wraps the leaf child module in QuantWrapper if it has a valid qconfig. Note that this function modifies the children of the module in place, and it can also return a new module that wraps the input module.
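As an illustration of the preparation helpers above, the sketch below fuses a conv + bn + relu sequence and then wraps the result in QuantWrapper. ConvBlock is a hypothetical model, and the attribute names passed to fuse_modules are simply the ones that model defines.

```python
import torch
import torch.nn as nn
import torch.nn.intrinsic as nni


class ConvBlock(nn.Module):
    """Hypothetical conv -> batch norm -> relu block."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


torch.manual_seed(0)
model = ConvBlock().eval()  # fusion for inference expects eval mode
x = torch.randn(1, 3, 16, 16)
reference = model(x)

# Fuse the three named submodules; a fused copy is returned by default.
fused = torch.quantization.fuse_modules(model, [["conv", "bn", "relu"]])

# Surround the fused model with QuantStub / DeQuantStub.
wrapped = torch.quantization.QuantWrapper(fused)
wrapped_out = wrapped(x)  # stubs pass data through before calibration
```

After fusion the first listed submodule holds the fused module and the remaining ones are replaced by identities, so the forward method of the model does not need to change.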
Utility functions¶
Add observer for the leaf child of the module. 

Swaps the module if it has a quantized counterpart and it has an observer attached. 

Propagate qconfig through the module hierarchy and assign qconfig attribute on each leaf module 

Default evaluation function. Takes a torch.utils.data.Dataset or a list of input Tensors and runs the model on the dataset.

Traverse the modules and save all observers into dict. 
torch.quantization.quantize_fx¶
This module contains FX graph mode quantization APIs (prototype).
Prepare a model for post training static quantization 

Prepare a model for quantization aware training 

Convert a calibrated or trained model to a quantized model 

Fuse modules such as conv+bn and conv+bn+relu. The model must be in eval mode.
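A minimal sketch of the fusion entry point, assuming fuse_fx is importable from torch.quantization.quantize_fx in the installed version (the FX APIs are marked prototype and their signatures have changed between releases). TinyConvNet is a hypothetical model.

```python
import torch
import torch.fx
import torch.nn as nn
from torch.quantization.quantize_fx import fuse_fx


class TinyConvNet(nn.Module):
    """Hypothetical symbolically traceable model."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


torch.manual_seed(0)
model = TinyConvNet()

# A few training-mode passes so BatchNorm has non-trivial running statistics.
with torch.no_grad():
    for _ in range(3):
        model(torch.randn(4, 3, 16, 16))

model.eval()  # fusion requires eval mode
x = torch.randn(1, 3, 16, 16)
reference = model(x)

fused = fuse_fx(model)  # traces the model and returns a fused GraphModule
fused_out = fused(x)
```

Unlike eager-mode fusion, no list of module names is passed here; the patterns are found on the traced graph.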
torch.quantization.observer¶
This module contains observers which are used to collect statistics about the values observed during calibration (PTQ) or training (QAT).
Base observer Module. 

Observer module for computing the quantization parameters based on the running min and max values. 

Observer module for computing the quantization parameters based on the moving average of the min and max values. 

Observer module for computing the quantization parameters based on the running per channel min and max values. 

Observer module for computing the quantization parameters based on the running per channel min and max values. 

The module records the running histogram of tensor values along with min/max values. 

Observer that doesn’t do anything and just passes its configuration to the quantized module’s 

The module is mainly for debug and records the tensor values during runtime. 

Observer that doesn’t do anything and just passes its configuration to the quantized module’s 

Returns the state dict corresponding to the observer stats. 

Given input model and a state_dict containing model observer stats, load the stats back into the model. 

Default placeholder observer, usually used for quantization to torch.float16. 

Default debug-only observer. 
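The sketch below shows how a running min/max observer might be used on its own: it is called on tensors to record statistics and then asked for quantization parameters. The input values are made up for illustration.

```python
import torch
from torch.quantization import MinMaxObserver

# Per-tensor affine observer targeting unsigned 8-bit values.
observer = MinMaxObserver(dtype=torch.quint8, qscheme=torch.per_tensor_affine)

# Each call updates the running min and max; the input is returned unchanged.
observer(torch.tensor([-1.0, 0.5]))
observer(torch.tensor([0.0, 3.0]))

# Observed range is [-1, 3], so scale = 4 / 255 and zero_point = 64.
scale, zero_point = observer.calculate_qparams()
```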

torch.quantization.fake_quantize¶
This module implements modules which are used to perform fake quantization during QAT.
Base fake quantize module. Any fake quantize implementation should derive from this class. 

Simulate the quantize and dequantize operations in training time. 

Simulate quantize and dequantize with fixed quantization parameters in training time. 

Fused module that is used to observe the input tensor (compute min/max), compute scale/zero_point and fake_quantize the tensor. 

Disable fake quantization for this module, if applicable. 

Enable fake quantization for this module, if applicable. 

Disable observation for this module, if applicable. 

Enable observation for this module, if applicable. 
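A minimal sketch of the enable/disable helpers, assuming default_fake_quant is available as a FakeQuantize factory in the installed version. The helpers are applied with Module.apply so that they reach every fake quantize module in a model; here the "model" is a single fake quantize module.

```python
import torch

torch.manual_seed(0)

# Assumption: default_fake_quant() builds a FakeQuantize module with a
# moving-average min/max observer.
fq = torch.quantization.default_fake_quant()
x = torch.randn(4, 8)

# Observes x, then quantizes and dequantizes it in floating point.
y_fake = fq(x)
max_error = (y_fake - x).abs().max().item()
step = fq.scale.item()

# With fake quantization disabled the input passes through unchanged.
fq.apply(torch.quantization.disable_fake_quant)
y_off = fq(x)

# Re-enable fake quantization and freeze the statistics instead.
fq.apply(torch.quantization.enable_fake_quant)
fq.apply(torch.quantization.disable_observer)
y_frozen = fq(x)
```

For inputs inside the observed range, the rounding error of the fake-quantized output stays within half a quantization step.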
torch.quantization.qconfig¶
This module defines QConfig objects which are used to configure quantization settings for individual ops.
Describes how to quantize a layer or a part of the network by providing settings (observer classes) for activations and weights respectively. 
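As a sketch of what such a configuration looks like, the example below builds a QConfig from observer factories. The specific dtype and qscheme choices are illustrative assumptions, not recommended settings.

```python
import torch
from torch.quantization import MinMaxObserver, QConfig

# QConfig takes observer factories (classes or with_args partials),
# not observer instances.
my_qconfig = QConfig(
    activation=MinMaxObserver.with_args(dtype=torch.quint8),
    weight=MinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_tensor_symmetric
    ),
)

# Observers are instantiated later by calling the factories.
act_observer = my_qconfig.activation()
weight_observer = my_qconfig.weight()
```

A qconfig is typically attached to a model or submodule as its qconfig attribute before the model is prepared.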
torch.nn.intrinsic¶
This module implements the combined (fused) modules conv + relu which can then be quantized.
This is a sequential container which calls the Conv1d and ReLU modules. 

This is a sequential container which calls the Conv2d and ReLU modules. 

This is a sequential container which calls the Conv3d and ReLU modules. 

This is a sequential container which calls the Linear and ReLU modules. 

This is a sequential container which calls the Conv 1d and Batch Norm 1d modules. 

This is a sequential container which calls the Conv 2d and Batch Norm 2d modules. 

This is a sequential container which calls the Conv 3d and Batch Norm 3d modules. 

This is a sequential container which calls the Conv 1d, Batch Norm 1d, and ReLU modules. 

This is a sequential container which calls the Conv 2d, Batch Norm 2d, and ReLU modules. 

This is a sequential container which calls the Conv 3d, Batch Norm 3d, and ReLU modules. 

This is a sequential container which calls the BatchNorm 2d and ReLU modules. 

This is a sequential container which calls the BatchNorm 3d and ReLU modules. 
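The fused containers can also be constructed directly. The sketch below, with arbitrary layer sizes, shows that ConvReLU2d behaves as a sequential container over the modules it is given.

```python
import torch
import torch.nn as nn
import torch.nn.intrinsic as nni

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3)
relu = nn.ReLU()

# The fused container simply calls conv and then relu.
fused = nni.ConvReLU2d(conv, relu)

x = torch.randn(1, 3, 16, 16)
fused_out = fused(x)
expected = relu(conv(x))
```

In floating point the container adds no behavior of its own; its purpose is to mark the pair so that it can later be swapped for a single quantized module.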
torch.nn.intrinsic.qat¶
This module implements the versions of those fused operations needed for quantization aware training.
A LinearReLU module fused from Linear and ReLU modules, attached with FakeQuantize modules for weight, used in quantization aware training. 

A ConvBn1d module is a module fused from Conv1d and BatchNorm1d, attached with FakeQuantize modules for weight, used in quantization aware training. 

A ConvBnReLU1d module is a module fused from Conv1d, BatchNorm1d and ReLU, attached with FakeQuantize modules for weight, used in quantization aware training. 

A ConvBn2d module is a module fused from Conv2d and BatchNorm2d, attached with FakeQuantize modules for weight, used in quantization aware training. 

A ConvBnReLU2d module is a module fused from Conv2d, BatchNorm2d and ReLU, attached with FakeQuantize modules for weight, used in quantization aware training. 

A ConvReLU2d module is a fused module of Conv2d and ReLU, attached with FakeQuantize modules for weight for quantization aware training. 

A ConvBn3d module is a module fused from Conv3d and BatchNorm3d, attached with FakeQuantize modules for weight, used in quantization aware training. 

A ConvBnReLU3d module is a module fused from Conv3d, BatchNorm3d and ReLU, attached with FakeQuantize modules for weight, used in quantization aware training. 

A ConvReLU3d module is a fused module of Conv3d and ReLU, attached with FakeQuantize modules for weight for quantization aware training. 

torch.nn.intrinsic.quantized¶
This module implements the quantized implementations of fused operations like conv + relu. There are no BatchNorm variants, since BatchNorm is usually folded into the convolution for inference.
A BNReLU2d module is a fused module of BatchNorm2d and ReLU 

A BNReLU3d module is a fused module of BatchNorm3d and ReLU 

A ConvReLU1d module is a fused module of Conv1d and ReLU 

A ConvReLU2d module is a fused module of Conv2d and ReLU 

A ConvReLU3d module is a fused module of Conv3d and ReLU 

A LinearReLU module fused from Linear and ReLU modules 
torch.nn.intrinsic.quantized.dynamic¶
This module implements the quantized dynamic implementations of fused operations like linear + relu.
A LinearReLU module fused from Linear and ReLU modules that can be used for dynamic quantization. 
torch.nn.qat¶
This module implements versions of the key nn modules Conv2d() and Linear() which run in FP32 but with rounding applied to simulate the effect of INT8 quantization.
A Conv2d module attached with FakeQuantize modules for weight, used for quantization aware training. 

A Conv3d module attached with FakeQuantize modules for weight, used for quantization aware training. 

A linear module attached with FakeQuantize modules for weight, used for quantization aware training. 
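These modules are normally not constructed by hand. A minimal sketch, assuming the "fbgemm" QAT qconfig is available in the installed version, shows prepare_qat swapping a float Linear for its torch.nn.qat counterpart.

```python
import torch
import torch.nn as nn
import torch.nn.qat as nnqat

torch.manual_seed(0)

# prepare_qat expects a model in training mode.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU()).train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")

# Returns a copy in which supported modules carry FakeQuantize modules.
qat_model = torch.quantization.prepare_qat(model)

# The forward pass still runs in FP32, with fake-quantized weights.
out = qat_model(torch.randn(2, 8))
```

After training, the QAT model would be switched to eval mode and passed to convert to obtain the quantized model.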
torch.nn.qat.dynamic¶
This module implements versions of the key nn modules such as Linear() which run in FP32 but with rounding applied to simulate the effect of INT8 quantization and will be dynamically quantized during inference.
A linear module attached with FakeQuantize modules for weight, used for dynamic quantization aware training. 
torch.nn.quantized¶
This module implements the quantized versions of the nn layers such as torch.nn.Conv2d and torch.nn.ReLU.
Applies the elementwise function: 

This is the quantized version of 

This is the quantized equivalent of 

This is the quantized equivalent of 

This is the quantized equivalent of 

This is the quantized version of 

This is the quantized version of 

Applies a 1D convolution over a quantized input signal composed of several quantized input planes. 

Applies a 2D convolution over a quantized input signal composed of several quantized input planes. 

Applies a 3D convolution over a quantized input signal composed of several quantized input planes. 

Applies a 1D transposed convolution operator over an input image composed of several input planes. 

Applies a 2D transposed convolution operator over an input image composed of several input planes. 

Applies a 3D transposed convolution operator over an input image composed of several input planes. 

A quantized Embedding module with quantized packed weights as inputs. 

A quantized EmbeddingBag module with quantized packed weights as inputs. 

State collector class for float operations. 

Module that replaces the FloatFunctional module before FX graph mode quantization, since activation_post_process will be inserted in the top level module directly. 

Wrapper class for quantized operations. 

A quantized linear module with quantized tensor as inputs and outputs. 

This is the quantized version of 

This is the quantized version of 

This is the quantized version of 

This is the quantized version of 

This is the quantized version of 
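As an illustration of the state collector for float operations, the sketch below routes an addition through FloatFunctional instead of using the + operator. ResidualAdd is a hypothetical module; before any quantization step the result matches plain addition.

```python
import torch
import torch.nn as nn
import torch.nn.quantized as nnq


class ResidualAdd(nn.Module):
    """Hypothetical block that adds two tensors via FloatFunctional."""

    def __init__(self):
        super().__init__()
        # Being a module, it can carry an observer for the add output.
        self.skip_add = nnq.FloatFunctional()

    def forward(self, a, b):
        return self.skip_add.add(a, b)


block = ResidualAdd()
a = torch.tensor([1.0, 2.0])
b = torch.tensor([0.5, -2.0])
out = block(a, b)
```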
torch.nn.quantized.functional¶
This module implements the quantized versions of the functional layers such as torch.nn.functional.conv2d and torch.nn.functional.relu. Note: relu() supports quantized inputs.
Applies 2D average-pooling operation in $kH \times kW$ regions by step size $sH \times sW$ steps. 

Applies 3D average-pooling operation in $kD \times kH \times kW$ regions by step size $sD \times sH \times sW$ steps. 

Applies a 2D adaptive average pooling over a quantized input signal composed of several quantized input planes. 

Applies a 3D adaptive average pooling over a quantized input signal composed of several quantized input planes. 

Applies a 1D convolution over a quantized 1D input composed of several input planes. 

Applies a 2D convolution over a quantized 2D input composed of several input planes. 

Applies a 3D convolution over a quantized 3D input composed of several input planes. 

Down/up samples the input to either the given 

Applies a linear transformation to the incoming quantized data: $y = xA^T + b$. 

Applies a 1D max pooling over a quantized input signal composed of several quantized input planes. 

Applies a 2D max pooling over a quantized input signal composed of several quantized input planes. 

Applies the quantized CELU function elementwise. 

Applies elementwise, $\text{LeakyReLU}(x) = \max(0, x) + \text{negative\_slope} * \min(0, x)$ 

This is the quantized version of 

This is the quantized version of 

Applies the quantized version of the threshold function elementwise: 

This is the quantized version of 

This is the quantized version of 

float(input, min_, max_) -> Tensor 

Upsamples the input to either the given 

Upsamples the input, using bilinear upsampling. 

Upsamples the input, using nearest neighbours’ pixel values. 
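The note that relu() supports quantized inputs can be illustrated with a small sketch. The scale and zero point are arbitrary values chosen so that every input is exactly representable.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, 0.0, 0.5, 2.0])

# Quantize to unsigned 8-bit: q = round(x / 0.25) + 8
qx = torch.quantize_per_tensor(x, scale=0.25, zero_point=8, dtype=torch.quint8)

# The regular functional relu accepts the quantized tensor directly.
qy = F.relu(qx)

# Negative values clamp to the zero point, i.e. to 0.0 after dequantizing.
y = qy.dequantize()
```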
torch.nn.quantized.dynamic¶
Dynamically quantized Linear, LSTM, LSTMCell, GRUCell, and RNNCell.
A dynamic quantized linear module with floating point tensor as inputs and outputs. 

A dynamic quantized LSTM module with floating point tensor as inputs and outputs. 

Applies a multilayer gated recurrent unit (GRU) RNN to an input sequence. 

An Elman RNN cell with tanh or ReLU nonlinearity. 

A long short-term memory (LSTM) cell. 

A gated recurrent unit (GRU) cell 
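These modules are usually obtained by converting a float model rather than by direct construction. A minimal sketch, assuming a quantized CPU engine is available; Tagger is a hypothetical model.

```python
import torch
import torch.nn as nn
import torch.nn.quantized.dynamic as nnqd


class Tagger(nn.Module):
    """Hypothetical sequence model: LSTM followed by a linear layer."""

    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=8, hidden_size=16)
        self.fc = nn.Linear(16, 4)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out)


torch.manual_seed(0)
float_model = Tagger().eval()

# Swap the listed module types for dynamically quantized versions.
dyn_model = torch.quantization.quantize_dynamic(
    float_model, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)

x = torch.randn(5, 2, 8)  # (seq_len, batch, features)
out = dyn_model(x)        # inputs and outputs stay floating point
```

No calibration step is needed here, because activation ranges are determined at run time.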
Quantized dtypes and quantization schemes¶
Note that operator implementations currently only support per channel quantization for weights of the conv and linear operators. Furthermore, the input data is mapped linearly to the quantized data and vice versa as follows:
$\begin{aligned} \text{Quantization:}&\\ &Q_\text{out} = \text{clamp}(x_\text{input}/s+z, Q_\text{min}, Q_\text{max})\\ \text{Dequantization:}&\\ &x_\text{out} = (Q_\text{input}-z)*s \end{aligned}$
where $\text{clamp}(.)$ is the same as clamp() while the scale $s$ and zero point $z$ are then computed as described in MinMaxObserver, specifically:
$\begin{aligned} \text{if Symmetric:}&\\ &s = 2 \max(|x_\text{min}|, x_\text{max}) / \left( Q_\text{max} - Q_\text{min} \right) \\ &z = \begin{cases} 0 & \text{if dtype is qint8} \\ 128 & \text{otherwise} \end{cases}\\ \text{Otherwise:}&\\ &s = \left( x_\text{max} - x_\text{min} \right ) / \left( Q_\text{max} - Q_\text{min} \right ) \\ &z = Q_\text{min} - \text{round}(x_\text{min} / s) \end{aligned}$
where $[x_\text{min}, x_\text{max}]$ denotes the range of the input data while $Q_\text{min}$ and $Q_\text{max}$ are respectively the minimum and maximum values of the quantized dtype.
Note that the choice of $s$ and $z$ implies that zero is represented with no quantization error whenever zero is within the range of the input data or symmetric quantization is being used.
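The mapping can be worked through in plain Python. This is a sketch of the formulas only, not the library implementation; the explicit round() in quantize and the abs() on the minimum in the symmetric case are assumptions made so that the results are integers and the range covers negative inputs.

```python
def affine_qparams(x_min, x_max, q_min, q_max):
    """Asymmetric case: s = (x_max - x_min) / (Q_max - Q_min)."""
    s = (x_max - x_min) / (q_max - q_min)
    z = q_min - round(x_min / s)
    return s, z


def symmetric_qparams(x_min, x_max, q_min, q_max, signed):
    """Symmetric case: zero point is fixed by the dtype."""
    s = 2 * max(abs(x_min), x_max) / (q_max - q_min)
    z = 0 if signed else 128
    return s, z


def quantize(x, s, z, q_min, q_max):
    """Q = clamp(round(x / s + z), Q_min, Q_max)."""
    return int(min(max(round(x / s + z), q_min), q_max))


def dequantize(q, s, z):
    """x = (Q - z) * s."""
    return (q - z) * s


# Unsigned 8-bit, observed input range [-1, 3].
s, z = affine_qparams(-1.0, 3.0, 0, 255)
q_zero = quantize(0.0, s, z, 0, 255)      # zero maps onto the zero point
x_zero = dequantize(q_zero, s, z)         # and comes back as exactly 0.0
q_high = quantize(10.0, s, z, 0, 255)     # out-of-range values are clamped
q_low = quantize(-5.0, s, z, 0, 255)

# Signed 8-bit, symmetric, same observed range.
s_sym, z_sym = symmetric_qparams(-1.0, 3.0, -128, 127, signed=True)
x_zero_sym = dequantize(quantize(0.0, s_sym, z_sym, -128, 127), s_sym, z_sym)
```

With the range [-1, 3] the asymmetric zero point is 64, so the real value 0.0 is stored as the integer 64 and recovered without error, which is the property stated above.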
Additional data types and quantization schemes can be implemented through the custom operator mechanism.
torch.qscheme: type to describe the quantization scheme of a tensor. Supported types:
  torch.per_tensor_affine: per tensor, asymmetric
  torch.per_channel_affine: per channel, asymmetric
  torch.per_tensor_symmetric: per tensor, symmetric
  torch.per_channel_symmetric: per channel, symmetric
torch.dtype: type to describe the data. Supported types:
  torch.quint8: 8-bit unsigned integer
  torch.qint8: 8-bit signed integer
  torch.qint32: 32-bit signed integer
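The scheme and dtype of a quantized tensor can be inspected directly. The sketch below uses arbitrary scales and zero points chosen so that the integer representation is easy to follow.

```python
import torch

x = torch.tensor([[-1.0, 0.0], [0.5, 2.0]])

# Per tensor: one scale and zero point for the whole tensor.
q_tensor = torch.quantize_per_tensor(
    x, scale=0.25, zero_point=8, dtype=torch.quint8
)

# Per channel: one scale and zero point per slice along `axis`.
q_channel = torch.quantize_per_channel(
    x,
    scales=torch.tensor([0.25, 0.5]),
    zero_points=torch.tensor([0, 0]),
    axis=0,
    dtype=torch.qint8,
)

tensor_scheme = q_tensor.qscheme()    # torch.per_tensor_affine
channel_scheme = q_channel.qscheme()  # torch.per_channel_affine
```

Calling int_repr() on either tensor returns the underlying integer values, and dequantize() maps them back to floating point.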