Quantization
Introduction to Quantization
Quantization refers to techniques for performing computations and storing tensors at lower bitwidths than floating point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating point values. This allows for a more compact model representation and the use of high performance vectorized operations on many hardware platforms. PyTorch supports INT8 quantization; compared to typical FP32 models, this allows for a 4x reduction in the model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computations is typically 2 to 4 times faster than FP32 compute. Quantization is primarily a technique to speed up inference, and only the forward pass is supported for quantized operators.
PyTorch supports multiple approaches to quantizing a deep learning model. In most cases the model is trained in FP32 and then converted to INT8. In addition, PyTorch also supports quantization aware training, which models quantization errors in both the forward and backward passes using fake-quantization modules. Note that the entire computation is carried out in floating point. At the end of quantization aware training, PyTorch provides conversion functions to convert the trained model into lower precision.
At a lower level, PyTorch provides a way to represent quantized tensors and perform operations with them. They can be used to directly construct models that perform all or part of the computation in lower precision. Higher-level APIs are provided that incorporate typical workflows of converting an FP32 model to lower precision with minimal accuracy loss.
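The 4x size figure quoted in the introduction follows from element widths alone: an FP32 value occupies 4 bytes and an INT8 value occupies 1. The sketch below is plain Python arithmetic, not a PyTorch call; the parameter count is a hypothetical number chosen only for illustration, and the small overhead of storing scales and zero points is ignored.

```python
# Bytes per stored element for each representation.
BYTES_PER_ELEMENT = {"fp32": 4, "int8": 1}

def weight_bytes(num_params, dtype):
    """Storage for num_params weights, ignoring scale/zero_point overhead."""
    return num_params * BYTES_PER_ELEMENT[dtype]

num_params = 25_000_000  # hypothetical model size, for illustration only
fp32_size = weight_bytes(num_params, "fp32")
int8_size = weight_bytes(num_params, "int8")
ratio = fp32_size / int8_size
print(fp32_size, int8_size, ratio)
```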
Today, PyTorch supports the following backends for running quantized operators efficiently:
x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations)
ARM CPUs (typically found in mobile/embedded devices)
The corresponding implementation is chosen automatically based on the PyTorch build mode.
Note
PyTorch 1.3 doesn’t provide quantized operator implementations on CUDA yet - this is a direction of future work. Move the model to CPU in order to test the quantized functionality.
Quantization-aware training (through FakeQuantize) supports both CPU and CUDA.
Quantized Tensors
PyTorch supports both per tensor and per channel asymmetric linear quantization. Per tensor means that all the values within the tensor are scaled the same way. Per channel means that along one dimension, typically the channel dimension of a tensor, each slice of the tensor is scaled and offset by a different value (effectively the scale and offset become vectors). This allows for less error in converting tensors to quantized values.
The mapping is performed by converting the floating point tensors using
Q(x, scale, zero_point) = round(x / scale + zero_point)
Note that we ensure that zero in floating point is represented with no error after quantization, thereby ensuring that operations like padding do not cause additional quantization error.
In order to do quantization in PyTorch, we need to be able to represent quantized data in Tensors. A Quantized Tensor allows for storing quantized data (represented as int8/uint8/int32) along with quantization parameters like scale and zero_point. Quantized Tensors allow for many useful operations making quantized arithmetic easy, in addition to allowing for serialization of data in a quantized format.
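As a rough illustration of what a quantized tensor stores, the sketch below quantizes a few floats to the quint8 range and maps them back. It is plain Python rather than the PyTorch API, it assumes the mapping round(x / scale + zero_point) used by the FakeQuantize formula later in this document, and the scale and zero_point values are illustrative choices.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a float to an integer in [qmin, qmax] (quint8 range by default)."""
    q = round(x / scale + zero_point)
    return max(qmin, min(qmax, q))  # saturate values outside the range

def dequantize(q, scale, zero_point):
    """Map the stored integer back to an approximate float."""
    return (q - zero_point) * scale

scale, zero_point = 0.25, 128   # illustrative parameters, not library defaults
values = [-1.0, 0.0, 0.3, 100.0]
q_values = [quantize(v, scale, zero_point) for v in values]
restored = [dequantize(q, scale, zero_point) for q in q_values]
print(q_values)
print(restored)
```

Note that 0.0 maps to the integer zero_point and back to exactly 0.0, while 0.3 is rounded to the nearest representable step and 100.0 saturates at the top of the range.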
Operation coverage
Quantized Tensors support a limited subset of the data manipulation methods of regular full-precision tensors (see the list below).
For NN operators included in PyTorch, we restrict support to:
8 bit weights (data_type = qint8)
8 bit activations (data_type = quint8)
Note that operator implementations currently only support per channel quantization for weights of the conv and linear operators. Furthermore, the minimum and the maximum of the input data are mapped linearly to the minimum and the maximum of the quantized data type such that zero is represented with no quantization error.
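One way to read the min/max mapping described above is sketched below in plain Python. The helper name choose_qparams and the choice to widen the observed range so that it contains 0.0 are assumptions made for this illustration, not a statement of the exact library implementation.

```python
def choose_qparams(min_val, max_val, qmin=0, qmax=255):
    """Affine parameters that map [min_val, max_val] onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero has an exact integer image.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    if max_val == min_val:
        return 1.0, qmin  # degenerate all-zero range; any positive scale works
    scale = (max_val - min_val) / (qmax - qmin)
    # Rounding the zero point to an integer is what makes 0.0 exact.
    zero_point = round(qmin - min_val / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

scale, zero_point = choose_qparams(-1.0, 3.0)
q_zero = round(0.0 / scale + zero_point)    # integer that represents 0.0
zero_back = (q_zero - zero_point) * scale   # dequantized value of that integer
print(scale, zero_point, zero_back)
```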
Additional data types and quantization schemes can be implemented through the custom operator mechanism.
Many operations for quantized tensors are available under the same API as the full float version in torch or torch.nn. Quantized versions of NN modules that perform requantization are available in torch.nn.quantized. Those operations explicitly take output quantization parameters (scale and zero_point) in the operation signature.
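To see why such operations need output quantization parameters, consider adding two quantized values that use different scales. The plain-Python sketch below goes through floats for clarity, whereas real kernels would typically stay in integer arithmetic; all helper names and parameter values are hypothetical.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    return max(qmin, min(qmax, round(x / scale + zero_point)))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

def quantized_add(qa, a_params, qb, b_params, out_scale, out_zero_point):
    """Add two quantized values and requantize the sum to the output parameters."""
    real_sum = dequantize(qa, *a_params) + dequantize(qb, *b_params)
    return quantize(real_sum, out_scale, out_zero_point)

a_params = (0.5, 0)     # (scale, zero_point) of the first input
b_params = (0.25, 10)   # the second input uses a different grid
qa = quantize(3.0, *a_params)
qb = quantize(1.5, *b_params)
q_sum = quantized_add(qa, a_params, qb, b_params, out_scale=0.5, out_zero_point=0)
print(qa, qb, q_sum, dequantize(q_sum, 0.5, 0))
```

The integers qa and qb cannot simply be added, because they live on different grids; the caller has to say which grid the result should use.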
In addition, we also support fused versions corresponding to common fusion patterns that impact quantization, available in torch.nn.intrinsic.quantized.
For quantization aware training, we support modules prepared for quantization aware training in torch.nn.qat and torch.nn.intrinsic.qat.
The current quantized operation list is sufficient to cover typical CNN and RNN models:
Quantized torch.Tensor operations
Operations that are available from the torch namespace or as methods on Tensor for quantized tensors:
quantize_per_tensor() - Convert float tensor to quantized tensor with per-tensor scale and zero point
quantize_per_channel() - Convert float tensor to quantized tensor with per-channel scale and zero point
View-based operations like view(), as_strided(), expand(), flatten(), slice(), python-style indexing, etc. - work as on regular tensor (if quantization is not per-channel)
copy_() - Copies src to self in-place
clone() - Returns a deep copy of the passed-in tensor
dequantize() - Convert quantized tensor to float tensor
equal() - Compares two tensors, returns true if quantization parameters and all integer elements are the same
int_repr() - Returns the underlying integer representation of the quantized tensor
max() - Returns the maximum value of the tensor (reduction only)
mean() - Mean function. Supported variants: reduction, dim, out
min() - Returns the minimum value of the tensor (reduction only)
q_scale() - Returns the scale of the per-tensor quantized tensor
q_zero_point() - Returns the zero_point of the per-tensor quantized tensor
q_per_channel_scales() - Returns the scales of the per-channel quantized tensor
q_per_channel_zero_points() - Returns the zero points of the per-channel quantized tensor
q_per_channel_axis() - Returns the channel axis of the per-channel quantized tensor
relu() - Rectified linear unit (copy)
relu_() - Rectified linear unit (in-place)
resize_() - In-place resize
sort() - Sorts the tensor
topk() - Returns k largest values of a tensor
torch.nn.intrinsic
Fused modules are provided for common patterns in CNNs. Combining several operations together (like convolution and relu) allows for better quantization accuracy.
torch.nn.intrinsic - float versions of the modules, can be swapped with quantized version 1 to 1
  ConvBn2d - Conv2d + BatchNorm
  ConvBnReLU2d - Conv2d + BatchNorm + ReLU
  ConvReLU2d - Conv2d + ReLU
  LinearReLU - Linear + ReLU
torch.nn.intrinsic.qat - versions of layers for quantization-aware training
  ConvBn2d - Conv2d + BatchNorm
  ConvBnReLU2d - Conv2d + BatchNorm + ReLU
  ConvReLU2d - Conv2d + ReLU
  LinearReLU - Linear + ReLU
torch.nn.intrinsic.quantized - quantized version of fused layers for inference (no BatchNorm variants as it’s usually folded into convolution for inference)
  LinearReLU - Linear + ReLU
  ConvReLU2d - 2D Convolution + ReLU
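The remark that BatchNorm is usually folded into the convolution for inference can be made concrete with the standard folding arithmetic. The sketch below is plain Python on a single scalar weight (one output channel) and is not the library's fusion code; the function name and the numeric values are illustrative.

```python
import math

def fold_batchnorm(weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold y = gamma * (w*x + b - mean) / sqrt(var + eps) + beta into (w', b')."""
    factor = gamma / math.sqrt(running_var + eps)
    return weight * factor, (bias - running_mean) * factor + beta

# One output channel with a scalar weight, for brevity.
w, b = 2.0, 0.5
gamma, beta, mean, var = 1.5, 0.1, 0.2, 4.0
w_folded, b_folded = fold_batchnorm(w, b, gamma, beta, mean, var, eps=0.0)

x = 3.0
unfused = gamma * ((w * x + b) - mean) / math.sqrt(var) + beta  # conv, then BN
fused = w_folded * x + b_folded                                  # single affine op
print(unfused, fused)
```

Because the folded layer is a single affine operation, only one set of weights has to be quantized at inference time.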
torch.nn.qat
Layers for the quantization-aware training
torch.quantization
Functions for quantization
  add_observer_() - Adds observer for the leaf modules (if quantization configuration is provided)
  add_quant_dequant() - Wraps the leaf child module using QuantWrapper
  convert() - Converts float module with observers into its quantized counterpart. Must have quantization configuration
  get_observer_dict() - Traverses the module children and collects all observers into a dict
  prepare() - Prepares a copy of a model for quantization
  prepare_qat() - Prepares a copy of a model for quantization aware training
  propagate_qconfig_() - Propagates quantization configurations through the module hierarchy and assigns them to each leaf module
  quantize() - Converts a float module to quantized version
  quantize_dynamic() - Converts a float module to dynamically quantized version
  quantize_qat() - Converts a float module to quantized version used in quantization aware training
  swap_module() - Swaps the module with its quantized counterpart (if quantizable and if it has an observer)
default_eval_fn() - Default evaluation function used by torch.quantization.quantize()
FakeQuantize - Module for simulating the quantization/dequantization at training time
Default Observers. The rest of observers are available from torch.quantization.observer
  default_observer - Same as MinMaxObserver.with_args(reduce_range=True)
  default_weight_observer - Same as MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric)
  Observer - Abstract base class for observers
Quantization configurations
  QConfig - Quantization configuration class
  default_qconfig - Same as QConfig(activation=default_observer, weight=default_weight_observer) (See QConfig)
  default_qat_qconfig - Same as QConfig(activation=default_fake_quant, weight=default_weight_fake_quant) (See QConfig)
  default_dynamic_qconfig - Same as QConfigDynamic(weight=default_weight_observer) (See QConfigDynamic)
  float16_dynamic_qconfig - Same as QConfigDynamic(weight=NoopObserver.with_args(dtype=torch.float16)) (See QConfigDynamic)
Stubs
  DeQuantStub - placeholder module for dequantize() operation in float-valued models
  QuantStub - placeholder module for quantize() operation in float-valued models
  QuantWrapper - wraps the module to be quantized. Inserts the QuantStub and DeQuantStub
Observers for computing the quantization parameters
  MinMaxObserver - Derives the quantization parameters from the running minimum and maximum of the observed tensor inputs (per tensor variant)
  MovingAverageMinMaxObserver - Derives the quantization parameters from the running averages of the minimums and maximums of the observed tensor inputs (per tensor variant)
  PerChannelMinMaxObserver - Derives the quantization parameters from the running minimum and maximum of the observed tensor inputs (per channel variant)
  MovingAveragePerChannelMinMaxObserver - Derives the quantization parameters from the running averages of the minimums and maximums of the observed tensor inputs (per channel variant)
  HistogramObserver - Derives the quantization parameters by creating a histogram of running minimums and maximums.
Observers that do not compute the quantization parameters:
  RecordingObserver - Records all incoming tensors. Used for debugging only.
  NoopObserver - Pass-through observer. Used for situations when there are no quantization parameters (i.e. quantization to float16)
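The difference between the min/max and moving-average observers listed above can be sketched in a few lines of plain Python. These classes are hypothetical stand-ins, not the torch.quantization implementations, and the exponential update rule is an assumption about how the running averages are maintained.

```python
class MinMaxSketch:
    """Tracks the running minimum and maximum of everything it has seen."""
    def __init__(self):
        self.min_val, self.max_val = None, None

    def observe(self, batch):
        lo, hi = min(batch), max(batch)
        self.min_val = lo if self.min_val is None else min(self.min_val, lo)
        self.max_val = hi if self.max_val is None else max(self.max_val, hi)

class MovingAverageMinMaxSketch(MinMaxSketch):
    """Tracks exponential moving averages of the per-batch minimum and maximum."""
    def __init__(self, averaging_constant=0.01):
        super().__init__()
        self.c = averaging_constant

    def observe(self, batch):
        lo, hi = min(batch), max(batch)
        if self.min_val is None:
            self.min_val, self.max_val = lo, hi
        else:
            self.min_val += self.c * (lo - self.min_val)
            self.max_val += self.c * (hi - self.max_val)

batches = [[-1.0, 2.0], [-0.5, 10.0], [-1.5, 2.5]]  # 10.0 acts as an outlier
plain = MinMaxSketch()
smoothed = MovingAverageMinMaxSketch(averaging_constant=0.5)
for batch in batches:
    plain.observe(batch)
    smoothed.observe(batch)
print(plain.min_val, plain.max_val)
print(smoothed.min_val, smoothed.max_val)
```

The plain tracker keeps the outlier forever, while the averaged one lets later batches pull the range back, which changes the scale that would be derived from it.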
torch.nn.quantized
Quantized version of standard NN layers.
Quantize - Quantization layer, used to automatically replace QuantStub
DeQuantize - Dequantization layer, used to replace DeQuantStub
FloatFunctional - Wrapper class to make stateless float operations stateful so that they can be replaced with quantized versions
QFunctional - Wrapper class for quantized versions of stateless operations like torch.add
Conv2d - 2D convolution
Linear - Linear (fully-connected) layer
MaxPool2d - 2D max pooling
ReLU - Rectified linear unit
ReLU6 - Rectified linear unit with cut-off at quantized representation of 6
torch.nn.quantized.dynamic
Layers used in dynamically quantized models (i.e. quantized only on weights)
torch.nn.quantized.functional
Functional versions of quantized NN layers (many of them accept explicit quantization output parameters)
adaptive_avg_pool2d() - 2D adaptive average pooling
avg_pool2d() - 2D average pooling
conv2d() - 2D convolution
interpolate() - Down-/up-sampler
linear() - Linear (fully-connected) op
max_pool2d() - 2D max pooling
relu() - Rectified linear unit
upsample() - Upsampler. Will be deprecated in favor of interpolate()
upsample_bilinear() - Bilinear upsampler. Will be deprecated in favor of interpolate()
upsample_nearest() - Nearest neighbor upsampler. Will be deprecated in favor of interpolate()
Quantized dtypes and quantization schemes
torch.qscheme - Type to describe the quantization scheme of a tensor. Supported types:
  torch.per_tensor_affine - per tensor, asymmetric
  torch.per_channel_affine - per channel, asymmetric
  torch.per_tensor_symmetric - per tensor, symmetric
  torch.per_channel_symmetric - per channel, symmetric
torch.dtype - Type to describe the data. Supported types:
  torch.quint8 - 8-bit unsigned integer
  torch.qint8 - 8-bit signed integer
  torch.qint32 - 32-bit signed integer
Quantization Workflows
PyTorch provides three approaches to quantize models.
1. Post Training Dynamic Quantization: This is the simplest form of quantization to apply, where the weights are quantized ahead of time but the activations are dynamically quantized during inference. This is used for situations where the model execution time is dominated by loading weights from memory rather than computing the matrix multiplications. This is true for LSTM and Transformer type models with small batch size. Applying dynamic quantization to a whole model can be done with a single call to torch.quantization.quantize_dynamic(). See the quantization tutorials.
2. Post Training Static Quantization: This is the most commonly used form of quantization, where the weights are quantized ahead of time and the scale factor and bias for the activation tensors are precomputed based on observing the behavior of the model during a calibration process. Post Training Quantization is typically used when both memory bandwidth and compute savings are important, with CNNs being a typical use case. The general process for doing post training quantization is:
   1. Prepare the model: a. Specify where the activations are quantized and dequantized explicitly by adding QuantStub and DeQuantStub modules. b. Ensure that modules are not reused. c. Convert any operations that require requantization into modules.
   2. Fuse operations like conv + relu or conv + batchnorm + relu together to improve both model accuracy and performance.
   3. Specify the configuration of the quantization methods, such as selecting symmetric or asymmetric quantization and MinMax or L2Norm calibration techniques.
   4. Use torch.quantization.prepare() to insert modules that will observe activation tensors during calibration.
   5. Calibrate the model by running inference against a calibration dataset.
   6. Finally, convert the model itself with the torch.quantization.convert() method. This does several things: it quantizes the weights, computes and stores the scale and bias value to be used with each activation tensor, and replaces key operators with quantized implementations.
   See the quantization tutorials.
3. Quantization Aware Training: In the rare cases where post training quantization does not provide adequate accuracy, training can be done with simulated quantization using torch.quantization.FakeQuantize. Computations will take place in FP32 but with values clamped and rounded to simulate the effects of INT8 quantization. The sequence of steps is very similar.
   Steps (1) and (2) are identical.
   3. Specify the configuration of the fake quantization methods, such as selecting symmetric or asymmetric quantization and MinMax or Moving Average or L2Norm calibration techniques.
   4. Use torch.quantization.prepare_qat() to insert modules that will simulate quantization during training.
   5. Train or fine tune the model.
   6. Identical to step (6) for post training quantization.
   See the quantization tutorials.
While default implementations of observers to select the scale factor and bias based on observed tensor data are provided, developers can provide their own quantization functions. Quantization can be applied selectively to different parts of the model or configured differently for different parts of the model.
We also provide support for per channel quantization for conv2d() and linear().
Quantization workflows work by adding (e.g. adding observers as .observer submodule) or replacing (e.g. converting nn.Conv2d to nn.quantized.Conv2d) submodules in the model’s module hierarchy. It means that the model stays a regular nn.Module-based instance throughout the process and thus can work with the rest of PyTorch APIs.
Model Preparation for Quantization
It is currently necessary to make some modifications to the model definition prior to quantization. This is because quantization currently works on a module by module basis. Specifically, for all quantization techniques, the user needs to:
Convert any operations that require output requantization (and thus have additional parameters) from functionals to module form.
Specify which parts of the model need to be quantized, either by assigning .qconfig attributes on submodules or by specifying qconfig_dict.
For static quantization techniques which quantize activations, the user needs to do the following in addition:
Specify where activations are quantized and dequantized. This is done using QuantStub and DeQuantStub modules.
Use torch.nn.quantized.FloatFunctional to wrap tensor operations that require special handling for quantization into modules. Examples are operations like add and cat which require special handling to determine output quantization parameters.
Fuse modules: combine operations/modules into a single module to obtain higher accuracy and performance. This is done using the torch.quantization.fuse_modules() API, which takes in lists of modules to be fused. We currently support the following fusions: [Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], [Linear, Relu]
torch.quantization
This module implements the functions you call directly to convert your model from FP32 to quantized form. For example, prepare() is used in post training quantization to prepare your model for the calibration step, and convert() actually converts the weights to int8 and replaces the operations with their quantized counterparts. There are other helper functions for things like quantizing the input to your model and performing critical fusions like conv+relu.
Top-level quantization APIs

torch.quantization.quantize(model, run_fn, run_args, mapping=None, inplace=False)
Converts a float model to quantized model.
First it will prepare the model for calibration or training, then it calls run_fn which will run the calibration step or training step, after that we will call convert which will convert the model to a quantized model.
 Parameters
model – input model
run_fn – a function for evaluating the prepared model, can be a function that simply runs the prepared model or a training loop
run_args – positional arguments for run_fn
inplace – carry out model transformations in-place, the original module is mutated
mapping – correspondence between original module types and quantized counterparts
 Returns
Quantized model.

torch.quantization.quantize_dynamic(model, qconfig_spec=None, dtype=torch.qint8, mapping=None, inplace=False)
Converts a float model to dynamic (i.e. weights-only) quantized model.
Replaces specified modules with dynamic weight-only quantized versions and outputs the quantized model.
For simplest usage provide the dtype argument, which can be float16 or qint8. Weight-only quantization by default is performed for layers with large weights size - i.e. Linear and RNN variants.
Fine grained control is possible with qconfig and mapping that act similarly to quantize(). If qconfig is provided, the dtype argument is ignored.
 Parameters
model – input model
qconfig_spec – Either:
  A dictionary that maps from name or type of submodule to quantization configuration; qconfig applies to all submodules of a given module unless qconfig for the submodules are specified (when the submodule already has qconfig attribute). Entries in the dictionary need to be QConfigDynamic instances.
  A set of types and/or submodule names to apply dynamic quantization to, in which case the dtype argument is used to specify the bitwidth
inplace – carry out model transformations in-place, the original module is mutated
mapping – maps type of a submodule to a type of corresponding dynamically quantized version with which the submodule needs to be replaced
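The idea behind dynamic quantization (weights quantized once, activation parameters computed per input at run time) can be sketched without PyTorch. The class below is a hypothetical single-output linear layer in plain Python; it uses a symmetric scheme for both weights and activations purely for brevity, which is an assumption and not necessarily what quantize_dynamic() does internally.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in values to qmax."""
    bound = max(abs(v) for v in values)
    return bound / qmax if bound > 0 else 1.0

def quantize_symmetric(values, scale, qmax=127):
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

class DynamicLinearSketch:
    """Single-output linear layer with int8 weights and on-the-fly activations."""
    def __init__(self, weights, bias):
        # Weights are quantized once, ahead of time.
        self.w_scale = symmetric_scale(weights)
        self.q_weights = quantize_symmetric(weights, self.w_scale)
        self.bias = bias

    def __call__(self, x):
        # The activation scale is derived from this particular input.
        x_scale = symmetric_scale(x)
        q_x = quantize_symmetric(x, x_scale)
        acc = sum(qw * qx for qw, qx in zip(self.q_weights, q_x))  # integer sum
        return acc * self.w_scale * x_scale + self.bias

weights, bias = [0.5, -1.0, 0.25], 0.1
layer = DynamicLinearSketch(weights, bias)
x = [2.0, 1.0, -4.0]
approx = layer(x)
exact = sum(w * v for w, v in zip(weights, x)) + bias
print(approx, exact)
```

No calibration pass is needed in this scheme, since nothing about the activations has to be known before inference.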

torch.quantization.quantize_qat(model, run_fn, run_args, inplace=False)
Do quantization aware training and output a quantized model
 Parameters
model – input model
run_fn – a function for evaluating the prepared model, can be a function that simply runs the prepared model or a training loop
run_args – positional arguments for run_fn
 Returns
Quantized model.

torch.quantization.prepare(model, qconfig_dict=None, inplace=False)
Prepares a copy of the model for quantization calibration or quantization-aware training.
Quantization configuration can be passed as a qconfig_dict or assigned preemptively to individual submodules in the .qconfig attribute.
The model will be attached with observer or fake quant modules, and qconfig will be propagated.
 Parameters
model – input model to be modified in-place
qconfig_dict – dictionary that maps from name or type of submodule to quantization configuration, qconfig applies to all submodules of a given module unless qconfig for the submodules are specified (when the submodule already has qconfig attribute)
inplace – carry out model transformations in-place, the original module is mutated

torch.quantization.prepare_qat(model, mapping={<class 'torch.nn.modules.linear.Linear'>: <class 'torch.nn.qat.modules.linear.Linear'>, <class 'torch.nn.modules.conv.Conv2d'>: <class 'torch.nn.qat.modules.conv.Conv2d'>, <class 'torch.nn.intrinsic.modules.fused.ConvBn2d'>: <class 'torch.nn.intrinsic.qat.modules.conv_fused.ConvBn2d'>, <class 'torch.nn.intrinsic.modules.fused.ConvBnReLU2d'>: <class 'torch.nn.intrinsic.qat.modules.conv_fused.ConvBnReLU2d'>, <class 'torch.nn.intrinsic.modules.fused.ConvReLU2d'>: <class 'torch.nn.intrinsic.qat.modules.conv_fused.ConvReLU2d'>, <class 'torch.nn.intrinsic.modules.fused.LinearReLU'>: <class 'torch.nn.intrinsic.qat.modules.linear_relu.LinearReLU'>}, inplace=False)

torch.quantization.convert(module, mapping=None, inplace=False)
Converts the float module with observers (where we can get quantization parameters) to a quantized module.
 Parameters
module – calibrated module with observers
mapping – a dictionary that maps from float module type to quantized module type, can be overwritten to allow swapping user defined Modules
inplace – carry out model transformations in-place, the original module is mutated

class torch.quantization.QConfig
Describes how to quantize a layer or a part of the network by providing settings (observer classes) for activations and weights respectively.
Note that QConfig needs to contain observer classes (like MinMaxObserver) or a callable that returns instances on invocation, not the concrete observer instances themselves. The quantization preparation function will instantiate observers multiple times for each of the layers.
Observer classes usually have reasonable default arguments, but they can be overwritten with the with_args method (that behaves like functools.partial):
my_qconfig = QConfig(activation=MinMaxObserver.with_args(dtype=torch.qint8),
                     weight=default_observer.with_args(dtype=torch.qint8))

class torch.quantization.QConfigDynamic
Describes how to dynamically quantize a layer or a part of the network by providing settings (observer classes) for weights.
It’s like QConfig, but for dynamic quantization.
Note that QConfigDynamic needs to contain observer classes (like MinMaxObserver) or a callable that returns instances on invocation, not the concrete observer instances themselves. The quantization function will instantiate observers multiple times for each of the layers.
Observer classes usually have reasonable default arguments, but they can be overwritten with the with_args method (that behaves like functools.partial):
my_qconfig = QConfigDynamic(weight=default_observer.with_args(dtype=torch.qint8))
Preparing model for quantization

torch.quantization.fuse_modules(model, modules_to_fuse, inplace=False, fuser_func=<function fuse_known_modules>)
Fuses a list of modules into a single module
Fuses only the following sequences of modules:
conv, bn
conv, bn, relu
conv, relu
linear, relu
All other sequences are left unchanged. For these sequences, replaces the first item in the list with the fused module, replacing the rest of the modules with identity.
 Parameters
model – Model containing the modules to be fused
modules_to_fuse – list of list of module names to fuse. Can also be a list of strings if there is only a single list of modules to fuse.
inplace – bool specifying if fusion happens in place on the model, by default a new model is returned
fuser_func – Function that takes in a list of modules and outputs a list of fused modules of the same length. For example, fuser_func([convModule, BNModule]) returns the list [ConvBNModule, nn.Identity()] Defaults to torch.quantization.fuse_known_modules
 Returns
model with fused modules. A new copy is created if inplace=False.
Examples:
>>> m = myModel()
>>> # m is a module containing the submodules below
>>> modules_to_fuse = [['conv1', 'bn1', 'relu1'], ['submodule.conv', 'submodule.relu']]
>>> fused_m = torch.quantization.fuse_modules(m, modules_to_fuse)
>>> output = fused_m(input)

>>> m = myModel()
>>> # Alternately provide a single list of modules to fuse
>>> modules_to_fuse = ['conv1', 'bn1', 'relu1']
>>> fused_m = torch.quantization.fuse_modules(m, modules_to_fuse)
>>> output = fused_m(input)

class torch.quantization.QuantStub(qconfig=None)
Quantize stub module. Before calibration, this is the same as an observer; it will be swapped for nnq.Quantize in convert.
 Parameters
qconfig – quantization configuration for the tensor, if qconfig is not provided, we will get qconfig from parent modules

class torch.quantization.DeQuantStub
Dequantize stub module. Before calibration, this is the same as identity; it will be swapped for nnq.DeQuantize in convert.

class torch.quantization.QuantWrapper(module)
A wrapper class that wraps the input module, adds QuantStub and DeQuantStub, and surrounds the call to module with calls to the quant and dequant modules.
This is used by the quantization utility functions to add the quant and dequant modules. Before convert, QuantStub is just an observer that observes the input tensor; after convert, QuantStub is swapped for nnq.Quantize, which does the actual quantization. Similarly for DeQuantStub.

torch.quantization.add_quant_dequant(module)
Wrap the leaf child module in QuantWrapper if it has a valid qconfig. Note that this function will modify the children of module in-place, and it can return a new module which wraps the input module as well.
 Parameters
module – input module with qconfig attributes for all the leaf modules that we want to quantize
 Returns
Either the in-place modified module with submodules wrapped in QuantWrapper based on qconfig, or a new QuantWrapper module which wraps the input module. The latter case only happens when the input module is a leaf module and we want to quantize it.
Utility functions

torch.quantization.add_observer_(module)
Add observer for the leaf child of the module.
This function inserts an observer module into every leaf child module that has a valid qconfig attribute.
 Parameters
module – input module with qconfig attributes for all the leaf modules that we want to quantize
 Returns
None, module is modified in-place with added observer modules and forward_hooks

torch.quantization.swap_module(mod, mapping)
Swaps the module if it has a quantized counterpart and it has an observer attached.
 Parameters
mod – input module
mapping – a dictionary that maps from nn module to nnq module
 Returns
The corresponding quantized module of mod

torch.quantization.propagate_qconfig_(module, qconfig_dict=None)
Propagate qconfig through the module hierarchy and assign qconfig attribute on each leaf module
 Parameters
module – input module
qconfig_dict – dictionary that maps from name or type of submodule to quantization configuration, qconfig applies to all submodules of a given module unless qconfig for the submodules are specified (when the submodule already has qconfig attribute)
 Returns
None, module is modified in-place with qconfig attached
Observers

class torch.quantization.Observer(dtype)
Observer base Module. Any observer implementation should derive from this class.
Concrete observers should follow the same API. In forward, they will update the statistics of the observed Tensor. And they should provide a calculate_qparams function that computes the quantization parameters given the collected statistics.

classmethod with_args(**kwargs)
Wrapper around functools.partial that allows chaining.
Often you want to assign it to a class as a class method:
Foo.with_args = classmethod(_with_args)
Foo.with_args(x=1).with_args(y=2)
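A chaining wrapper of this kind can be sketched with the standard library alone. The helper names below (_PartialWrapper, _with_args) mirror the snippet above, but their bodies are an assumption about how such a wrapper could be written, not a copy of the library source; Foo is a hypothetical class.

```python
import functools

class _PartialWrapper:
    """Callable wrapper whose with_args() can be chained."""
    def __init__(self, partial):
        self.partial = partial

    def __call__(self, *args, **kwargs):
        return self.partial(*args, **kwargs)

    def with_args(self, **kwargs):
        return _with_args(self, **kwargs)

def _with_args(cls_or_self, **kwargs):
    # Bind keyword arguments now; construction is deferred until the call.
    return _PartialWrapper(functools.partial(cls_or_self, **kwargs))

class Foo:
    def __init__(self, x=0, y=0):
        self.x, self.y = x, y

Foo.with_args = classmethod(_with_args)

factory = Foo.with_args(x=1).with_args(y=2)
a, b = factory(), factory()   # each call builds a fresh instance
print(a.x, a.y, a is b)
```

This is why a qconfig can hold such a factory rather than an observer instance: every layer can call it to obtain its own observer with the same arguments.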


class torch.quantization.MinMaxObserver(**kwargs)
Default Observer Module. A default implementation of the observer module, only works for per_tensor_affine quantization scheme. The module will record the running minimum and maximum of the observed Tensor, and calculate_qparams will calculate scale and zero_point.

class torch.quantization.PerChannelMinMaxObserver(ch_axis=0, **kwargs)
Per Channel Observer Module. The module will record the running minimum and maximum for each channel of the observed Tensor, and calculate_qparams will calculate scales and zero_points for each channel.

class torch.quantization.MovingAveragePerChannelMinMaxObserver(averaging_constant=0.01, **kwargs)
Per Channel Observer Module. The module will record the running average of max and min value for each channel of the observed Tensor, and calculate_qparams will calculate scales and zero_points for each channel.

class torch.quantization.HistogramObserver(bins=2048, **kwargs)
The module records the running histogram of tensor values along with min/max values. calculate_qparams will calculate scale and zero_point.

class torch.quantization.FakeQuantize(observer=<class 'torch.quantization.observer.MovingAverageMinMaxObserver'>, quant_min=0, quant_max=255, **observer_kwargs)
Simulate the quantize and dequantize operations in training time. The output of this module is given by
x_out = (clamp(round(x / scale + zero_point), quant_min, quant_max) - zero_point) * scale
scale defines the scale factor used for quantization.
zero_point specifies the quantized value to which 0 in floating point maps.
quant_min specifies the minimum allowable quantized value.
quant_max specifies the maximum allowable quantized value.
fake_quant_enable controls the application of fake quantization on tensors; note that statistics can still be updated.
observer_enable controls statistics collection on tensors.
dtype specifies the quantized dtype that is being emulated with fake quantization; allowable values are torch.qint8 and torch.quint8. The values of quant_min and quant_max should be chosen to be consistent with the dtype.
 Parameters
 Variables
~FakeQuantize.observer (Module) – User provided module that collects statistics on the input tensor and provides a method to calculate scale and zero-point.
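The FakeQuantize output formula can be evaluated by hand in plain Python. The sketch below applies it to scalars; the function name and the scale and zero_point values are illustrative, and it makes no attempt to model the observer or the backward pass.

```python
def fake_quantize(x, scale, zero_point, quant_min=0, quant_max=255):
    """Quantize, clamp, then dequantize; the result is still a float."""
    q = round(x / scale + zero_point)
    q = max(quant_min, min(quant_max, q))
    return (q - zero_point) * scale

scale, zero_point = 0.25, 128   # illustrative values
inputs = [0.0, 0.3, -40.0, 100.0]
outputs = [fake_quantize(v, scale, zero_point) for v in inputs]
print(outputs)
```

Values inside the representable range snap to the nearest multiple of scale, and values outside it are clamped to the ends of the range, which is the error that training then learns to tolerate.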
Debugging utilities

torch.quantization.get_observer_dict(mod, target_dict, prefix='')
Traverse the modules and save all observers into dict. This is mainly used for quantization accuracy debug.
 Parameters
mod – the top module we want to save all observers
prefix – the prefix for the current module
target_dict – the dictionary used to save all the observers
torch.nn.intrinsic
This module implements the combined (fused) modules conv + relu which can be then quantized.
torch.nn.intrinsic.qat
This module implements the versions of those fused operations needed for quantization aware training.
ConvBn2d

class torch.nn.intrinsic.qat.ConvBn2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, padding_mode='zeros', eps=1e-05, momentum=0.1, freeze_bn=False, qconfig=None)
A ConvBn2d module is a module fused from Conv2d and BatchNorm2d, attached with FakeQuantize modules for both output activation and weight, used in quantization aware training.
We combined the interface of torch.nn.Conv2d and torch.nn.BatchNorm2d.
Implementation details: https://arxiv.org/pdf/1806.08342.pdf section 3.2.2
Similar to torch.nn.Conv2d, with FakeQuantize modules initialized to default.
 Variables
~ConvBn2d.freeze_bn –
~ConvBn2d.observer – fake quant module for output activation, it’s called observer to align with post training flow
~ConvBn2d.weight_fake_quant – fake quant module for weight
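Fusing Conv2d and BatchNorm2d rests on the fact that, with fixed batch norm statistics, the two layers collapse into a single convolution. The sketch below illustrates that identity using only torch.nn.Conv2d, torch.nn.BatchNorm2d and torch.nn.functional.conv2d. It is not the ConvBn2d implementation, which additionally handles batch statistics during training and applies fake quantization; the layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
bn = nn.BatchNorm2d(8, eps=1e-05, momentum=0.1)

# Give the batch norm non-trivial statistics and affine parameters.
with torch.no_grad():
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-0.5, 0.5)
conv.eval()
bn.eval()  # use the running statistics, as at inference time

with torch.no_grad():
    # Per-output-channel factor gamma / sqrt(running_var + eps).
    factor = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    folded_weight = conv.weight * factor.reshape(-1, 1, 1, 1)
    folded_bias = (conv.bias - bn.running_mean) * factor + bn.bias

    x = torch.randn(2, 3, 16, 16)
    reference = bn(conv(x))                                      # two separate layers
    folded = F.conv2d(x, folded_weight, folded_bias, padding=1)  # one folded convolution

max_abs_diff = (reference - folded).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The difference should be at the level of float32 rounding error, which is why the fused weight is the natural place to attach the weight fake-quantization module.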
ConvBnReLU2d¶

class
torch.nn.intrinsic.qat.
ConvBnReLU2d
(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, padding_mode='zeros', eps=1e-05, momentum=0.1, freeze_bn=False, qconfig=None)[source]¶ A ConvBnReLU2d module is a module fused from Conv2d, BatchNorm2d and ReLU, attached with FakeQuantize modules for both output activation and weight, used in quantization aware training.
We combined the interface of torch.nn.Conv2d, torch.nn.BatchNorm2d and torch.nn.ReLU.
Implementation details: https://arxiv.org/pdf/1806.08342.pdf
Similar to torch.nn.Conv2d, with FakeQuantize modules initialized to default.
 Variables
~ConvBnReLU2d.observer – fake quant module for output activation, it’s called observer to align with post training flow
~ConvBnReLU2d.weight_fake_quant – fake quant module for weight
ConvReLU2d¶

class
torch.nn.intrinsic.qat.
ConvReLU2d
(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', qconfig=None)[source]¶ A ConvReLU2d module is a fused module of Conv2d and ReLU, attached with FakeQuantize modules for both output activation and weight for quantization aware training.
We adopt the same interface as torch.nn.Conv2d.
 Variables
~ConvReLU2d.observer – fake quant module for output activation, it’s called observer to align with post training flow
~ConvReLU2d.weight_fake_quant – fake quant module for weight
LinearReLU¶

class
torch.nn.intrinsic.qat.
LinearReLU
(in_features, out_features, bias=True, qconfig=None)[source]¶ A LinearReLU module fused from Linear and ReLU modules, attached with FakeQuantize modules for output activation and weight, used in quantization aware training.
We adopt the same interface as torch.nn.Linear.
Similar to torch.nn.intrinsic.LinearReLU, with FakeQuantize modules initialized to default.
 Variables
~LinearReLU.observer – fake quant module for output activation, it’s called observer to align with post training flow
~LinearReLU.weight_fake_quant – fake quant module for weight
Examples:
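>>> # QAT modules need a qconfig; the default QAT qconfig is assumed here
>>> qconfig = torch.quantization.default_qat_qconfig
>>> m = nn.intrinsic.qat.LinearReLU(20, 30, qconfig=qconfig)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])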
torch.nn.intrinsic.quantized¶
This module implements the quantized implementations of fused operations like conv + relu.
ConvReLU2d¶

class
torch.nn.intrinsic.quantized.
ConvReLU2d
(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')[source]¶ A ConvReLU2d module is a fused module of Conv2d and ReLU
We adopt the same interface as torch.nn.quantized.Conv2d.
 Variables
Same as torch.nn.quantized.Conv2d
LinearReLU¶

class
torch.nn.intrinsic.quantized.
LinearReLU
(in_features, out_features, bias=True)[source]¶ A LinearReLU module fused from Linear and ReLU modules
We adopt the same interface as torch.nn.quantized.Linear.
 Variables
Same as torch.nn.quantized.Linear
Examples:
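>>> m = nn.intrinsic.quantized.LinearReLU(20, 30)
>>> input = torch.randn(128, 20)
>>> # the quantized module expects a quantized input tensor
>>> input = torch.quantize_per_tensor(input, 1.0, 0, torch.quint8)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])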
torch.nn.qat¶
This module implements versions of the key nn modules Conv2d() and Linear() which run in FP32 but with rounding applied to simulate the effect of INT8 quantization.
Conv2d¶

class
torch.nn.qat.
Conv2d
(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', qconfig=None)[source]¶ A Conv2d module attached with FakeQuantize modules for both output activation and weight, used for quantization aware training.
We adopt the same interface as torch.nn.Conv2d; please see https://pytorch.org/docs/stable/nn.html?highlight=conv2d#torch.nn.Conv2d for documentation.
Similar to torch.nn.Conv2d, with FakeQuantize modules initialized to default.
 Variables
~Conv2d.observer – fake quant module for output activation, it’s called observer to align with post training flow
~Conv2d.weight_fake_quant – fake quant module for weight
Linear¶

class
torch.nn.qat.
Linear
(in_features, out_features, bias=True, qconfig=None)[source]¶ A linear module attached with FakeQuantize modules for both output activation and weight, used for quantization aware training.
We adopt the same interface as torch.nn.Linear; please see https://pytorch.org/docs/stable/nn.html#torch.nn.Linear for documentation.
Similar to torch.nn.Linear, with FakeQuantize modules initialized to default.
 Variables
~Linear.observer – fake quant module for output activation, it’s called observer to align with post training flow
~Linear.weight_fake_quant – fake quant module for weight
torch.nn.quantized¶
This module implements the quantized versions of the nn layers such as Conv2d and ReLU.
Functional interface¶
Functional interface (quantized).

torch.nn.quantized.functional.
relu
(input, inplace=False) → Tensor[source]¶ Applies the rectified linear unit function element-wise. See ReLU for more details.
 Parameters
input – quantized input
inplace – perform the computation in-place

torch.nn.quantized.functional.
linear
(input, weight, bias=None, scale=None, zero_point=None)[source]¶ Applies a linear transformation to the incoming quantized data: $y = xA^T + b$. See Linear.
Note
The current implementation packs weights on every call, which carries a performance penalty. To avoid the overhead, use Linear.
 Parameters
input (Tensor) – Quantized input of type torch.quint8
weight (Tensor) – Quantized weight of type torch.qint8
bias (Tensor) – None or fp32 bias of type torch.float
scale (double) – output scale. If None, derived from the input scale
zero_point (long) – output zero point. If None, derived from the input zero_point
 Shape:
Input: $(N, *, in\_features)$ where * means any number of additional dimensions
Weight: $(out\_features, in\_features)$
Bias: $(out\_features)$
Output: $(N, *, out\_features)$

torch.nn.quantized.functional.
conv2d
(input, weight, bias, stride=1, padding=0, dilation=1, groups=1, padding_mode='zeros', scale=1.0, zero_point=0, dtype=torch.quint8)[source]¶ Applies a 2D convolution over a quantized 2D input composed of several input planes.
See Conv2d for details and output shape.
 Parameters
input – quantized input tensor of shape $(\text{minibatch} , \text{in\_channels} , iH , iW)$
weight – quantized filters of shape $(\text{out\_channels} , \frac{\text{in\_channels}}{\text{groups}} , kH , kW)$
bias – non-quantized bias tensor of shape $(\text{out\_channels})$. The tensor type must be torch.float.
stride – the stride of the convolving kernel. Can be a single number or a tuple (sH, sW). Default: 1
padding – implicit paddings on both sides of the input. Can be a single number or a tuple (padH, padW). Default: 0
dilation – the spacing between kernel elements. Can be a single number or a tuple (dH, dW). Default: 1
groups – split input into groups, $\text{in\_channels}$ should be divisible by the number of groups. Default: 1
padding_mode – the padding mode to use. Only “zeros” is supported for quantized convolution at the moment. Default: “zeros”
scale – quantization scale for the output. Default: 1.0
zero_point – quantization zero_point for the output. Default: 0
dtype – quantization data type to use. Default: torch.quint8
Examples:
>>> from torch.nn.quantized import functional as qF
>>> filters = torch.randn(8, 4, 3, 3, dtype=torch.float)
>>> inputs = torch.randn(1, 4, 5, 5, dtype=torch.float)
>>> bias = torch.randn(8, dtype=torch.float)  # one bias value per output channel
>>>
>>> scale, zero_point = 1.0, 0
>>> dtype_inputs = torch.quint8
>>> dtype_filters = torch.qint8  # weights use qint8, as documented for linear above
>>>
>>> q_filters = torch.quantize_per_tensor(filters, scale, zero_point, dtype_filters)
>>> q_inputs = torch.quantize_per_tensor(inputs, scale, zero_point, dtype_inputs)
>>> # pass scale and zero_point by keyword so they are not taken as stride and padding
>>> qF.conv2d(q_inputs, q_filters, bias, padding=1, scale=scale, zero_point=zero_point)

torch.nn.quantized.functional.
max_pool2d
(input, kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False)[source]¶ Applies a 2D max pooling over a quantized input signal composed of several quantized input planes.
Note
The input quantization parameters are propagated to the output.
See MaxPool2d for details.
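The note about propagated quantization parameters can be checked directly. The following sketch assumes the torch.nn.quantized.functional namespace documented here is importable in the installed release; the scale and zero point are arbitrary illustrative values.

```python
import torch
import torch.nn.functional as F
from torch.nn.quantized import functional as qF

torch.manual_seed(0)
x = torch.randn(1, 2, 4, 4)
q_x = torch.quantize_per_tensor(x, scale=0.1, zero_point=10, dtype=torch.quint8)

q_out = qF.max_pool2d(q_x, kernel_size=2)

# The output reuses the scale and zero point of the input.
same_params = (q_out.q_scale() == q_x.q_scale()
               and q_out.q_zero_point() == q_x.q_zero_point())

# Pooling the quantized tensor matches pooling its dequantized values,
# because dequantization is monotonic and max pooling only selects elements.
matches_float = torch.equal(q_out.dequantize(),
                            F.max_pool2d(q_x.dequantize(), kernel_size=2))

print(same_params, matches_float, tuple(q_out.shape))
```

Because max pooling only selects existing elements, no requantization step is needed and no additional rounding error is introduced.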
ReLU¶

class
torch.nn.quantized.
ReLU
(inplace=False)[source]¶ Applies the quantized rectified linear unit function element-wise:
$\text{ReLU}(x)= \max(x_0, x)$ , where $x_0$ is the zero point.
Please see https://pytorch.org/docs/stable/nn.html#torch.nn.ReLU for more documentation on ReLU.
 Parameters
inplace – (Currently not supported) can optionally do the operation inplace.
 Shape:
Input: $(N, *)$ where * means any number of additional dimensions
Output: $(N, *)$ , same shape as the input
Examples:
>>> m = nn.quantized.ReLU()
>>> input = torch.randn(2)
>>> input = torch.quantize_per_tensor(input, 1.0, 0, dtype=torch.qint32)
>>> output = m(input)
ReLU6¶

class
torch.nn.quantized.
ReLU6
(inplace=False)[source]¶ Applies the element-wise function:
$\text{ReLU6}(x) = \min(\max(x_0, x), q(6))$ , where $x_0$ is the zero_point, and $q(6)$ is the quantized representation of number 6.
 Parameters
inplace – can optionally do the operation in-place. Default: False
 Shape:
Input: $(N, *)$ where * means any number of additional dimensions
Output: $(N, *)$ , same shape as the input
Examples:
>>> m = nn.quantized.ReLU6()
>>> input = torch.randn(2)
>>> input = torch.quantize_per_tensor(input, 1.0, 0, dtype=torch.qint32)
>>> output = m(input)
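The ReLU and ReLU6 formulas operate on the integer representation: real 0 is stored as the zero point, so clamping at the zero point is the same as clamping real values at 0. The sketch below works this through in plain Python. The helper names and the chosen scale and zero point are assumptions for illustration, not the quantized kernels themselves.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Affine quantization of one float to an unsigned 8-bit integer."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


def quantized_relu(q, zero_point):
    # max(x_0, x): the zero point x_0 is the integer that represents real 0.
    return max(zero_point, q)


def quantized_relu6(q, scale, zero_point):
    q6 = quantize(6.0, scale, zero_point)  # q(6), the integer that represents real 6
    return min(max(zero_point, q), q6)


scale, zero_point = 0.25, 16
reals = [-2.0, -0.25, 0.0, 1.5, 7.0]
q_in = [quantize(r, scale, zero_point) for r in reals]

relu_out = [dequantize(quantized_relu(q, zero_point), scale, zero_point) for q in q_in]
relu6_out = [dequantize(quantized_relu6(q, scale, zero_point), scale, zero_point) for q in q_in]

print(relu_out)   # [0.0, 0.0, 0.0, 1.5, 7.0]
print(relu6_out)  # [0.0, 0.0, 0.0, 1.5, 6.0]
```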
Conv2d¶

class
torch.nn.quantized.
Conv2d
(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')[source]¶ Applies a 2D convolution over a quantized input signal composed of several quantized input planes.
For details on input arguments, parameters, and implementation see Conv2d.
Note
Only zeros is supported for the padding_mode argument.
Note
Only torch.quint8 is supported for the input data type.
 Variables
See Conv2d for other attributes.
Examples:
>>> # With square kernels and equal stride
>>> m = nn.quantized.Conv2d(16, 33, 3, stride=2)
>>> # non-square kernels and unequal stride and with padding
>>> m = nn.quantized.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
>>> # non-square kernels and unequal stride and with padding and dilation
>>> m = nn.quantized.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1))
>>> input = torch.randn(20, 16, 50, 100)
>>> # quantize input to quint8, the only supported input data type
>>> q_input = torch.quantize_per_tensor(input, scale=1.0, zero_point=0, dtype=torch.quint8)
>>> output = m(q_input)
FloatFunctional¶

class
torch.nn.quantized.
FloatFunctional
[source]¶ State collector class for float operations.
An instance of this class can be used instead of the torch. prefix for some operations. See example usage below.
Note
This class does not provide a forward hook. Instead, you must use one of the underlying functions (e.g. add).
Valid operation names:
add
cat
mul
add_relu
add_scalar
mul_scalar
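A minimal usage sketch follows, assuming the torch.nn.quantized namespace documented here. The module AddBranches and its attribute skip_add are hypothetical names chosen for illustration. Holding the operation in a module attribute gives the addition a place in the module hierarchy where state can be collected, instead of calling torch.add directly.

```python
import torch
import torch.nn as nn
import torch.nn.quantized


class AddBranches(nn.Module):
    """Hypothetical module that sums two branches through FloatFunctional."""

    def __init__(self):
        super().__init__()
        self.skip_add = nn.quantized.FloatFunctional()

    def forward(self, a, b):
        # Call the named operation; the class provides no forward hook.
        return self.skip_add.add(a, b)


a = torch.tensor([1.0, -2.0, 3.0])
b = torch.tensor([0.5, 0.5, -4.0])

added = AddBranches()(a, b)

f = nn.quantized.FloatFunctional()
added_relu = f.add_relu(a, b)        # add followed by ReLU
concatenated = f.cat([a, b], dim=0)  # same arguments as torch.cat
scaled = f.mul_scalar(a, 2.0)        # multiply by a Python scalar

print(added, added_relu, concatenated, scaled, sep="\n")
```

In float mode the results equal those of the corresponding torch functions.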
QFunctional¶

class
torch.nn.quantized.
QFunctional
[source]¶ Wrapper class for quantized operations.
An instance of this class can be used instead of the torch.ops.quantized prefix. See example usage below.
Note
This class does not provide a forward hook. Instead, you must use one of the underlying functions (e.g. add).
Valid operation names:
add
cat
mul
add_relu
add_scalar
mul_scalar
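A minimal usage sketch follows, assuming the torch.nn.quantized namespace documented here and a build with quantized CPU kernels. It also assumes that a freshly constructed QFunctional produces outputs with scale 1.0 and zero point 0; that default is an assumption about the installed release, not something stated above.

```python
import torch
import torch.nn.quantized

qf = torch.nn.quantized.QFunctional()

# Both operands must already be quantized tensors.
qa = torch.quantize_per_tensor(torch.tensor([1.0, 2.0]), 1.0, 0, torch.quint8)
qb = torch.quantize_per_tensor(torch.tensor([3.0, 4.0]), 1.0, 0, torch.quint8)

# Quantized addition; the result is requantized with the output
# quantization parameters held by the QFunctional instance.
q_sum = qf.add(qa, qb)

print(q_sum.is_quantized)
print(q_sum.dequantize())
```

With the assumed output parameters, the sums 4 and 6 are exactly representable, so the dequantized result matches the float addition.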
Quantize¶

class
torch.nn.quantized.
Quantize
(scale, zero_point, dtype)[source]¶ Quantizes an incoming tensor.
 Parameters
out_scale – scale of the output Quantized Tensor
out_zero_point – zero_point of the output Quantized Tensor
out_dtype – data type of the output Quantized Tensor
 Variables
out_scale, out_zero_point, out_dtype –
Examples:
>>> t = torch.tensor([[1., 1.], [1., 1.]])
>>> scale, zero_point, dtype = 1.0, 2, torch.qint8
>>> qm = Quantize(scale, zero_point, dtype)
>>> qt = qm(t)
>>> print(qt)
tensor([[ 1., 1.],
        [ 1., 1.]], size=(2, 2), dtype=torch.qint8, scale=1.0, zero_point=2)
DeQuantize¶

class
torch.nn.quantized.
DeQuantize
[source]¶ Dequantizes an incoming tensor
Examples:
>>> input = torch.tensor([[1., 1.], [1., 1.]])
>>> scale, zero_point, dtype = 1.0, 2, torch.qint8
>>> qm = Quantize(scale, zero_point, dtype)
>>> quantized_input = qm(input)
>>> dqm = DeQuantize()
>>> dequantized = dqm(quantized_input)
>>> print(dequantized)
tensor([[ 1., 1.],
        [ 1., 1.]], dtype=torch.float32)
Linear¶

class
torch.nn.quantized.
Linear
(in_features, out_features, bias_=True)[source]¶ A quantized linear module with quantized tensor as inputs and outputs. We adopt the same interface as torch.nn.Linear; please see https://pytorch.org/docs/stable/nn.html#torch.nn.Linear for documentation.
Similar to Linear, attributes will be randomly initialized at module creation time and will be overwritten later.
 Variables
~Linear.weight (Tensor) – the non-learnable quantized weights of the module of shape $(\text{out\_features}, \text{in\_features})$.
~Linear.bias (Tensor) – the non-learnable bias of the module of shape $(\text{out\_features})$. If bias is True, the values are initialized to zero.
~Linear.scale – scale parameter of output Quantized Tensor, type: double
~Linear.zero_point – zero_point parameter for output Quantized Tensor, type: long
Examples:
>>> m = nn.quantized.Linear(20, 30)
>>> input = torch.randn(128, 20)
>>> input = torch.quantize_per_tensor(input, 1.0, 0, torch.quint8)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])
torch.nn.quantized.dynamic¶
Linear¶

class
torch.nn.quantized.dynamic.
Linear
(in_features, out_features, bias_=True)[source]¶ A dynamic quantized linear module with floating point tensor as inputs and outputs. We adopt the same interface as torch.nn.Linear; please see https://pytorch.org/docs/stable/nn.html#torch.nn.Linear for documentation.
Similar to torch.nn.Linear, attributes will be randomly initialized at module creation time and will be overwritten later.
 Variables
~Linear.weight (Tensor) – the non-learnable quantized weights of the module, which are of shape $(\text{out\_features}, \text{in\_features})$.
~Linear.bias (Tensor) – the non-learnable bias of the module of shape $(\text{out\_features})$. If bias is True, the values are initialized to zero.
~Linear.scale – scale parameter of weight Quantized Tensor, type: double
~Linear.zero_point – zero_point parameter for weight Quantized Tensor, type: long
Examples:
>>> m = nn.quantized.dynamic.Linear(20, 30)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])