Quantization Accuracy Debugging

This document provides high-level strategies for improving quantization accuracy. If a quantized model shows error relative to the original model, the error falls into one of three categories:

data insensitive error - caused by intrinsic model quantization error; a large portion of the input data has large error
data sensitive error - caused by outlier input data; a small portion of the input data has large error
implementation error - the quantized kernel does not match the reference implementation

Data insensitive error

General tips

For post-training quantization (PTQ), ensure that the data you are calibrating with is representative of your dataset. For example, for a classification problem a general guideline is to have multiple samples in every category, and the overall number of samples should be at least 100. There is no penalty for calibrating with more data other than calibration time.
If your model has Conv-BN or Linear-BN patterns, consider fusing them. If you are using FX graph mode quantization, this is done automatically by the workflow. If you are using Eager mode quantization, you can do this manually with the torch.ao.quantization.fuse_modules API.
Increase the dtype precision of the problematic ops. Usually, fp32 has the highest accuracy, followed by fp16, then dynamically quantized int8, then statically quantized int8.
1. Note: this is trading off performance for accuracy.
2. Note: availability of kernels per dtype per op can vary by backend.
3. Note: dtype conversions add an additional performance cost. For example, fp32_op -> quant -> int8_op -> dequant -> fp32_op -> quant -> int8_op -> dequant will have a performance penalty compared to fp32_op -> fp32_op -> quant -> int8_op -> int8_op -> dequant because of a higher number of required dtype conversions.
If you are using PTQ, consider using quantization-aware training (QAT) to recover some of the accuracy loss from quantization.
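As an illustration of the manual fusion tip for Eager mode, the following is a minimal sketch using torch.ao.quantization.fuse_modules. The SmallNet model and its submodule names (conv, bn, relu) are hypothetical and exist only for this example; the sketch assumes the model is put in eval mode before fusing, as is usual for inference-time Conv-BN fusion.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import fuse_modules


class SmallNet(nn.Module):
    """Hypothetical model with a Conv-BN-ReLU pattern (illustration only)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


torch.manual_seed(0)
model = SmallNet()
# One training-mode pass so BatchNorm has non-trivial running statistics.
model(torch.randn(4, 3, 16, 16))
model.eval()  # fuse for inference in eval mode

# Names in the list are the attribute names of the submodules to fuse.
# By default a fused copy is returned and the original model is untouched.
fused = fuse_modules(model, [["conv", "bn", "relu"]])

# The fused model should be numerically close to the unfused one.
x = torch.randn(2, 3, 16, 16)
with torch.no_grad():
    max_diff = (model(x) - fused(x)).abs().max().item()
print(type(fused.conv).__name__, type(fused.bn).__name__, max_diff)
```

After fusion, the first module in each group holds the fused op and the remaining modules in the group are replaced with identity modules, so the module names in the model stay valid.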

Int8 quantization tips

If you are using per-tensor weight quantization, consider using per-channel weight quantization.
If you are doing inference on fbgemm, ensure that you set the reduce_range argument to False if your CPU is Cooperlake or newer, and to True otherwise.
Audit the input activation distribution variation across different samples. If this variation is high, the layer may be suitable for dynamic quantization but not static quantization.
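To see why per-channel weight quantization can help, here is a small pure-Python sketch, not PyTorch's implementation. It assumes symmetric int8 quantization and a hypothetical two-channel weight whose channels differ greatly in magnitude; all helper names are made up for the illustration.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to qmax."""
    return max(abs(v) for v in values) / qmax


def fake_quantize(values, scale, qmin=-128, qmax=127):
    """Round to the int8 grid, clamp, and map back to float."""
    return [max(qmin, min(qmax, round(v / scale))) * scale for v in values]


def max_abs_error(values, scale):
    """Largest round-trip error for `values` under the given scale."""
    deq = fake_quantize(values, scale)
    return max(abs(a - b) for a, b in zip(values, deq))


# Hypothetical weight: one row per output channel.
weight = [
    [0.011, -0.004, 0.007, -0.009],  # small-magnitude channel
    [2.5, -1.75, 0.9, -3.0],         # large-magnitude channel
]
small_channel = weight[0]

# Per-tensor: one scale for everything, dominated by the large channel.
per_tensor_scale = symmetric_scale([v for row in weight for v in row])
per_tensor_err = max_abs_error(small_channel, per_tensor_scale)

# Per-channel: each row gets its own scale.
per_channel_err = max_abs_error(small_channel, symmetric_scale(small_channel))

print(f"small channel error, per-tensor:  {per_tensor_err:.6f}")
print(f"small channel error, per-channel: {per_channel_err:.6f}")
```

In this toy case the shared per-tensor scale is coarser than every value in the small channel, so that channel is rounded to zero; a per-channel scale avoids this.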

Data sensitive error

If you are using static quantization and a small portion of your input data results in high quantization error, you can try the following:

Adjust your calibration dataset to make it more representative of your inference dataset.
Manually inspect (using Numeric Suite) which layers have high quantization error. For these layers, consider leaving them in floating point or adjusting the observer settings to choose a better scale and zero_point.
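The effect of an outlier input under static quantization can be sketched in plain Python. This is an illustration under stated assumptions, not the Numeric Suite API: the scale and zero_point come from a simple min/max rule over a hypothetical calibration range, and error is reported as a signal-to-quantization-noise ratio (SQNR) in dB. All names and values are made up for the example.

```python
import math


def affine_qparams(lo, hi, qmin=0, qmax=255):
    """Min/max rule: choose scale and zero_point so [lo, hi] covers the uint8 range."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = qmin - round(lo / scale)
    return scale, zero_point


def fake_quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Quantize with clamping, then dequantize back to float."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


def sqnr_db(ref, test):
    """Signal-to-quantization-noise ratio in dB; higher means less error."""
    signal = sum(r * r for r in ref)
    noise = sum((r - t) ** 2 for r, t in zip(ref, test))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)


# Hypothetical calibration data that only covered the range [-1, 1].
scale, zero_point = affine_qparams(-1.0, 1.0)

typical = [-0.8, 0.1, 0.5, 0.9]   # inside the calibrated range
outlier = [-0.8, 0.1, 0.5, 6.0]   # 6.0 is clamped to roughly 1.0

typical_sqnr = sqnr_db(typical, fake_quantize(typical, scale, zero_point))
outlier_sqnr = sqnr_db(outlier, fake_quantize(outlier, scale, zero_point))
print(f"typical sample SQNR: {typical_sqnr:.1f} dB")
print(f"outlier sample SQNR: {outlier_sqnr:.1f} dB")
```

The outlier sample saturates at the edge of the calibrated range, so its SQNR drops sharply while typical samples are unaffected. Widening the calibration data to include such inputs trades some resolution on typical samples for less clamping on outliers.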

Implementation error

If you are using PyTorch quantization with your own backend, you may see differences between the reference implementation of an operation (such as dequant -> op_fp32 -> quant) and the quantized implementation (such as op_int8) of the op on the target hardware. This could mean one of two things:

The differences (usually small) are expected due to specific behavior of the target kernel on the target hardware compared to fp32/cpu. An example of this is accumulating in an integer dtype. Unless the kernel guarantees bitwise equivalence with the reference implementation, this is expected.
The kernel on the target hardware has an accuracy issue. In this case, reach out to the kernel developer.
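One way to tell the two cases apart is to compare both paths on many random inputs and look at the difference in quantized levels. The sketch below is a pure-Python toy, not a real backend kernel: reference_dot follows dequant -> op_fp32 -> quant, while integer_dot is a hypothetical kernel that accumulates in integers and requantizes with a fixed-point multiplier. The quantization parameters and the tolerance of one quantized level are assumptions chosen for illustration.

```python
import random


def reference_dot(qx, qw, sx, zx, sw, sy, zy):
    """Reference path: dequantize, compute in floating point, quantize."""
    x = [(q - zx) * sx for q in qx]
    w = [q * sw for q in qw]  # symmetric weights, zero_point = 0
    y = sum(a * b for a, b in zip(x, w))
    return max(0, min(255, round(y / sy) + zy))


def integer_dot(qx, qw, sx, zx, sw, sy, zy, shift=31):
    """Hypothetical integer kernel: integer accumulate, fixed-point requantize."""
    acc = sum((a - zx) * b for a, b in zip(qx, qw))
    m0 = round(sx * sw / sy * (1 << shift))       # fixed-point multiplier
    y = (acc * m0 + (1 << (shift - 1))) >> shift  # multiply, round half up, shift
    return max(0, min(255, y + zy))


random.seed(0)
sx, zx, sw, sy, zy = 0.02, 128, 0.01, 0.5, 128  # assumed quantization parameters

mismatches = 0
max_diff = 0
for _ in range(1000):
    qx = [random.randint(0, 255) for _ in range(64)]
    qw = [random.randint(-127, 127) for _ in range(64)]
    ref = reference_dot(qx, qw, sx, zx, sw, sy, zy)
    out = integer_dot(qx, qw, sx, zx, sw, sy, zy)
    diff = abs(ref - out)
    mismatches += int(diff != 0)
    max_diff = max(max_diff, diff)

print(f"mismatching outputs: {mismatches}, max difference in levels: {max_diff}")
```

In this toy setup the two paths can differ only through rounding, so any disagreement is at most one quantized level. Differences much larger than that on a real kernel would be a reason to suspect an accuracy issue rather than expected numerical behavior.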

Numerical Debugging Tooling (prototype)

Warning

Numerical debugging tooling is an early prototype and subject to change.

torch.ao.ns._numeric_suite - Eager mode numeric suite
torch.ao.ns._numeric_suite_fx - FX numeric suite
