
torchao.quantization

Main Quantization APIs

quantize_

Converts the weights of linear modules in the model according to config; the model is modified in place.
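A minimal usage sketch, assuming a recent torchao release where a config object such as Int8WeightOnlyConfig is passed directly to quantize_ (hardware and dtype requirements vary by config):

    import torch
    from torchao.quantization import quantize_, Int8WeightOnlyConfig

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))

    # Replaces the weights of the nn.Linear modules with int8 weight-only
    # quantized tensors; the model is modified in place.
    quantize_(model, Int8WeightOnlyConfig())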

autoquant

Autoquantization is a process which identifies the fastest way to quantize each layer of a model over some set of potential qtensor subclasses.
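A sketch of the usual autoquant flow, following the pattern shown in the torchao README; the compile flags, device, and dtype choices below are illustrative:

    import torch
    from torchao.quantization import autoquant

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
    model = autoquant(torch.compile(model, mode="max-autotune"))

    # Running representative inputs lets autoquant benchmark the candidate
    # quantized layouts for each layer and keep the fastest one.
    model(torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16))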

Quantization APIs for quantize_

int4_weight_only

alias of Int4WeightOnlyConfig

int8_weight_only

alias of Int8WeightOnlyConfig

int8_dynamic_activation_int4_weight

alias of Int8DynamicActivationInt4WeightConfig

int8_dynamic_activation_int8_weight

alias of Int8DynamicActivationInt8WeightConfig

uintx_weight_only

alias of UIntXWeightOnlyConfig

gemlite_uintx_weight_only

alias of GemliteUIntXWeightOnlyConfig

intx_quantization_aware_training

alias of IntXQuantizationAwareTrainingConfig

float8_weight_only

alias of Float8WeightOnlyConfig

float8_dynamic_activation_float8_weight

alias of Float8DynamicActivationFloat8WeightConfig

float8_static_activation_float8_weight

alias of Float8StaticActivationFloat8WeightConfig

fpx_weight_only

alias of FPXWeightOnlyConfig
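The lowercase names above are aliases of the corresponding config classes, so calling one constructs the matching config object for quantize_. A small sketch (group_size is shown only as an illustrative keyword argument):

    from torchao.quantization import quantize_, int4_weight_only, Int4WeightOnlyConfig

    # int4_weight_only is an alias of Int4WeightOnlyConfig, so calling it
    # builds the same config object the class constructor would.
    config = int4_weight_only(group_size=128)
    assert isinstance(config, Int4WeightOnlyConfig)

    # quantize_(model, config)  # applied exactly like any other config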

Quantization Primitives

choose_qparams_affine

Parameters: input – fp32, bf16, or fp16 input Tensor

choose_qparams_affine_with_min_max

A variant of the choose_qparams_affine() operator that takes min_val and max_val directly instead of deriving them from the input.

choose_qparams_affine_floatx

quantize_affine

Parameters: input – original float32, float16, or bfloat16 Tensor

quantize_affine_floatx

Quantizes a high-precision float32 tensor to a low-precision floating point value and converts the result to an unpacked floating point format such as 00SEEEMM (for fp6_e3m2), where S is the sign bit, E an exponent bit, and M a mantissa bit.
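A hand-worked illustration of the unpacked 00SEEEMM layout (plain Python, not a call into torchao, assuming the conventional exponent bias of 2**(E-1) - 1 = 3 for fp6_e3m2):

    # Encode 1.5 = 1.10b * 2**0 into fp6_e3m2: 1 sign bit, 3 exponent bits, 2 mantissa bits.
    sign, exp_field, mantissa = 0, 0 + 3, 0b10      # exponent field = unbiased exponent + bias
    byte = (sign << 5) | (exp_field << 2) | mantissa
    assert byte == 0b00001110                       # layout: 00 S EEE MM

    # Decode back: (1 + mantissa/4) * 2**(exp_field - 3) == 1.5
    assert (1 + mantissa / 4) * 2 ** (exp_field - 3) == 1.5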

dequantize_affine

Parameters: input – quantized Tensor; its dtype should match the dtype argument
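A round-trip sketch tying together choose_qparams_affine, quantize_affine, and dequantize_affine. The argument order follows the docstrings summarized above but should be checked against the installed torchao version:

    import torch
    from torchao.quantization import (
        MappingType,
        choose_qparams_affine,
        quantize_affine,
        dequantize_affine,
    )

    x = torch.randn(4, 8)
    block_size = (1, 8)  # one scale/zero_point per row

    # Derive quantization parameters, quantize to int8, then dequantize back to float.
    scale, zero_point = choose_qparams_affine(x, MappingType.SYMMETRIC, block_size, torch.int8)
    xq = quantize_affine(x, block_size, scale, zero_point, torch.int8)
    xdq = dequantize_affine(xq, block_size, scale, zero_point, torch.int8)

    print((x - xdq).abs().max())  # small quantization error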

dequantize_affine_floatx

choose_qparams_and_quantize_affine_hqq

fake_quantize_affine

General fake quantize op for quantization-aware training (QAT).

fake_quantize_affine_cachemask

General fake quantize op for quantization-aware training (QAT).

safe_int_mm

Performs a safe integer matrix multiplication, dispatching to different paths for torch.compile, cuBLAS, and fallback cases.
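A small sketch of safe_int_mm on int8 inputs; the import path is assumed from this page, and which backend path is taken depends on the device and compile mode:

    import torch
    from torchao.quantization import safe_int_mm

    a = torch.randint(-128, 128, (32, 64), dtype=torch.int8)
    b = torch.randint(-128, 128, (64, 16), dtype=torch.int8)

    c = safe_int_mm(a, b)           # int8 x int8 matmul accumulated in int32
    assert c.dtype == torch.int32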

int_scaled_matmul

Performs scaled integer matrix multiplication.

MappingType

How a floating point number is mapped to an integer.
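An illustration of how symmetric and asymmetric mappings typically differ when choosing quantization parameters for an int8 target (a generic sketch, not torchao's exact implementation):

    import torch

    x = torch.randn(128)
    qmin, qmax = -128, 127

    # SYMMETRIC: scale is taken from the maximum magnitude; zero_point stays at 0.
    sym_scale = x.abs().max() / qmax

    # ASYMMETRIC: scale spans the full [min, max] range and zero_point shifts it.
    asym_scale = (x.max() - x.min()) / (qmax - qmin)
    asym_zero_point = qmin - torch.round(x.min() / asym_scale)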

ZeroPointDomain

Enum that indicates whether zero_point is in the integer domain or the floating point domain.

TorchAODType

Placeholder for dtypes that do not exist in PyTorch core yet.

Other

to_linear_activation_quantized

swap_linear_with_smooth_fq_linear

Replaces linear layers in the model with their SmoothFakeDynamicallyQuantizedLinear equivalents.

smooth_fq_linear_to_inference

Prepares the model for inference by calculating the smoothquant scale for each SmoothFakeDynamicallyQuantizedLinear layer.
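A sketch of the SmoothQuant flow using these two helpers; the model and calibration inputs are illustrative, and both calls are assumed to modify the model in place:

    import torch
    from torchao.quantization import (
        swap_linear_with_smooth_fq_linear,
        smooth_fq_linear_to_inference,
    )

    model = torch.nn.Sequential(torch.nn.Linear(512, 512))
    swap_linear_with_smooth_fq_linear(model)    # insert SmoothFakeDynamicallyQuantizedLinear

    # Calibration: run representative inputs so activation statistics are observed.
    with torch.no_grad():
        for _ in range(8):
            model(torch.randn(4, 512))

    smooth_fq_linear_to_inference(model)        # freeze the smoothquant scales for inference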
