
torchao.quantization

Main Quantization APIs

quantize_

Converts the weights of linear modules in the model according to config; the model is modified in place.
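A minimal usage sketch, assuming a recent torchao release where a config object such as Int8WeightOnlyConfig is passed directly to quantize_ (hardware and dtype requirements vary by config):

    import torch
    from torchao.quantization import quantize_, Int8WeightOnlyConfig

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024))

    # Replaces the weights of the nn.Linear modules with int8 weight-only
    # quantized tensors; the model is modified in place.
    quantize_(model, Int8WeightOnlyConfig())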

autoquant

Autoquantization is a process which identifies the fastest way to quantize each layer of a model over some set of potential qtensor subclasses.
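A sketch of the usual autoquant flow, following the pattern shown in the torchao README; the compile flags, device, and dtype choices below are illustrative:

    import torch
    from torchao.quantization import autoquant

    model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
    model = autoquant(torch.compile(model, mode="max-autotune"))

    # Running representative inputs lets autoquant benchmark the candidate
    # quantized layouts for each layer and keep the fastest one.
    model(torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16))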

Quantization APIs for quantize_

int4_weight_only

alias of Int4WeightOnlyConfig

int8_weight_only

alias of Int8WeightOnlyConfig

int8_dynamic_activation_int4_weight

alias of Int8DynamicActivationInt4WeightConfig

int8_dynamic_activation_int8_weight

alias of Int8DynamicActivationInt8WeightConfig

uintx_weight_only

alias of UIntXWeightOnlyConfig

gemlite_uintx_weight_only

alias of GemliteUIntXWeightOnlyConfig

intx_quantization_aware_training

alias of IntXQuantizationAwareTrainingConfig

float8_weight_only

alias of Float8WeightOnlyConfig

float8_dynamic_activation_float8_weight

alias of Float8DynamicActivationFloat8WeightConfig

float8_static_activation_float8_weight

alias of Float8StaticActivationFloat8WeightConfig

fpx_weight_only

alias of FPXWeightOnlyConfig
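The lowercase names above are aliases of the corresponding config classes, so calling one constructs the matching config object for quantize_. A small sketch (group_size is shown only as an illustrative keyword argument):

    from torchao.quantization import quantize_, int4_weight_only, Int4WeightOnlyConfig

    # int4_weight_only is an alias of Int4WeightOnlyConfig, so calling it
    # builds the same config object the class constructor would.
    config = int4_weight_only(group_size=128)
    assert isinstance(config, Int4WeightOnlyConfig)

    # quantize_(model, config)  # applied exactly like any other config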

Quantization Primitives

choose_qparams_affine

Parameters: input – fp32, bf16, or fp16 input Tensor

choose_qparams_affine_with_min_max

A variant of the choose_qparams_affine() operator that takes min_val and max_val directly instead of deriving them from the input.

choose_qparams_affine_floatx

quantize_affine

Parameters: input – original float32, float16, or bfloat16 Tensor

quantize_affine_floatx

Quantizes a high-precision float32 tensor to a low-precision floating point value and converts the result to an unpacked floating point format such as 00SEEEMM (for fp6_e3m2), where S is the sign bit, E an exponent bit, and M a mantissa bit.
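A hand-worked illustration of the unpacked 00SEEEMM layout (plain Python, not a call into torchao, assuming the conventional exponent bias of 2**(E-1) - 1 = 3 for fp6_e3m2):

    # Encode 1.5 = 1.10b * 2**0 into fp6_e3m2: 1 sign bit, 3 exponent bits, 2 mantissa bits.
    sign, exp_field, mantissa = 0, 0 + 3, 0b10      # exponent field = unbiased exponent + bias
    byte = (sign << 5) | (exp_field << 2) | mantissa
    assert byte == 0b00001110                       # layout: 00 S EEE MM

    # Decode back: (1 + mantissa/4) * 2**(exp_field - 3) == 1.5
    assert (1 + mantissa / 4) * 2 ** (exp_field - 3) == 1.5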

dequantize_affine

Parameters: input – quantized Tensor; its dtype should match the dtype argument
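A round-trip sketch tying together choose_qparams_affine, quantize_affine, and dequantize_affine. The argument order follows the docstrings summarized above but should be checked against the installed torchao version:

    import torch
    from torchao.quantization import (
        MappingType,
        choose_qparams_affine,
        quantize_affine,
        dequantize_affine,
    )

    x = torch.randn(4, 8)
    block_size = (1, 8)  # one scale/zero_point per row

    # Derive quantization parameters, quantize to int8, then dequantize back to float.
    scale, zero_point = choose_qparams_affine(x, MappingType.SYMMETRIC, block_size, torch.int8)
    xq = quantize_affine(x, block_size, scale, zero_point, torch.int8)
    xdq = dequantize_affine(xq, block_size, scale, zero_point, torch.int8)

    print((x - xdq).abs().max())  # small quantization error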

dequantize_affine_floatx

choose_qparams_and_quantize_affine_hqq

fake_quantize_affine

General fake quantize op for quantization-aware training (QAT).

fake_quantize_affine_cachemask

General fake quantize op for quantization-aware training (QAT).

safe_int_mm

Performs a safe integer matrix multiplication, dispatching to different paths for torch.compile, cuBLAS, and fallback cases.
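A small sketch of safe_int_mm on int8 inputs; the import path is assumed from this page, and which backend path is taken depends on the device and compile mode:

    import torch
    from torchao.quantization import safe_int_mm

    a = torch.randint(-128, 128, (32, 64), dtype=torch.int8)
    b = torch.randint(-128, 128, (64, 16), dtype=torch.int8)

    c = safe_int_mm(a, b)           # int8 x int8 matmul accumulated in int32
    assert c.dtype == torch.int32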

int_scaled_matmul

Performs scaled integer matrix multiplication.

MappingType

How a floating point number is mapped to an integer.
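An illustration of how symmetric and asymmetric mappings typically differ when choosing quantization parameters for an int8 target (a generic sketch, not torchao's exact implementation):

    import torch

    x = torch.randn(128)
    qmin, qmax = -128, 127

    # SYMMETRIC: scale is taken from the maximum magnitude; zero_point stays at 0.
    sym_scale = x.abs().max() / qmax

    # ASYMMETRIC: scale spans the full [min, max] range and zero_point shifts it.
    asym_scale = (x.max() - x.min()) / (qmax - qmin)
    asym_zero_point = qmin - torch.round(x.min() / asym_scale)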

ZeroPointDomain

Enum that indicates whether zero_point is in the integer domain or the floating point domain.

TorchAODType

Placeholder for dtypes that do not exist in PyTorch core yet.

Other

to_linear_activation_quantized

swap_linear_with_smooth_fq_linear

Replaces linear layers in the model with their SmoothFakeDynamicallyQuantizedLinear equivalents.

smooth_fq_linear_to_inference

Prepares the model for inference by calculating the smoothquant scale for each SmoothFakeDynamicallyQuantizedLinear layer.
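A sketch of the SmoothQuant flow using these two helpers; the model and calibration inputs are illustrative, and both calls are assumed to modify the model in place:

    import torch
    from torchao.quantization import (
        swap_linear_with_smooth_fq_linear,
        smooth_fq_linear_to_inference,
    )

    model = torch.nn.Sequential(torch.nn.Linear(512, 512))
    swap_linear_with_smooth_fq_linear(model)    # insert SmoothFakeDynamicallyQuantizedLinear

    # Calibration: run representative inputs so activation statistics are observed.
    with torch.no_grad():
        for _ in range(8):
            model(torch.randn(4, 512))

    smooth_fq_linear_to_inference(model)        # freeze the smoothquant scales for inference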
