Quantized Operations for XLA device (Experimental feature)¶
This document outlines how to utilize quantized operations to enable quantization on XLA devices.
XLA quantized ops offer a high-level abstraction for quantized operations (e.g., blockwise int4 quantized matrix multiplication). They are analogous to quantized CUDA kernels in the CUDA ecosystem, providing similar functionality and performance benefits within the XLA framework.
NOTE: Currently this is classified as an experimental feature. Its API specifics will change in the next (2.5) release.
How to use:¶
XLA quantized operations can be used as a torch op, or as a torch.nn.Module that wraps the torch op. These two options give model developers the flexibility to choose the best way to integrate XLA quantized ops into their solution.
Both the torch op and the nn.Module are compatible with torch.compile(backend='openxla').
Call XLA quantized op in model code¶
Users can call XLA quantized ops in the same way as calling other regular PyTorch ops. This provides maximum flexibility in integrating XLA quantized ops into their applications. The quantized ops work in both eager mode and Dynamo, with regular PyTorch CPU tensors and XLA tensors.
Note Please check the docstring of the quantized ops for the layout of the quantized weights.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_quantized_matmul
N_INPUT_FEATURES=10
N_OUTPUT_FEATURES=20
x = torch.randn((3, N_INPUT_FEATURES), dtype=torch.bfloat16)
w_int = torch.randint(-128, 127, (N_OUTPUT_FEATURES, N_INPUT_FEATURES), dtype=torch.int8)
scaler = torch.randn((N_OUTPUT_FEATURES,), dtype=torch.bfloat16)
# Call with torch CPU tensor (for debugging purposes)
matmul_output = torch.ops.xla.quantized_matmul(x, w_int, scaler)
device = xm.xla_device()
x_xla = x.to(device)
w_int_xla = w_int.to(device)
scaler_xla = scaler.to(device)
# Call with XLA Tensor to run on XLA device
matmul_output_xla = torch.ops.xla.quantized_matmul(x_xla, w_int_xla, scaler_xla)
# Use with torch.compile(backend='openxla')
def f(x, w, s):
  return torch.ops.xla.quantized_matmul(x, w, s)
f_dynamo = torch.compile(f, backend="openxla")
dynamo_out_xla = f_dynamo(x_xla, w_int_xla, scaler_xla)
It’s common to wrap the quantized op into a custom nn.Module in the model developer's model code:
class MyQLinearForXLABackend(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.weight = ...
    self.scaler = ...

  def load_weight(self, w, scaler):
    # Load quantized Linear weights
    # Customized way to preprocess the weights
    ...
    self.weight = processed_w
    self.scaler = processed_scaler

  def forward(self, x):
    # Do some random stuff with x
    ...
    matmul_output = torch.ops.xla.quantized_matmul(x, self.weight, self.scaler)
    # Do some random stuff with matmul_output
    ...
    return matmul_output
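The wrapper above can then be compiled with the openxla backend like any other module. Below is a minimal usage sketch, assuming load_weight ends up storing an int8 weight and a bfloat16 per-channel scaler like the tensors from the earlier example; the preprocessing body is elided above, so this is illustrative only:
# Illustrative usage sketch: reuse x, w_int, scaler, and device from the example above.
q_layer = MyQLinearForXLABackend()
q_layer.load_weight(w_int.to(device), scaler.to(device))

q_layer_dynamo = torch.compile(q_layer, backend="openxla")
out_xla = q_layer_dynamo(x.to(device))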
Module Swap¶
Alternatively, users can also use the nn.Module that wraps the XLA quantized ops and do a module swap in the model code:
orig_model = MyModel()
# Quantize the model and get quantized weights
q_weights = quantize(orig_model)
# Process the quantized weight to the format that XLA quantized op expects.
q_weights_for_xla = process_for_xla(q_weights)

# Do module swap
q_linear = XlaQuantizedLinear(orig_model.linear.in_features,
                              orig_model.linear.out_features)
q_linear.load_quantized_weight(q_weights_for_xla)
orig_model.linear = q_linear
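Note that quantize and process_for_xla above stand in for the model developer's own quantization flow; they are not provided by torch_xla. As one hedged illustration of what such processing could produce for the per-channel symmetric int8 case (the W8A16 rows in the table below), the helper sketched here derives an int8 weight and a bfloat16 per-channel scaler; check the op's docstring for the exact layout it expects:
def per_channel_sym_quantize(w_fp):
  # w_fp: floating-point weight of shape (out_features, in_features).
  # One symmetric scale per output channel (illustrative sketch, not a torch_xla API).
  max_abs = w_fp.abs().amax(dim=1).clamp(min=1e-5)   # avoid division by zero
  scaler = (max_abs / 127.0).to(torch.bfloat16)
  w_int = torch.round(w_fp / scaler.unsqueeze(1).to(w_fp.dtype))
  w_int = w_int.clamp(-127, 127).to(torch.int8)
  return w_int, scaler
With this scheme, w_int.to(torch.bfloat16) * scaler.unsqueeze(1) approximately reconstructs the original weight, matching the weight and scaler shapes used in the first example.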
Supported Quantized Operations:¶
Matrix Multiply¶
| Weight Quantization Type | Activation Quantization Type | Dtype | Supported |
|---|---|---|---|
| per-channel (sym/asym) | N/A | W8A16 | Yes |
| per-channel (sym/asym) | N/A | W4A16 | Yes |
| per-channel | per-token | W8A8 | No |
| per-channel | per-token | W4A8 | No |
| blockwise (sym/asym) | N/A | W8A16 | Yes |
| blockwise (sym/asym) | N/A | W4A16 | Yes |
| blockwise | per-token | W8A8 | No |
| blockwise | per-token | W4A8 | No |
Note W[X]A[Y] refers to the weight in X-bit and the activation in Y-bit. If X/Y is 4 or 8, it refers to int4/int8; 16 refers to the bfloat16 format.
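For example, the per-channel W8A16 rows correspond to the int8 weight plus bfloat16 scaler used in the first code snippet, with activations kept in bfloat16. The following sanity-check sketch reuses x, w_int, scaler, and matmul_output from that snippet and assumes the op behaves like a linear layer on the dequantized weight; see the op's docstring for the authoritative layout and semantics:
# Sanity-check sketch: dequantize the int8 weight and compare on CPU.
w_deq = w_int.to(torch.bfloat16) * scaler.unsqueeze(1)   # per-channel dequantization
ref_output = x @ w_deq.t()                               # bfloat16 reference
print((ref_output - matmul_output).abs().max())          # expected to be small (bfloat16 rounding)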
Embedding¶
To be added