PyTorch now offers native quantized variants of Phi-4-mini-instruct, Qwen3, SmolLM3-3B, and gemma-3-270m-it through a collaboration between the TorchAO team, the ExecuTorch team, and Unsloth! These models use int4 and float8 quantization to deliver efficient inference on A100 and H100 GPUs as well as mobile devices, with minimal to no degradation in model quality compared to their bfloat16 counterparts. Highlights:
- We released pre-quantized models optimized for both server and mobile platforms, for users who want to deploy a faster model in production (see the loading sketch after this list).
- We released comprehensive, reproducible quantization recipes and guides that cover model quality evaluation and performance benchmarking, for users applying PyTorch native quantization to their own models and datasets.
- You can also fine-tune with Unsloth and quantize the fine-tuned model with TorchAO.
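For users who just want to run one of the released checkpoints, here is a minimal sketch of loading a pre-quantized model with Hugging Face transformers. The repo id below is illustrative (the exact names are listed in the table in the next section and on the released model pages), and torchao must be installed for the quantized weights to load.

```python
# Minimal sketch: run one of the released pre-quantized checkpoints with transformers.
# Requires: torch, transformers, torchao. The repo id is an assumption -- substitute
# the exact name from the released model pages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-FP8"  # illustrative repo id

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # non-quantized layers stay in bfloat16
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What does float8 quantization change?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```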
Post Training Quantized Models and Reproducible Recipes
So far, we have released the following quantized variants of Phi-4-mini-instruct, Qwen3, SmolLM3-3B, and gemma-3-270m-it:
| Quantization method | Models |
|---|---|
| Int4 weight-only quantization with the HQQ algorithm, and AWQ (for server H100 and A100 GPUs) | Phi-4-mini-instruct-INT4, Phi-4-mini-instruct-AWQ-INT4, Qwen3-8B-INT4, Qwen3-8B-AWQ-INT4 |
| Float8 dynamic activation and float8 weight quantization (for server H100 GPUs) | gemma-3-270m-it-torchao-FP8, Phi-4-mini-instruct-FP8, Qwen3-32B-FP8 |
| Int8 dynamic activation and int4 weight quantization (for mobile CPUs) | Phi-4-mini-instruct-INT8-INT4, Qwen3-4B-INT8-INT4, SmolLM3-3B-INT8-INT4 |
Each of these models ships with a reproducible quantization recipe, built on the TorchAO library, in its model card, so you can apply the same recipes to quantize other models as well.
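As a rough illustration of what such a recipe looks like, here is a sketch of quantizing a model yourself with TorchAO's `quantize_` API. The config classes below come from recent torchao releases and the group size is just a common default; follow the recipe in the specific model card for the exact settings and for saving/uploading the quantized checkpoint.

```python
# Sketch: applying TorchAO post-training quantization to your own model.
# Assumes a recent torchao release that provides these config classes.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import (
    quantize_,
    Int4WeightOnlyConfig,                       # int4 weight-only (server A100/H100)
    Float8DynamicActivationFloat8WeightConfig,  # float8 dynamic activation + weight (H100)
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="cuda"
)

# Int4 weight-only quantization; group_size=128 is a commonly used default.
quantize_(model, Int4WeightOnlyConfig(group_size=128))

# Alternatively, for H100: float8 dynamic activation + float8 weight.
# quantize_(model, Float8DynamicActivationFloat8WeightConfig())
```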
Integrations
PyTorch native quantized models benefit from strong integrations across the PyTorch ecosystem, enabling robust, high-performance quantization solutions that meet diverse deployment needs.
Here is what we use across the stack to quantize and fine-tune models, evaluate model quality and latency, and deploy the result. The released quantized models and quantization recipes work seamlessly throughout the model preparation and deployment lifecycle.
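For example, on the serving side a TorchAO-quantized checkpoint can be loaded by vLLM. The sketch below assumes one of the released repo ids and a recent vLLM build with torchao support.

```python
# Sketch: serving a released TorchAO-quantized checkpoint with vLLM.
# The repo id is illustrative; vLLM reads the quantization config from the checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Qwen3-8B-INT4")  # assumption: one of the released model repos
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Summarize int4 weight-only quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```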
What’s Next
- New Features
  - MoE quantization for both inference and training
  - New dtype support: NVFP4
  - More accuracy-preserving post-training quantization techniques, e.g. SmoothQuant, GPTQ, SpinQuant
- Collaborations
  - Continue to partner with Unsloth to make TorchAO available to its users for fine-tuning, QAT, and post-training quantization, and for releasing TorchAO quantized models
  - Partner with vLLM to deliver optimized end-to-end server inference performance, leveraging fast kernels from FBGEMM
Call To Action
Please try out our models and quantization recipes and let us know what you think: open issues in TorchAO, start discussions on the released model pages, or reach out to us on our Discord channel. We'd also love to learn how the community quantizes models today and to partner on releasing quantized models on Hugging Face in the future.