TorchAO Quantized Models and Quantization Recipes Now Available on HuggingFace Hub

PyTorch now offers native quantized variants of Phi-4-mini-instruct, Qwen3, SmolLM3-3B, and gemma-3-270m-it through a collaboration between the TorchAO team, the ExecuTorch team, and Unsloth! These models leverage int4 and float8 quantization to deliver efficient inference on A100, H100, and mobile devices, with minimal to no degradation in model quality compared to their bfloat16 counterparts. Highlights:

  • We released pre-quantized models optimized for both server and mobile platforms: for users who want to deploy a faster model in production
  • We released comprehensive, reproducible quantization recipes and guides that cover model quality evaluation and performance benchmarking: for users applying PyTorch native quantization to their own models and datasets
  • You can also finetune with Unsloth and quantize the finetuned model with TorchAO (see the sketch after this list)
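For the finetune-then-quantize flow, here is a minimal sketch assuming the standard Unsloth loading API and TorchAO's quantize_ entry point; the base model, group size, and training details are illustrative placeholders, not the exact recipe used for the released models.

```python
# Minimal sketch: finetune with Unsloth, then quantize the result with TorchAO.
# Assumes unsloth, trl, and a recent torchao are installed; names below are illustrative.
from unsloth import FastLanguageModel
from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig

# Load the base model in high precision with Unsloth (LoRA setup and training loop omitted).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B",   # illustrative base model
    max_seq_length=2048,
    load_in_4bit=False,           # keep bf16 weights so TorchAO can quantize them afterwards
)

# ... add LoRA adapters, finetune (e.g. with trl's SFTTrainer), and merge the adapters ...

# Quantize the finetuned weights with a TorchAO config (int8 dynamic activation + int4 weight,
# the mobile-oriented scheme described below).
quantize_(model, Int8DynamicActivationInt4WeightConfig(group_size=32))
```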

Post Training Quantized Models and Reproducible Recipes

So far, we have released the following quantized variants of Phi-4-mini-instruct, Qwen3, SmolLM3-3B, and gemma-3-270m-it:

Int4 weight-only quantization with the HQQ algorithm and AWQ (for server H100 and A100 GPUs)

Results:
  • 1.1-1.2x speedup over the bfloat16 model on A100 and 1.75x speedup on H100 at batch size 1
  • Small accuracy degradation from the bfloat16 model; for example, Phi-4-mini-instruct-INT4 scored 53.28 on average while the bfloat16 baseline scored 55.35 across the 13 tasks we evaluated (further details are available on the respective model cards)
  • For tasks with larger accuracy drops, e.g. mmlu_pro where Phi-4-mini-instruct-INT4 scored 36.98, Phi-4-mini-instruct-AWQ-INT4 recovers accuracy to 43.13 using 2 samples of calibration data from mmlu_pro (further details are available on the respective model cards)
  • 60% peak memory reduction

Models:
  • Phi-4-mini-instruct-INT4
  • Phi-4-mini-instruct-AWQ-INT4
  • Qwen3-8B-INT4
  • Qwen3-8B-AWQ-INT4

Float8 dynamic activation and float8 weight quantization (for server H100 GPU)

Results:
  • 1.7-2x speedup on H100 over the bfloat16 model (depending on model size) at batch sizes 1 and 256
  • Little or no accuracy degradation from the bfloat16 model, e.g. Phi-4-mini-instruct-FP8 scored 55.11 on average while the bfloat16 baseline scored 55.35 across the 13 tasks we evaluated
  • 30-40% peak memory reduction

Models:
  • gemma-3-270m-it-torchao-FP8
  • Phi-4-mini-instruct-FP8
  • Qwen3-32B-FP8

Int8 dynamic activation and int4 weight quantization (for mobile CPU)

Results:
  • Small accuracy degradation from the bfloat16 model
  • Enables the model to run on iOS and Android devices, such as iPhone 15 Pro and Samsung Galaxy S22

Models:
  • Phi-4-mini-instruct-INT8-INT4
  • Qwen3-4B-INT8-INT4
  • SmolLM3-3B-INT8-INT4

Each model card linked above includes a reproducible quantization recipe built with the TorchAO library, so you can apply the same recipes to quantize other models as well. A minimal sketch of one such recipe is shown below.
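The sketch uses the float8 dynamic activation and float8 weight recipe via the TorchAoConfig integration in transformers, assuming recent transformers and torchao releases; the model id is just an example, and the exact configuration for each released model is documented in its model card.

```python
# Minimal sketch: quantize a bfloat16 checkpoint with TorchAO via transformers.
# Assumes recent transformers and torchao releases; see the model cards for the exact recipes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

model_id = "microsoft/Phi-4-mini-instruct"  # example base model

# Float8 dynamic activation + float8 weight quantization (the FP8 recipe targeting H100).
quantization_config = TorchAoConfig(
    quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The quantized checkpoint can then be saved or pushed to the Hub for reuse.
# model.push_to_hub("your-org/Phi-4-mini-instruct-FP8")  # hypothetical repo id
```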

Integrations

PyTorch native quantized models benefit from strong integrations across the PyTorch ecosystem, enabling robust, high-performance quantization solutions that meet diverse deployment needs.

Here is what we use across the stack to quantize, finetune, evaluate model quality and latency, and deploy the model. The released quantized models and quantization recipes work seamlessly throughout the model preparation and deployment lifecycle.

[Figure: the tools used across the stack to quantize, finetune, evaluate model quality and latency, and deploy the model]
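As a concrete example of the deployment end of this lifecycle, the sketch below loads one of the released pre-quantized checkpoints with transformers and runs a short generation. The Hub repo id is an assumption for illustration, so substitute the exact id from the model card or the released models collection, and make sure torchao is installed so the quantized weights can be loaded.

```python
# Minimal sketch: load a released pre-quantized checkpoint and run a short generation.
# The repo id is assumed for illustration -- use the exact id from the model card or collection.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-FP8"  # assumed Hub repo id
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",  # the FP8 recipe targets H100-class GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "In one sentence, what does int4 quantization do?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```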

What’s Next

  • New Features
    • MoE quantization for both inference and training
    • New dtype support: NVFP4
    • More accuracy-preserving post-training quantization techniques, e.g. SmoothQuant, GPTQ, SpinQuant
  • Collaborations
    • Continue partnering with Unsloth to make TorchAO available to its users for finetuning, QAT, and post-training quantization, and for releasing TorchAO quantized models
    • Partner with vLLM to deliver optimized end-to-end server inference performance, leveraging fast kernels from FBGEMM

Call To Action

Please try out our models and quantization recipes, and let us know your thoughts by opening issues in TorchAO or starting discussions on the released model pages. You can also reach out to us in our Discord channel. We'd also love to learn how the community quantizes models today and to partner on releasing quantized models on HuggingFace in the future.