Blog

Blog

Efficient MoE Pre-training at Scale on 1K AMD GPUs with TorchTitan

Training massive Mixture-of-Experts (MoE) models like DeepSeek-V3 and Llama 4-Scout efficiently is one of the…

AMD Contributors: Liz Li, Yanyuan Qin, Yuankai Chen, Xinyu Kang, Xiaobo Chen, Zhen Huang, Shekhar Pandey, Zhenyu Gu, Andy Luo; Meta Contributors: Matthias Reso, Hamid Shojanazeri, Tianyu Liu, Jiani Wang, Howard Huang, Wei Feng; Special Thanks: Guru MP, Yao Fu, Nick Ni, Emad Barsoum, Ramine Roane, and the TensorWave team for providing the MI325 cluster

December 1, 2025

Blog Community

The Future of Inference: PyTorch ATX Event

On September 17, 2025, PyTorch ATX partnered with the vLLM community and Red Hat to…

Jason Meaux, ATX PyTorch leader, and Stephen Watt, PyTorch Ambassador, Red Hat

November 26, 2025

Blog

OpenReg: A Self-Contained PyTorch Accelerator Simulator

Introduction: The PyTorch community is actively working to build a growing ecosystem of specialized accelerators…

Jiahao Chen (Huawei) & Jiawei Li (Huawei) & Zesheng Zong (Huawei)

November 21, 2025

Blog Community

Beyond Quantization: Bringing Sparse Inference to PyTorch

As developers, we all know the story: Large Language Models (LLMs) are revolutionary, but their…

Kira Selby & Varun Khare (NimbleEdge)

November 13, 2025

Blog

KernelFalcon: Autonomous GPU Kernel Generation via Deep Agents

Summary: We introduce KernelFalcon, a deep agent architecture for generating GPU kernels that combines hierarchical…

Laura Wang and the PyTorch Team at Meta

November 5, 2025

Blog

Hybrid Models as First-Class Citizens in vLLM

Introduction and Agenda: Large language models are now running into the scaling limits of attention.…

vLLM Team at IBM

November 5, 2025

Blog

Monarch + Lightning AI: Unlocking New Possibilities in Distributed Training

Introduction: Empowering the Next Generation of AI Builders. We are excited to announce a partnership…

PyTorch Team at Meta: Alireza Shamsoshoara, Lucas Pasqualin, Peng Zhang, Hamid Shojanazeri, Ahmad Sharif, Kiuk Chung; Lightning AI: Luca Antiga

October 22, 2025

Blog

torchcomms: a modern PyTorch communications API

Introduction: Torchcomms is a new experimental, lightweight communication API intended for use with PyTorch Distributed…

Team torchcomms at Meta

October 22, 2025

Blog

Helion: A High-Level DSL for Performant and Portable ML Kernels

Introduction to Helion: In modern machine learning, the demand for high-performance computation has led to…

PyTorch Team at Meta

October 22, 2025

Blog

Introducing ExecuTorch 1.0: Powering the next generation of edge AI

TL;DR: ExecuTorch enables seamless, production-ready deployment of PyTorch models directly to edge devices (mobile, embedded,…

PyTorch Team at Meta

October 22, 2025

Blog

Introducing PyTorch Monarch

We now live in a world where ML workflows (pre-training, post-training, etc.) are heterogeneous,…

The PyTorch Team at Meta

October 22, 2025

Blog

Introducing torchforge – a PyTorch native library for scalable RL post-training and agentic development

In this post, we announce torchforge: A PyTorch-native agentic RL library that lets you focus…

The PyTorch Team at Meta

October 22, 2025

Blog

Enabling vLLM V1 on AMD GPUs With Triton

What is vLLM V1? In January 2025, the vLLM team announced the alpha release of…

vLLM Team at IBM Research, vLLM Team at Red Hat, and vLLM Team at AMD

October 21, 2025

Blog

PyTorch 2.9 Release Blog

We are excited to announce the release of PyTorch® 2.9 (release notes)! This release features: …

PyTorch Foundation

October 15, 2025

Blog

SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips

TLDR: Efficient full-parameter fine-tuning of GPT-OSS-20B & Qwen3-14B models on a single NVIDIA GH200 and…

Xinyu Lian, Minjia Zhang (SSAIL Lab, University of Illinois Urbana-Champaign), Masahiro Tanaka (Anyscale), Olatunji Ruwase (Snowflake)

October 9, 2025

Blog Community

When Quantization Isn’t Enough: Why 2:4 Sparsity Matters

TL;DR: Combining 2:4 sparsity with quantization offers a powerful approach to compress large language models…

Mohammad Mozaffari, Jesse Cai, Supriya Rao

October 6, 2025

Blog

TorchAO Quantized Models and Quantization Recipes Now Available on HuggingFace Hub

PyTorch now offers native quantized variants of Phi4-mini-instruct, Qwen3, SmolLM3-3B and gemma-3-270m-it through a collaboration…

Meta: Jerry Zhang, Scott Roy, Mergen Nachin, Kimish Patel, Supriya Rao, Jack Zhang, Guang Yang & Unsloth AI: Daniel Han

September 19, 2025

Blog

Experience in Reducing PT2 Compilation Time for Meta Internal Workloads

The Challenge of PyTorch 2.0 Compilation: Since the release of PyTorch 2.0 (PT2) and its…

Mingming Ding, James Wu, Oguz Ulgen, Sam Larsen, Bob Ren, Laith Sakka, Pian Pawakapan, Animesh Jain, Edward Yang, Yuzhen Huang, Ruilin Chen, Daohang Shi, Shuai Yang, Menglu Yu, Chunzhi Yang, Jade Nie

September 18, 2025

Blog

High-performance quantized LLM inference on Intel CPUs with native PyTorch

PyTorch 2.8 has just been released with a set of exciting new features, including a…

Intel PyTorch Team

September 17, 2025

Blog

PyTorch 2.8 Brings Native XCCL Support to Intel GPUs: Case Studies from Argonne National Laboratory

Intel announces a major enhancement for distributed training in PyTorch 2.8: the native integration of…

Intel PyTorch Team, Argonne National Laboratory

September 12, 2025
