Blog

Blog

Accelerating Mamba2 with Kernel Fusion

Summary: In this post, we discuss how we optimized the Mamba-2 State-Space Dual (SSD) module…

Rishi Astra, Tri Dao, Adnan Hoque | February 6, 2026

Blog

Some Matrix Multiplication Engines Are Not As Accurate As We Thought

What is an accumulator in an accelerator's GEMM engine and why does it matter? GPUs…

Chi-Chun (Charlie) Liu, Monodeep Kar, Naigang Wang, Raghu Kiran Ganti, Mudhakar Srivatsa | February 6, 2026

Blog

Building Highly Efficient Inference System for Recommenders Using PyTorch

Why Choose PyTorch for Recommendation System: PyTorch has emerged as the de facto framework in…

Lu Fang, Shiyan Deng, Hongyi Jia, Huamin Li, Ilina Mitra, Sheng Qin, Zhengkai Zhang, Zhuoran Zhao, Zinnia Zheng | February 5, 2026

Blog

Portable Paged Attention in Helion

Recently, the PyTorch team released Helion, a new domain-specific and PyTorch-based language to make the…

Burkhard Ringlein (IBM Research) and the vLLM Team at IBM Research | February 3, 2026

Blog Community

Unlock Reasoning in Llama 3.1-8B via Full Fine-Tuning on NVIDIA DGX Spark

What is the unsaid joy of local LLMs? The magic of downloading weights, running some…

Sanyam Bhutani (PyTorch Meta), Hamid Shojanazeri (PyTorch Meta), Clement Anthonioz Blanc (Meta) | February 2, 2026

Blog

Accelerating On-Device ML Inference with ExecuTorch and Arm SME2

Interactive image segmentation has become a defining mobile experience across the world’s most popular apps.…

Jason Zhu, Tyler Mullenbach, Damien Dooley, and Gian Marco Idoice, Arm | January 29, 2026

Blog

PyTorch 2.10 Release Blog

We are excited to announce the release of PyTorch® 2.10 (release notes)! This release features…

PyTorch Foundation | January 21, 2026

Blog

Supercharging LLMs: Scalable RL with torchforge and Weaver

Scaling reinforcement learning (RL) for post-training large language models (LLMs) is notoriously difficult. While running…

Stanford: Jon Saad-Falcon, Hangoo Kang, Simon Guo, Aakanksha Chowdhery, Azalia Mirhoseini; Meta: Allen Wang, Danning Xie, Evan Smothers, Felipe Mello, Jack Khuu, Jiyue Wang, Joe Cummings, Lucas Pasqualin, Philip Bontrager, Rithesh Baradi, Vidhya Venkat, Yuxuan Hu, Jafar Taghiyar, Davide Italiano, Gayathri Aiyer, John Myles White, Joe Spisak, Sanyam Bhutani, Hamid Shojanazeri, Matthias Reso, Ali Sol, Hossein Kavianihamedani, Emre Guven; CoreWeave: Deok Filho, Aaron Batilo, Matthew Guan, Xi Lu | January 9, 2026

Blog

Warp Specialization in Triton: Design and Roadmap

The Triton compiler aims to generate performance-portable code and runtime across hardware for AI kernels.…

Manman Ren, Nick Riasanovsky, Neil Dhar, Hongtao Yu, Jie Liu, Partha Kanuparthy, Shane Nay | January 8, 2026

Blog

PyTorch 2.9: FlexAttention Optimization Practice on Intel GPUs

Overview: The most recent LLM serving frameworks and models increasingly adopt attention variants, such as…

Intel PyTorch and Triton team | January 8, 2026

Blog

Deploying Smarter: Hardware-Software Co-design in PyTorch

If you want powerful on-device AI that doesn’t blow your memory budget or turn your…

Kieran Hejmadi, Arm | December 18, 2025

Blog

Enabling Cluster Launch Control with TLX

What is cluster launch control (CLC)? Blackwell brings in cluster launch control (CLC) to enable…

Daohang Shi, Hongtao Yu, Manman Ren | December 17, 2025

Blog Community

Hybrid Models Meet SGLang: More than Full Attention

Introduction: Hybrid models that combine the capabilities of full attention layers with alternatives, such as Mamba…

SGLang Team | December 3, 2025

Blog

Efficient MoE Pre-training at Scale on 1K AMD GPUs with TorchTitan

Training massive Mixture-of-Experts (MoE) models like DeepSeek-V3 and Llama 4-Scout efficiently is one of the…

AMD Contributors: Liz Li, Yanyuan Qin, Yuankai Chen, Xinyu Kang, Xiaobo Chen, Zhen Huang, Shekhar Pandey, Zhenyu Gu, Andy Luo; Meta Contributors: Matthias Reso, Hamid Shojanazeri, Tianyu Liu, Jiani Wang, Howard Huang, Wei Feng; Special Thanks: Guru MP, Yao Fu, Nick Ni, Emad Barsoum, Ramine Roane, and the TensorWave team for providing MI325 cluster | December 1, 2025

Blog Community

The Future of Inference: PyTorch ATX Event

On September 17, 2025, PyTorch ATX partnered with the vLLM community and Red Hat to…

Jason Meaux, ATX PyTorch leader and Stephen Watt, PyTorch Ambassador, Red Hat | November 26, 2025

Blog

OpenReg: A Self-Contained PyTorch Accelerator Simulator

Introduction: The PyTorch community is actively working to build a growing ecosystem of specialized accelerators…

Jiahao Chen (Huawei) & Jiawei Li (Huawei) & Zesheng Zong (Huawei) | November 21, 2025

Blog Community

Beyond Quantization: Bringing Sparse Inference to PyTorch

As developers, we all know the story: Large Language Models (LLMs) are revolutionary, but their…

Kira Selby & Varun Khare (NimbleEdge) | November 13, 2025

Blog

KernelFalcon: Autonomous GPU Kernel Generation via Deep Agents

Summary: We introduce KernelFalcon, a deep agent architecture for generating GPU kernels that combines hierarchical…

Laura Wang and the PyTorch Team at Meta | November 5, 2025

Blog

Hybrid Models as First-Class Citizens in vLLM

Introduction and Agenda: Large language models are now running into the scaling limits of attention.…

vLLM Team at IBM | November 5, 2025

Blog

Monarch + Lightning AI: Unlocking New Possibilities in Distributed Training

Introduction: Empowering the Next Generation of AI Builders. We are excited to announce a partnership…

PyTorch Team at Meta: Alireza Shamsoshoara, Lucas Pasqualin, Peng Zhang, Hamid Shojanazeri, Ahmad Sharif, Kiuk Chung; Lightning AI: Luca Antiga | October 22, 2025
