We are excited to announce the release of PyTorch® 2.9 (release notes)! This release features:
- Updates to the stable libtorch ABI for third-party C++/CUDA extensions
- Symmetric memory that enables easy programming of multi-GPU kernels
- The ability to arbitrarily toggle error or resume on graph breaks in torch.compile
- Expanded wheel variant support to include ROCm, XPU, and CUDA 13
- FlexAttention enablement on Intel GPUs
- Flash decoding optimization based on FlexAttention on X86 CPU
- Arm Platform improvements and optimizations
- Enablement of Linux aarch64 binary wheel builds across all supported CUDA versions
This release is composed of 3216 commits from 452 contributors since PyTorch 2.8. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.9. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
API-UNSTABLE FEATURES
[API-Unstable] torch::stable::Tensor
If you maintain and build your own custom C++/CUDA extensions with PyTorch, this update is for you! We’ve been building out a stable ABI with C++ convenience wrappers to enable you to build extensions with one torch version and run with another. We’ve added the following APIs since the last release:
- Introducing device utils (think Device Guard and Stream) accessible in torch/csrc/stable/accelerator.h.
- More torch::stable::Tensor APIs: a default constructor, is_cpu, scalar_type, and get_device_index
- More stable ATen ops accessible in torch/csrc/stable/ops.h: amax, narrow, new_empty + new_zeros dtype variant, pad
With these APIs, we have been able to enable a libtorch-ABI wheel for Flash-Attention 3: see the PR here. While we have been intentional about API design to ensure maximal stability, please note that the high-level C++ APIs are still in preview! We are working on many next steps: building out the ABI surface, establishing versioning, writing more docs, and enabling more custom kernels to be ABI stable.
[API-Unstable] Symmetric memory programming
We introduce PyTorch Symmetric Memory to enable easy programming of multi-GPU kernels that work over NVLinks as well as RDMA networks. Symmetric Memory unlocks three new programming opportunities:
- In-kernel communication: GPU kernels are now able to issue communication primitives such as puts and gets, interleaved with computation instructions, allowing fusion at the smallest possible granularity.
- Ultralow-latency remote access: remote access can be made one-sided, without waiting for remote GPUs to issue a corresponding command. Protocols like RDMA and IB-GDA enable direct memory access over the network.
- Customized communication patterns: increased flexibility allows authoring kernels tailored to the communication needs of the application. It also becomes straightforward to add support for new data layouts, even if they are not yet present in the standard libraries.
Release 2.9 will include:
- Allocation of symmetric tensors that allow remote direct access (currently supported backends: CUDA and NVSHMEM).
- Accelerated collectives leveraging direct access, such as one_shot_all_reduce, two_shot_all_reduce_, multimem_all_gather_out, etc.
- Nontraditional all_to_all_v for MoE models (all_to_all_vdev, all_to_all_vdev_2d for token dispatch, all_to_all_vdev_2d_offset for token combination).
- Programming support for customized multi-GPU kernels: NVSHMEM plugin for Triton.
- Continued support of Async TP and generalization to other patterns.
Symmetric Memory operations are available under torch.ops.symm_mem.
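As a minimal sketch of what this looks like from Python (assuming a CUDA build, NVLink-connected GPUs on a single node, and a torchrun launch; Symmetric Memory is API-unstable, so the entry points in torch.distributed._symmetric_memory may still change):

```python
# Minimal sketch: launch with `torchrun --nproc-per-node=<N> script.py`.
import os
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")
torch.cuda.set_device(device)
dist.init_process_group("nccl")
group_name = dist.group.WORLD.group_name

# Allocate from the symmetric-memory allocator, then rendezvous so every
# rank can directly access its peers' buffers.
t = symm_mem.empty(4096, dtype=torch.bfloat16, device=device)
symm_mem.rendezvous(t, group_name)

# Accelerated collective built on direct remote access.
t.fill_(dist.get_rank())
out = torch.ops.symm_mem.one_shot_all_reduce(t, "sum", group_name)

dist.destroy_process_group()
```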
[API-Unstable] Ability to toggle error or resume on graph breaks in torch.compile
This feature expands torch.compile's graph break options by introducing torch._dynamo.error_on_graph_break(), a context manager/decorator that lets the user mark regions of torch.compile'd code where torch.compile should either error out or resume when encountering a graph break. In contrast to the existing fullgraph functionality (once fullgraph is set to True, it cannot be set back to False), error_on_graph_break can be toggled arbitrarily. More details in the tutorial.
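For example (a sketch assuming error_on_graph_break takes a boolean toggle, per the tutorial, and is used as a context manager inside compiled code):

```python
import torch

@torch.compile  # fullgraph defaults to False, so graph breaks normally just resume
def fn(x):
    x = x + 1
    # Within this region, a graph break raises an error instead of silently
    # splitting the graph and resuming in eager.
    with torch._dynamo.error_on_graph_break(True):
        x = torch.sin(x)
    # Outside the region, graph breaks are tolerated again.
    return x * 2

print(fn(torch.randn(4)))
```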
[API-Unstable] Expanded wheel variant support
PyTorch continues its commitment to improving the Python packaging ecosystem through participation in the WheelNext initiative. Ongoing work in this area will be presented at the 2025 PyTorch Conference by Jonathan Dekhtiar (NVIDIA) and Eli Uriegas (Meta): Talk.
The PyTorch 2.9.0 release expands the wheel variant support matrix by adding AMD (ROCm), Intel (XPU), and NVIDIA CUDA 13. This work follows up on the initial enablement of the feature in PyTorch 2.8 and includes provider plugins for all of the aforementioned hardware platforms, which can automatically detect platform attributes, including the installed hardware and the supported software stack.
While NVIDIA CUDA wheels support both Windows and Linux, ROCm (full blog here) and XPU platforms currently only support Linux.
NOTE: This particular feature is experimental and based on the (work-in-progress) wheel variants proposal.
Installation instructions
Linux x86 and aarch64, macOS:
curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh/v0.0.2 sh
uv venv
uv pip install torch
Windows x86:
powershell -c { $env:INSTALLER_DOWNLOAD_URL = 'https://wheelnext.astral.sh/v0.0.2'; irm https://astral.sh/uv/install.ps1 | iex }
uv venv
uv pip install torch
[API-Unstable] FlexAttention Enablement on Intel GPUs
FlexAttention forward and backward are now supported on Intel GPUs, aligned with PyTorch's common GPU behavior, giving users more consistent and portable performance across different GPUs. Developers can write FlexAttention code once and run it across GPU vendors, or run Hugging Face Transformers models that already rely on FlexAttention, and get efficient attention without modifying any code.
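A minimal sketch (assuming an XPU-enabled PyTorch 2.9 build; the same code runs on CUDA by changing the device string):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

device = "xpu" if torch.xpu.is_available() else "cuda"

def causal(score, b, h, q_idx, kv_idx):
    # Mask out future positions: standard causal attention as a score_mod.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q, k, v = (
    torch.randn(2, 8, 1024, 64, device=device, dtype=torch.bfloat16, requires_grad=True)
    for _ in range(3)
)

compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=causal)
out.sum().backward()  # backward is now enabled on Intel GPUs as well
```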
[API-Unstable] Flash decoding optimization based on FlexAttention on X86 CPU
Flash decoding, a common technique in LLM inference that speeds up attention and delivers faster generation for long sequences, previously existed only on the CUDA path. In this release, we introduce flash decoding support based on FlexAttention in the x86 CPU Inductor backend, which parallelizes over the KV sequence via partitioning and reduction. This optimization can greatly improve CPU utilization when the original parallelism is insufficient, e.g. small batch size, head count, or query sequence length combined with a long KV sequence. It is expected to help PyTorch users improve performance in the LLM decoding phase, especially with long context lengths.
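A decoding-shaped sketch on CPU (shapes are illustrative only): compiling flex_attention routes it through the Inductor CPU backend, where the flash-decoding path can parallelize a single query token attending to a long KV cache.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, KV_LEN, D = 1, 8, 16384, 64      # small batch/heads, long context
q = torch.randn(B, H, 1, D)            # single decode-step query
k = torch.randn(B, H, KV_LEN, D)       # cached keys
v = torch.randn(B, H, KV_LEN, D)       # cached values

compiled_flex = torch.compile(flex_attention)  # CPU Inductor backend
out = compiled_flex(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1, 64])
```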
Arm Platform Enablement and Optimizations
PyTorch 2.9 delivers key backend improvements on Arm with better performance and test coverage.
- torch.compile improvements: TorchBench, HuggingFace, and TIMM test suites in torch.compile mode are faster than eager mode. See results on the PyTorch HUD Dashboard.
- Operator improvements: Expanded and optimized convolution, activation, and quantized ops on AArch64 for faster model execution.
- Broadened Arm CI coverage by adding Arm Neoverse V2-based AWS Graviton 4 instances.