We are excited to announce the release of PyTorch® 2.9 (release notes)! This release features:
- Updates to the stable libtorch ABI for third-party C++/CUDA extensions
- Symmetric memory that enables easy programming of multi-GPU kernels
- The ability to arbitrarily toggle error or resume on graph breaks in torch.compile
- Expanded wheel variant support to include ROCm, XPU, and CUDA 13
- FlexAttention enablement on Intel GPUs
- Flash decoding optimization based on FlexAttention on X86 CPU
- Arm Platform improvements and optimizations
- Enablement of Linux aarch64 binary wheel builds across all supported CUDA versions
This release is composed of 3216 commits from 452 contributors since PyTorch 2.8. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.9. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
API-UNSTABLE FEATURES
[API-Unstable] torch::stable::Tensor
If you maintain and build your own custom C++/CUDA extensions with PyTorch, this update is for you! We’ve been building out a stable ABI with C++ convenience wrappers to enable you to build extensions with one torch version and run with another. We’ve added the following APIs since the last release:
- Introducing device utils (think Device Guard and Stream) accessible in torch/csrc/stable/accelerator.h.
- More torch::stable::Tensor APIs: a default constructor, is_cpu, scalar_type, and get_device_index
- More stable ATen ops accessible in torch/csrc/stable/ops.h: amax, narrow, new_empty + new_zeros dtype variant, pad
With these APIs, we have been able to enable a libtorch-ABI wheel for Flash-Attention 3: see the PR here. While we have been intentional about API design to ensure maximal stability, please note that the high-level C++ APIs are still in preview! We are working on many next steps: building out the ABI surface, establishing versioning, writing more docs, and enabling more custom kernels to be ABI stable.
[API-Unstable] Symmetric memory programming
We introduce PyTorch Symmetric Memory to enable easy programming of multi-GPU kernels that work over NVLinks as well as RDMA networks. Symmetric Memory unlocks three new programming opportunities:
- In-kernel communication: GPU kernels are now able to issue communication primitives such as puts and gets, interleaved with computation instructions, allowing fusion at the smallest possible granularity.
- Ultralow-latency remote access: remote access can be made one-sided, without waiting for remote GPUs to issue a corresponding command. Protocols like RDMA and IB-GDA enable direct memory access over the network.
- Customized communication patterns: increased flexibility allows authoring kernels tailored to the communication needs of the application. It also becomes straightforward to add support for new data layouts, even if they are not yet present in the standard libraries.
Release 2.9 will include:
- Allocation of symmetric tensors that allow remote direct access (currently supported backends: CUDA and NVSHMEM).
- Accelerated collectives leveraging direct access, such as one_shot_all_reduce, two_shot_all_reduce_, multimem_all_gather_out, etc.
- Nontraditional all_to_all_v for MoE models (all_to_all_vdev, all_to_all_vdev_2d for token dispatch, all_to_all_vdev_2d_offset for token combination).
- Programming support for customized multi-GPU kernels: NVSHMEM plugin for Triton.
- Continued support of Async TP and generalization to other patterns.
Symmetric Memory operations are available under torch.ops.symm_mem.
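As a minimal sketch of what this looks like from Python (assuming a CUDA build, NVLink-connected GPUs on a single node, and a torchrun launch; Symmetric Memory is API-unstable, so the entry points in torch.distributed._symmetric_memory may still change):

```python
# Minimal sketch: launch with `torchrun --nproc-per-node=<N> script.py`.
import os
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")
torch.cuda.set_device(device)
dist.init_process_group("nccl")
group_name = dist.group.WORLD.group_name

# Allocate from the symmetric-memory allocator, then rendezvous so every
# rank can directly access its peers' buffers.
t = symm_mem.empty(4096, dtype=torch.bfloat16, device=device)
symm_mem.rendezvous(t, group_name)

# Accelerated collective built on direct remote access.
t.fill_(dist.get_rank())
out = torch.ops.symm_mem.one_shot_all_reduce(t, "sum", group_name)

dist.destroy_process_group()
```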
[API-Unstable] Ability to toggle error or resume on graph breaks in torch.compile
This feature expands torch.compile's graph break options by introducing torch._dynamo.error_on_graph_break(), a context manager/decorator that lets the user mark regions of torch.compile'd code where torch.compile should either error out or resume when encountering a graph break. In contrast to the existing fullgraph functionality (once fullgraph is set to True, it cannot be set back to False), error_on_graph_break can be toggled arbitrarily. More details in the tutorial.
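For example (a sketch assuming error_on_graph_break takes a boolean toggle, per the tutorial, and is used as a context manager inside compiled code):

```python
import torch

@torch.compile  # fullgraph defaults to False, so graph breaks normally just resume
def fn(x):
    x = x + 1
    # Within this region, a graph break raises an error instead of silently
    # splitting the graph and resuming in eager.
    with torch._dynamo.error_on_graph_break(True):
        x = torch.sin(x)
    # Outside the region, graph breaks are tolerated again.
    return x * 2

print(fn(torch.randn(4)))
```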
[API-Unstable] Expanded wheel variant support
PyTorch continues its commitment to improving the Python packaging ecosystem through participation in the WheelNext initiative. Ongoing work in this area will be presented at the 2025 PyTorch Conference by Jonathan Dekhtiar (NVIDIA) and Eli Uriegas (Meta): Talk.
The PyTorch 2.9.0 release expands the wheel variant support matrix by adding AMD (ROCm), Intel (XPU), and NVIDIA CUDA 13. This work follows up on the initial enablement of the feature in PyTorch 2.8 and includes provider plugins for all of the aforementioned hardware platforms, which can automatically detect platform attributes, including the installed hardware and the supported software stack.
While NVIDIA CUDA wheels support both Windows and Linux, ROCm (full blog here) and XPU platforms currently only support Linux.
NOTE: This particular feature is experimental and based on the (work-in-progress) wheel variants proposal.
Installation instructions
Linux x86 and aarch64, macOS:
curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh/v0.0.2 sh
uv venv
uv pip install torch
Windows x86:
powershell -c { $env:INSTALLER_DOWNLOAD_URL = 'https://wheelnext.astral.sh/v0.0.2'; irm https://astral.sh/uv/install.ps1 | iex }
uv venv
uv pip install torch
[API-Unstable] FlexAttention Enablement on Intel GPUs
FlexAttention forward and backward are now supported on Intel GPUs, aligned with PyTorch's common GPU behavior, giving users more consistent and portable performance across different GPUs. Developers can write FlexAttention code once and run it across GPU vendors, or run Hugging Face Transformers models that already rely on FlexAttention, and get efficient attention without modifying any code.
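A minimal sketch (assuming an XPU-enabled PyTorch 2.9 build; the same code runs on CUDA by changing the device string):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

device = "xpu" if torch.xpu.is_available() else "cuda"

def causal(score, b, h, q_idx, kv_idx):
    # Mask out future positions: standard causal attention as a score_mod.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q, k, v = (
    torch.randn(2, 8, 1024, 64, device=device, dtype=torch.bfloat16, requires_grad=True)
    for _ in range(3)
)

compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=causal)
out.sum().backward()  # backward is now enabled on Intel GPUs as well
```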
[API-Unstable] Flash decoding optimization based on FlexAttention on X86 CPU
Flash decoding, a common technique in LLM inference that speeds up attention and delivers faster generation for long sequences, previously existed only on the CUDA path. In this release, we introduce flash decoding support based on FlexAttention in the x86 CPU Inductor backend, which parallelizes over the KV sequence via partitioning and reduction. This optimization can greatly improve CPU utilization when the original parallelism is insufficient, e.g. small batch size, head count, or query sequence length combined with a long KV sequence. It is expected to help PyTorch users improve performance in the LLM decoding phase, especially with long context lengths.
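A decoding-shaped sketch on CPU (shapes are illustrative only): compiling flex_attention routes it through the Inductor CPU backend, where the flash-decoding path can parallelize a single query token attending to a long KV cache.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, KV_LEN, D = 1, 8, 16384, 64      # small batch/heads, long context
q = torch.randn(B, H, 1, D)            # single decode-step query
k = torch.randn(B, H, KV_LEN, D)       # cached keys
v = torch.randn(B, H, KV_LEN, D)       # cached values

compiled_flex = torch.compile(flex_attention)  # CPU Inductor backend
out = compiled_flex(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1, 64])
```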
Arm Platform Enablement and Optimizations
PyTorch 2.9 delivers key backend improvements on Arm with better performance and test coverage.
- torch.compile improvements: TorchBench, HuggingFace, and TIMM test suites in torch.compile mode are faster than eager mode. See results on the PyTorch HUD Dashboard.
- Operator improvements: Expanded and optimized convolution, activation, and quantized ops on AArch64 for faster model execution.
- Broadened Arm CI coverage by adding Arm Neoverse V2-based AWS Graviton 4 instances.