PyTorch 2.2: FlashAttention-v2 integration, AOTInductor

by Team PyTorch

We are excited to announce the release of PyTorch® 2.2 (release note)! PyTorch 2.2 offers ~2x performance improvements to scaled_dot_product_attention via FlashAttention-v2 integration, as well as AOTInductor, a new ahead-of-time compilation and deployment tool built for non-Python server-side deployments.

This release also includes improved torch.compile support for Optimizers, a number of new inductor optimizations, and a new logging mechanism called TORCH_LOGS.

Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.

Along with 2.2, we are also releasing a series of updates to the PyTorch domain libraries. More details can be found in the library updates blog.

This release is composed of 3,628 commits and 521 contributors since PyTorch 2.1. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.2. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.

Summary:

scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions.
PyTorch 2.2 introduces a new ahead-of-time extension of TorchInductor called AOTInductor, designed to compile and deploy PyTorch programs for non-Python server-side deployments.
torch.distributed supports a new abstraction for initializing and representing ProcessGroups called device_mesh.
PyTorch 2.2 ships a standardized, configurable logging mechanism called TORCH_LOGS.
A number of torch.compile improvements are included in PyTorch 2.2, including improved support for compiling Optimizers and improved TorchInductor fusion and layout optimizations.
Please note that we are deprecating macOS x86 support, and PyTorch 2.2.x will be the last version that supports macOS x64.

Stable	Beta	Performance Improvements
	FlashAttention-2 Integration	Inductor optimizations
	AOTInductor	aarch64 optimizations
	TORCH_LOGS
	device_mesh
	Optimizer compilation

*To see a full list of public feature submissions click here.

Beta Features

[Beta] FlashAttention-2 support in torch.nn.functional.scaled_dot_product_attention

torch.nn.functional.scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups (compared to the previous version) and reaching ~50-73% of theoretical maximum FLOPs/s on A100 GPUs.

More information on FlashAttention-2 is available in this paper.

For a walkthrough of how to use SDPA, please see this tutorial.
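As a minimal sketch of calling SDPA, the snippet below runs causal attention on random tensors. The shapes, the helper name attention_demo, and the device/dtype selection are illustrative assumptions. Which kernel SDPA dispatches to depends on the hardware and input dtypes; the FlashAttention-2 path typically needs a supported CUDA GPU with half-precision inputs, and on CPU the call falls back to another implementation.

```python
import torch
import torch.nn.functional as F


def attention_demo(batch=2, heads=4, seq_len=16, head_dim=32):
    """Run causal scaled dot-product attention on random inputs (illustrative helper)."""
    # Assumption: use half precision on GPU so a fused kernel is eligible,
    # and plain float32 on CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # SDPA expects (batch, heads, sequence, head_dim) layouts.
    q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # is_causal=True applies a lower-triangular mask without materializing it.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)


out = attention_demo()
print(out.shape)
```

The output has the same shape as the query tensor, so the call can stand in for a hand-written softmax(QK^T)V block without changing surrounding code.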

[Beta] AOTInductor: ahead-of-time compilation and deployment for torch.export-ed programs

AOTInductor is an extension of TorchInductor, designed to process exported PyTorch models, optimize them, and produce shared libraries as well as other relevant artifacts. These compiled artifacts can be deployed in non-Python environments, which are commonly used for server-side inference. Note that AOTInductor supports the same backends as Inductor, including CUDA, ROCm, and CPU.

For more information please see the AOTInductor tutorial.

[Beta] Fine-grained configurable logging via TORCH_LOGS

PyTorch now ships a standardized, configurable logging mechanism that can be used to analyze the status of various subsystems such as compilation and distributed operations.

Logs can be enabled via the TORCH_LOGS environment variable. For example, to set the log level of TorchDynamo to logging.ERROR and the log level of TorchInductor to logging.DEBUG, pass TORCH_LOGS="-dynamo,+inductor" to PyTorch.

For more information, please see the logging documentation and tutorial.
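To make the prefix convention concrete, here is a small sketch that builds a TORCH_LOGS value from a mapping of component names to logging levels. The helper torch_logs_value is hypothetical and not part of PyTorch. The "-" and "+" prefixes follow the example above; treating an unprefixed name as logging.INFO is an assumption to check against the logging documentation.

```python
import logging
import os

# "-" sets a component to ERROR and "+" sets it to DEBUG, as in the example above.
# Assumption: an unprefixed component name corresponds to INFO.
_PREFIX = {logging.ERROR: "-", logging.INFO: "", logging.DEBUG: "+"}


def torch_logs_value(levels):
    """Build a TORCH_LOGS string from {component: logging level} (hypothetical helper)."""
    parts = []
    for name, level in levels.items():
        if level not in _PREFIX:
            raise ValueError(f"unsupported level for {name!r}: {level}")
        parts.append(_PREFIX[level] + name)
    return ",".join(parts)


value = torch_logs_value({"dynamo": logging.ERROR, "inductor": logging.DEBUG})

# Build an environment for a child process rather than mutating this one;
# the variable is normally set when the PyTorch program is launched.
env = dict(os.environ, TORCH_LOGS=value)
print(value)
```

The resulting env mapping can be passed to subprocess.run(..., env=env) when launching a training or inference script.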

[Beta] torch.distributed.device_mesh

PyTorch 2.2 introduces a new abstraction for representing the ProcessGroups involved in distributed parallelisms called torch.distributed.device_mesh. This abstraction allows users to represent inter-node and intra-node process groups via an N-dimensional array where, for example, one dimension can represent data parallelism in FSDP while another can represent tensor parallelism within FSDP.

For more information, see the device_mesh tutorial.
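The N-dimensional layout can be illustrated without launching any processes. The sketch below is plain Python with a hypothetical helper, groups_along. It assumes ranks are laid out row-major over the mesh shape and lists which ranks would share a group along each dimension of a 2 x 4 mesh. In a real job launched with torchrun, the mesh itself would be created with init_device_mesh from torch.distributed.device_mesh, as noted in the comment.

```python
from itertools import product
from math import prod

# In a real distributed job the mesh would be created with something like:
#   from torch.distributed.device_mesh import init_device_mesh
#   mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
# The code below only models the rank layout; it does not call torch.distributed.


def groups_along(mesh_shape, dim):
    """List the rank groups that vary only along `dim` (hypothetical helper).

    Assumes ranks 0..N-1 are laid out row-major over `mesh_shape`.
    """
    # Row-major strides: how far the rank moves per step along each dimension.
    strides = [prod(mesh_shape[i + 1:]) for i in range(len(mesh_shape))]
    other_dims = [range(n) for i, n in enumerate(mesh_shape) if i != dim]

    groups = []
    for coords in product(*other_dims):
        coords = list(coords)
        group = []
        for j in range(mesh_shape[dim]):
            full = coords[:dim] + [j] + coords[dim:]
            group.append(sum(c * s for c, s in zip(full, strides)))
        groups.append(group)
    return groups


mesh_shape = (2, 4)  # e.g. 2-way data parallel x 4-way tensor parallel
dp_groups = groups_along(mesh_shape, 0)
tp_groups = groups_along(mesh_shape, 1)
print("dp groups:", dp_groups)
print("tp groups:", tp_groups)
```

With eight ranks, each rank belongs to exactly one group per mesh dimension, which is the bookkeeping the device mesh abstraction takes over from manual process-group setup.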

[Beta] Improvements to torch.compile-ing Optimizers

A number of improvements have been made to torch.compile-ing Optimizers, including reduced overhead and support for CUDA graphs.

More technical details of the improvements are available on dev-discuss, and a recipe for torch.compile-ing optimizers is available here.

Performance Improvements

Inductor Performance Optimizations

A number of performance optimizations have been added to TorchInductor including horizontal fusion support for torch.concat, improved convolution layout optimizations, and improved scaled_dot_product_attention pattern matching.

For a complete list of inductor optimizations, please see the Release Notes.

aarch64 Performance Optimizations

PyTorch 2.2 includes a number of performance enhancements for aarch64, including support for mkldnn weight pre-packing, improved ideep primitive caching, and improved inference speed via fixed format kernel improvements to oneDNN.

For a complete list of aarch64 optimizations, please see the Release Notes.