(Beta) PyTorch Inference Performance Tuning on AWS Graviton Processors¶
Author: Sunita Nadampalli
AWS Graviton is a series of ARM-based processors designed by AWS. AWS Graviton3 processors are optimized for Machine Learning (ML) workloads, including support for bfloat16, Scalable Vector Extension (SVE), and twice the Single Instruction Multiple Data (SIMD) bandwidth compared to Graviton2.
PyTorch provides native reference ATen kernels for machine learning operators such as convolutions, matmul, relu, etc. These operators can be accelerated with platform-specific kernel implementations from Basic Linear Algebra Subprograms (BLAS) libraries. On AWS Graviton CPUs, MKLDNN with the Arm Compute Library (ACL) and OpenBLAS provide optimized implementations for a subset of the operators. Both libraries are integrated into PyTorch starting with the PyTorch 2.0 release.
In this tutorial we will cover how to achieve the best inference performance for a linear layer neural network on AWS Graviton3 CPUs (AWS c7g instance) with bfloat16 kernels and with the right backend selection.
Contents¶
Basic Usage
Speed up inference with Bfloat16 fast math kernels
Improve inference performance with OpenBLAS for smaller batch dimensions
Optimize memory allocation overhead with Linux Transparent huge pages
Conclusion
Note
To successfully run this tutorial and reproduce the speedup numbers shown below, you need an instance from the Graviton3 family (c7g/r7g/m7g) of hardware. For this tutorial, we used the c7g.xl (4 vCPU) instance.
Basic Usage¶
PyTorch natively supports AWS Graviton3 optimizations starting with PyTorch 2.0 version. Please refer to this blog for more details on the optimizations.
Install PyTorch by running the following command:
python3 -m pip install torch
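Optionally, you can confirm that the installed PyTorch version supports the Graviton optimizations and that you are running on an aarch64 host. A minimal sanity check might look like this:
# Optional sanity check: Graviton optimizations require PyTorch 2.0 or later,
# and the host should report the aarch64 architecture.
import platform
import torch

print(torch.__version__)    # expect 2.0 or later
print(platform.machine())   # expect 'aarch64' on Graviton instances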
We will start by importing the required dependencies and defining the device we will run on:
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity
# AWS Graviton3 cpu
device = "cpu"
print(f"Using {device} device")
Given that linear layers are at the heart of several neural networks, including transformers, we take a linear layer for this demo. We define our neural network by subclassing nn.Module and initializing the layers in __init__. We construct the network with typical large language model parameters to match the real world scenario:
class MyNeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(4096, 4096),
            nn.ReLU(),
            nn.Linear(4096, 11008),
            nn.ReLU(),
            nn.Linear(11008, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
Let’s create an instance of MyNeuralNetwork, and move it to the device:
model = MyNeuralNetwork().to(device)
print(model)
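Optionally, you can also count the trainable parameters to confirm the size of the network, roughly 62 million parameters for the layer shapes above:
# Optional: count the trainable parameters (~62M for the layer shapes above)
num_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {num_params / 1e6:.1f}M")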
Next, let’s get the prediction probabilities by passing the model output through an instance of the nn.Softmax module:
X = torch.rand(1, 64, 64, device=device)
logits = model(X)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted class: {y_pred}")
output:
Predicted class: tensor([2])
Our network functionality is verified. Next, we will profile the performance. Let’s check two different scenarios: small and large batch dimensions.
Scenario 1: A larger batch dimension, for example 256:
# warm it up first and loop over multiple times to have enough execution time
X = torch.rand(256, 64, 64, device=device)

with torch.set_grad_enabled(False):
    for _ in range(50):
        model(X)  # warmup
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("mymodel_inference"):
            for _ in range(100):
                model(X)

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
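Optionally, the same prof object can also export the collected profile as a Chrome trace for visual inspection in chrome://tracing or Perfetto (the trace.json filename below is just an example):
# Optional: export the profile for visual inspection
prof.export_chrome_trace("trace.json")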
Following is the profiler output with the default PyTorch configuration:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- |
| aten::addmm | 97.61% | 15.813s | 98.61% | 15.977s | 53.255ms | 300 |
| aten::clamp_min | 1.09% | 177.032ms | 1.09% | 177.032ms | 885.160us | 200 |
| aten::copy | 1.00% | 162.054ms | 1.00% | 162.054ms | 540.180us | 300 |
| mymodel_inference | 0.22% | 35.738ms | 100.00% | 16.201s | 16.201s | 1 |
| aten::linear | 0.02% | 2.955ms | 98.66% | 15.985s | 53.282ms | 300 |
| aten::t | 0.01% | 2.421ms | 0.03% | 5.043ms | 16.810us | 300 |
| aten::relu | 0.01% | 2.356ms | 1.11% | 179.388ms | 896.940us | 200 |
Self CPU time total: 16.201s
Speed up Inference with bfloat16 Fast Math Kernels¶
AWS Graviton3 processors support bfloat16 MMLA instructions. Arm Compute Library (ACL) provides optimized bfloat16 General Matrix Multiplication (GEMM) kernels for AWS Graviton processors, which are integrated into PyTorch via the MKLDNN backend starting with PyTorch 2.0. Inference performance can be optimized with these fast math GEMM kernels. The fast math mode is not enabled by default because the kernels perform the GEMM in bfloat16 precision instead of float, which results in a slight drop in model inference accuracy. However, the accuracy drop is within the cosine similarity threshold defined for the bfloat16 backend in the torchbench test suite, and hence acceptable for the majority of applications. To enable the fast math GEMM kernels, set the following environment variable:
$ export DNNL_DEFAULT_FPMATH_MODE=BF16
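If you want to sanity-check the accuracy impact on your own model, one option is to compare the fast math output against a float64 copy of the same weights. The following is a minimal sketch, not the torchbench cosine-similarity methodology; it assumes DNNL_DEFAULT_FPMATH_MODE=BF16 was exported before starting the Python process and reuses the X tensor from Scenario 1:
# Optional accuracy sanity check: compare the fast math output against a
# float64 copy of the same model and report the cosine similarity.
import copy
import torch.nn.functional as F

with torch.inference_mode():
    out_fast = model(X)                         # runs with the current fpmath mode
    ref_model = copy.deepcopy(model).double()   # float64 reference with the same weights
    out_ref = ref_model(X.double())

cos = F.cosine_similarity(out_fast.double().flatten(), out_ref.flatten(), dim=0)
print(f"Cosine similarity vs. float64 reference: {cos.item():.6f}")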
When you run the above inference script, you should see the following profiler output with the MKLDNN fast math mode enabled:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- |
| aten::addmm | 95.61% | 6.943s | 97.10% | 7.052s | 23.507ms | 300 |
| aten::clamp_min | 2.31% | 167.653ms | 2.31% | 167.653ms | 838.265us | 200 |
| aten::copy | 1.48% | 107.593ms | 1.48% | 107.593ms | 358.643us | 300 |
| mymodel_inference | 0.43% | 31.167ms | 100.00% | 7.262s | 7.262s | 1 |
| aten::linear | 0.04% | 2.911ms | 97.21% | 7.060s | 23.533ms | 300 |
| aten::t | 0.03% | 2.414ms | 0.07% | 4.892ms | 16.307us | 300 |
| aten::relu | 0.03% | 2.281ms | 2.34% | 169.934ms | 849.670us | 200 |
Self CPU time total: 7.262s
This is around a 2x (7.262s vs 16.201s) performance improvement with the bfloat16 fast math kernels. Next, let’s look at the smaller batch dimension scenario.
Scenario 2: A smaller batch dimension, for example, 32:
X = torch.rand(32, 64, 64, device=device)

with torch.set_grad_enabled(False):
    for _ in range(50):
        model(X)  # warmup
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("mymodel_inference"):
            for _ in range(100):
                model(X)

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
You should see the following profiler output when the above script is run with the PyTorch default configuration:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- |
| aten::addmm | 95.51% | 5.821s | 97.04% | 5.914s | 19.713ms | 300 |
| aten::clamp_min | 2.33% | 142.244ms | 2.33% | 142.244ms | 711.220us | 200 |
| aten::copy | 1.51% | 92.322ms | 1.51% | 92.322ms | 307.740us | 300 |
| mymodel_inference | 0.45% | 27.713ms | 100.00% | 6.094s | 6.094s | 1 |
| aten::linear | 0.04% | 2.495ms | 97.16% | 5.921s | 19.736ms | 300 |
| aten::t | 0.03% | 2.131ms | 0.07% | 4.441ms | 14.803us | 300 |
| aten::relu | 0.03% | 1.942ms | 2.37% | 144.186ms | 720.930us | 200 |
Self CPU time total: 6.094s
The following is the profiler output when the script is run with the MKLDNN fast math mode enabled:
$ export DNNL_DEFAULT_FPMATH_MODE=BF16
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- |
| aten::addmm | 93.31% | 3.848s | 95.66% | 3.944s | 13.148ms | 300 |
| aten::clamp_min | 3.43% | 141.309ms | 3.43% | 141.309ms | 706.545us | 200 |
| aten::copy | 2.33% | 95.916ms | 2.33% | 95.916ms | 319.720us | 300 |
| mymodel_inference | 0.67% | 27.431ms | 100.00% | 4.123s | 4.123s | 1 |
| aten::linear | 0.06% | 2.471ms | 95.83% | 3.951s | 13.170ms | 300 |
| aten::t | 0.05% | 2.027ms | 0.10% | 4.243ms | 14.143us | 300 |
| aten::relu | 0.05% | 1.928ms | 3.47% | 143.237ms | 716.185us | 200 |
Self CPU time total: 4.123s
The MKLDNN fast math mode yields approximately a 1.47x (4.123s vs 6.094s) performance improvement for smaller batch dimensions. Although the improvement is noteworthy, the overall performance still leaves room for improvement. This is because the runtime overhead (weight reorders and kernel launch time) from the oneDNN and ACL backends outweighs the compute benefit of the ACL GEMM kernels for smaller batch sizes.
Improve Inference Performance with OpenBLAS for Smaller Batch Dimensions¶
The inference performance for smaller batch dimensions can be improved by offloading the smaller shapes from the MKLDNN backend to the OpenBLAS backend. We are working on making the backend selection automatic, with robust heuristics, in future releases. Until the heuristics are implemented, smaller shapes can be offloaded to OpenBLAS by increasing the threshold for MKLDNN backend selection. In the following example, we use 64 as the threshold so that an input with a batch dimension of 32 is not dispatched to MKLDNN. Instead, it is dispatched to OpenBLAS.
$ export TORCH_MKLDNN_MATMUL_MIN_DIM=64
Here is the profiler output with OpenBLAS backend:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- |
| aten::addmm | 96.25% | 1.958s | 97.51% | 1.984s | 6.612ms | 300 |
| aten::clamp_min | 1.28% | 26.124ms | 1.28% | 26.124ms | 130.620us | 200 |
| aten::copy | 1.23% | 24.951ms | 1.23% | 24.951ms | 83.170us | 300 |
| mymodel_inference | 0.86% | 17.423ms | 100.00% | 2.034s | 2.034s | 1 |
| aten::linear | 0.08% | 1.691ms | 97.74% | 1.988s | 6.628ms | 300 |
| aten::t | 0.07% | 1.520ms | 0.14% | 2.945ms | 9.817us | 300 |
| aten::relu | 0.06% | 1.258ms | 1.35% | 27.382ms | 136.910us | 200 |
Self CPU time total: 2.034s
As you can see above, switching to the OpenBLAS backend doubled the performance (2.034s vs 4.123s) compared to the default MKLDNN backend configuration. The improvement becomes even more significant for smaller batch dimensions, for example, a batch dimension of 10:
X = torch.rand(10, 64, 64, device=device)

with torch.set_grad_enabled(False):
    for _ in range(50):
        model(X)  # warmup
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("mymodel_inference"):
            for _ in range(100):
                model(X)

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
The following is the profiler output with MKLDNN fast math mode:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- |
| aten::addmm | 87.81% | 3.613s | 91.90% | 3.781s | 12.604ms | 300 |
| aten::clamp_min | 7.18% | 295.437ms | 7.18% | 295.437ms | 1.477ms | 200 |
| aten::copy | 4.07% | 167.516ms | 4.07% | 167.516ms | 558.387us | 300 |
| mymodel_inference | 0.67% | 27.708ms | 100.00% | 4.115s | 4.115s | 1 |
| aten::linear | 0.06% | 2.499ms | 92.06% | 3.788s | 12.627ms | 300 |
| aten::t | 0.05% | 1.982ms | 0.11% | 4.385ms | 14.617us | 300 |
| aten::relu | 0.05% | 1.932ms | 7.23% | 297.369ms | 1.487ms | 200 |
Self CPU time total: 4.115s
and the following is the profiler output with the OpenBLAS backend:
$ export TORCH_MKLDNN_MATMUL_MIN_DIM=64
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- |
| aten::addmm | 92.66% | 1.179s | 95.23% | 1.211s | 4.038ms | 300 |
| aten::clamp_min | 2.83% | 36.060ms | 2.83% | 36.060ms | 180.300us | 200 |
| aten::copy | 2.52% | 32.013ms | 2.52% | 32.013ms | 106.710us | 300 |
| mymodel_inference | 1.38% | 17.521ms | 100.00% | 1.272s | 1.272s | 1 |
| aten::linear | 0.14% | 1.750ms | 95.60% | 1.216s | 4.054ms | 300 |
| aten::t | 0.12% | 1.475ms | 0.24% | 3.033ms | 10.110us | 300 |
| aten::relu | 0.10% | 1.285ms | 2.94% | 37.345ms | 186.725us | 200 |
Self CPU time total: 1.272s
Here we observed a 3.2x (1.272s vs 4.115s) performance improvement by tuning the backend thresholds appropriately.
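If you want a quick end-to-end check without the profiler, a simple wall-clock measurement also works. The following is a minimal sketch; run it once per environment configuration (for example, with and without TORCH_MKLDNN_MATMUL_MIN_DIM=64 exported before starting Python) and compare the printed latencies:
# Minimal wall-clock latency check; run once per environment configuration.
import time

X = torch.rand(32, 64, 64, device=device)
with torch.inference_mode():
    for _ in range(50):
        model(X)                       # warmup
    start = time.perf_counter()
    for _ in range(100):
        model(X)
    elapsed = time.perf_counter() - start

print(f"Average latency per iteration: {elapsed / 100 * 1000:.2f} ms")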
Optimize Memory Allocation Overhead with Linux Transparent Huge Pages (THP)¶
We also observed that for these larger networks, tensor memory allocations take a significant portion of the inference latency. This can be optimized by enabling Linux transparent huge page allocations from the PyTorch C10 memory allocator. The feature is currently not enabled by default because it marginally increases the memory footprint. Set the following environment variable to enable it:
$ export THP_MEM_ALLOC_ENABLE=1
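Before relying on this, you can optionally confirm that the kernel allows transparent huge pages. This sketch assumes the allocator's THP path uses madvise-based huge pages, so the kernel setting should be always or madvise rather than never:
# Optional: check the kernel THP setting via the standard Linux sysfs path.
with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
    print(f.read().strip())    # e.g. 'always [madvise] never'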
For the batch dimension of 256 and with MKLDNN fast math mode:
X = torch.rand(256, 64, 64, device=device)

with torch.set_grad_enabled(False):
    for _ in range(50):
        model(X)  # warmup
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("mymodel_inference"):
            for _ in range(100):
                model(X)

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
The following is the profiler output with THP memory allocations enabled:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
| --- | --- | --- | --- | --- | --- | --- |
| aten::addmm | 91.31% | 6.115s | 94.39% | 6.321s | 21.069ms | 300 |
| aten::clamp_min | 4.82% | 322.568ms | 4.82% | 322.568ms | 1.613ms | 200 |
| aten::copy | 3.06% | 204.602ms | 3.06% | 204.602ms | 682.007us | 300 |
| mymodel_inference | 0.61% | 40.777ms | 100.00% | 6.697s | 6.697s | 1 |
| aten::linear | 0.05% | 3.082ms | 94.51% | 6.329s | 21.097ms | 300 |
| aten::relu | 0.04% | 2.547ms | 4.85% | 325.115ms | 1.626ms | 200 |
Self CPU time total: 6.697s
This is an additional 1.08x, or 8%, improvement (6.697s vs 7.262s) on top of the already optimized MKLDNN fast math mode measured above.
Conclusion¶
In this tutorial, we covered PyTorch inference on AWS Graviton3 instances: basic usage, speeding up inference with the bfloat16 fast math kernels, comparing backends for different batch dimensions, and optimizing tensor memory allocation latency with Linux transparent huge pages. The recommendation is to use the MKLDNN backend with bfloat16 fast math mode and THP memory allocations for larger tensor shapes, and the OpenBLAS backend for smaller tensor shapes. We hope that you will give it a try!