(Beta) PyTorch Inference Performance Tuning on AWS Graviton Processors
========================================================================

**Author**: `Sunita Nadampalli `_

`AWS Graviton `_ is a series of ARM-based processors designed by AWS. AWS Graviton3 processors are optimized for Machine Learning (ML) workloads, including support for ``bfloat16``, the Scalable Vector Extension (SVE), and twice the Single Instruction Multiple Data (SIMD) bandwidth compared to Graviton2.

PyTorch provides native reference ATen kernels for machine learning operators like convolutions, matmul, relu, etc. These operators can be accelerated with platform-specific kernel implementations from Basic Linear Algebra Subprograms (BLAS) libraries. On AWS Graviton CPUs, MKLDNN with the Arm Compute Library (`ACL `_) and `OpenBLAS `_ provide optimized implementations for a subset of the operators. Both libraries are integrated into PyTorch starting with the PyTorch 2.0 release.

In this tutorial, we will cover how to achieve the best inference performance for a linear-layer neural network on AWS Graviton3 CPUs (`AWS c7g instance `_) with ``bfloat16`` kernels and the right backend selection.

Contents
--------

1. Basic Usage
2. Speed up inference with Bfloat16 fast math kernels
3. Improve inference performance with OpenBLAS for smaller batch dimensions
4. Optimize memory allocation overhead with Linux Transparent huge pages
5. Conclusion

.. note::
   To successfully run this tutorial and reproduce the speedup numbers shown below, you need an instance from the Graviton3 family (``c7g/r7g/m7g``) of hardware. For this tutorial, we used the `c7g.xl (4vcpu) instance `_.

Basic Usage
-----------

PyTorch natively supports AWS Graviton3 optimizations starting with PyTorch 2.0. Please refer to this `blog `_ for more details on the optimizations.

1. Install PyTorch by running the following command:

   .. code-block:: bash

      python3 -m pip install torch

2. We will start by importing the required dependencies and defining the device we will run on:

   .. code-block:: python

      import torch
      import torch.nn as nn
      from torch.profiler import profile, record_function, ProfilerActivity

      # AWS Graviton3 cpu
      device = ("cpu")
      print(f"Using {device} device")

3. Given that linear layers are at the heart of several neural networks, including transformers, we take a linear layer for this demo. We define our neural network by subclassing ``nn.Module`` and initializing the layers in ``__init__``. We construct the network with typical large language model parameters to match the real-world scenario:

   .. code-block:: python

      class MyNeuralNetwork(nn.Module):
          def __init__(self):
              super().__init__()
              self.flatten = nn.Flatten()
              self.linear_relu_stack = nn.Sequential(
                  nn.Linear(4096, 4096),
                  nn.ReLU(),
                  nn.Linear(4096, 11008),
                  nn.ReLU(),
                  nn.Linear(11008, 10),
              )

          def forward(self, x):
              x = self.flatten(x)
              logits = self.linear_relu_stack(x)
              return logits

4. Let's create an instance of ``MyNeuralNetwork``, and move it to the device:

   .. code-block:: python

      model = MyNeuralNetwork().to(device)
      print(model)

Next, let's get the prediction probabilities by passing the logits through an instance of the ``nn.Softmax`` module:

.. code-block:: python

   X = torch.rand(1, 64, 64, device=device)
   logits = model(X)
   pred_probab = nn.Softmax(dim=1)(logits)
   y_pred = pred_probab.argmax(1)
   print(f"Predicted class: {y_pred}")

output:

.. code-block::

   Predicted class: tensor([2])

Our network functionality is verified.
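Before profiling, it can be useful to confirm that the installed PyTorch build exposes the MKLDNN (oneDNN) backend that the Graviton optimizations are routed through. This is an optional sanity check, not a required step of the tutorial; it only uses the standard ``torch.backends.mkldnn`` and ``torch.__config__`` introspection APIs:

.. code-block:: python

   import torch

   # The Graviton optimizations require PyTorch 2.0 or later and the MKLDNN backend.
   print(f"PyTorch version: {torch.__version__}")
   print(f"MKLDNN available: {torch.backends.mkldnn.is_available()}")

   # The build configuration string lists the math libraries this build was compiled with.
   print(torch.__config__.show())

Next, we will profile the performance.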
Let's check two different scenarios: small and large batch dimensions.

**Scenario 1:** A larger batch dimension, for example, 256:

.. code-block:: python

   # warm it up first and loop over multiple times to have enough execution time
   X = torch.rand(256, 64, 64, device=device)

   with torch.set_grad_enabled(False):
       for _ in range(50):
           model(X)  # Warmup
       with profile(activities=[ProfilerActivity.CPU]) as prof:
           with record_function("mymodel_inference"):
               for _ in range(100):
                   model(X)

   print(prof.key_averages().table(sort_by="self_cpu_time_total"))

Following is the profiler output with the default PyTorch configuration:

.. table::
   :widths: auto

   ====================== ============ ============ ============= ============ ============== ============
   Name                   Self CPU %   Self CPU     CPU total %   CPU total    CPU time avg   # of Calls
   ====================== ============ ============ ============= ============ ============== ============
   aten::addmm            97.61%       15.813s      98.61%        15.977s      53.255ms       300
   aten::clamp_min        1.09%        177.032ms    1.09%         177.032ms    885.160us      200
   aten::copy             1.00%        162.054ms    1.00%         162.054ms    540.180us      300
   mymodel_inference      0.22%        35.738ms     100.00%       16.201s      16.201s        1
   aten::linear           0.02%        2.955ms      98.66%        15.985s      53.282ms       300
   aten::t                0.01%        2.421ms      0.03%         5.043ms      16.810us       300
   aten::relu             0.01%        2.356ms      1.11%         179.388ms    896.940us      200
   ====================== ============ ============ ============= ============ ============== ============

**Self CPU time total:** 16.201s

Speed up Inference with ``bfloat16`` Fast Math Kernels
--------------------------------------------------------

AWS Graviton3 processors support `bfloat16 MMLA instructions `_. The Arm Compute Library (`ACL `_) provides optimized ``bfloat16`` General Matrix Multiplication (GEMM) kernels for AWS Graviton processors, and they are integrated into PyTorch through the MKLDNN backend starting with PyTorch 2.0. Inference performance can be optimized with these fast math GEMM kernels. The fast math mode is not enabled by default because the kernels perform GEMM in ``bfloat16`` precision instead of ``float``, which results in a slight drop in model inference accuracy. However, the accuracy drop is within the ``cosine similarity`` threshold defined for the ``bfloat16`` backend in the ``torchbench`` test suite, and is therefore acceptable for the majority of applications. To enable the fast math GEMM kernels, set the following environment variable:

.. code-block:: bash

   $ export DNNL_DEFAULT_FPMATH_MODE=BF16

When you run the above inference script, you should see the following profiler output with the MKLDNN fast math mode enabled:

.. table::
   :widths: auto

   ====================== ============ ============ ============= ============ ============== ============
   Name                   Self CPU %   Self CPU     CPU total %   CPU total    CPU time avg   # of Calls
   ====================== ============ ============ ============= ============ ============== ============
   aten::addmm            95.61%       6.943s       97.10%        7.052s       23.507ms       300
   aten::clamp_min        2.31%        167.653ms    2.31%         167.653ms    838.265us      200
   aten::copy             1.48%        107.593ms    1.48%         107.593ms    358.643us      300
   mymodel_inference      0.43%        31.167ms     100.00%       7.262s       7.262s         1
   aten::linear           0.04%        2.911ms      97.21%        7.060s       23.533ms       300
   aten::t                0.03%        2.414ms      0.07%         4.892ms      16.307us       300
   aten::relu             0.03%        2.281ms      2.34%         169.934ms    849.670us      200
   ====================== ============ ============ ============= ============ ============== ============

**Self CPU time total:** 7.262s

This is around a ``2x (7.262s vs 16.201s)`` performance improvement with the ``bfloat16`` fast math kernels.
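The fast math mode can also be enabled from inside the Python script itself. A minimal sketch, under the assumption that the environment variable is set before ``torch`` is imported so the oneDNN backend sees it when it initializes (exporting it from the shell, as shown above, remains the approach used in the rest of this tutorial):

.. code-block:: python

   import os

   # Assumption: set the variable before importing torch so that oneDNN
   # picks it up when the backend initializes.
   os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

   import torch  # imported after the environment variable is set

Next, let's look at the smaller batch dimension scenario.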
**Scenario 2:** A smaller batch dimension, for example, 32:

.. code-block:: python

   X = torch.rand(32, 64, 64, device=device)

   with torch.set_grad_enabled(False):
       for _ in range(50):
           model(X)  # Warmup
       with profile(activities=[ProfilerActivity.CPU]) as prof:
           with record_function("mymodel_inference"):
               for _ in range(100):
                   model(X)

   print(prof.key_averages().table(sort_by="self_cpu_time_total"))

You should see the following profiler output when the above script is run with the default PyTorch configuration:

.. table::
   :widths: auto

   ====================== ============ ============ ============= ============ ============== ============
   Name                   Self CPU %   Self CPU     CPU total %   CPU total    CPU time avg   # of Calls
   ====================== ============ ============ ============= ============ ============== ============
   aten::addmm            95.51%       5.821s       97.04%        5.914s       19.713ms       300
   aten::clamp_min        2.33%        142.244ms    2.33%         142.244ms    711.220us      200
   aten::copy             1.51%        92.322ms     1.51%         92.322ms     307.740us      300
   mymodel_inference      0.45%        27.713ms     100.00%       6.094s       6.094s         1
   aten::linear           0.04%        2.495ms      97.16%        5.921s       19.736ms       300
   aten::t                0.03%        2.131ms      0.07%         4.441ms      14.803us       300
   aten::relu             0.03%        1.942ms      2.37%         144.186ms    720.930us      200
   ====================== ============ ============ ============= ============ ============== ============

**Self CPU time total:** 6.094s

The following is the profiler output when run with the MKLDNN fast math mode enabled:

.. code-block:: bash

   $ export DNNL_DEFAULT_FPMATH_MODE=BF16

.. table::
   :widths: auto

   ====================== ============ ============ ============= ============ ============== ============
   Name                   Self CPU %   Self CPU     CPU total %   CPU total    CPU time avg   # of Calls
   ====================== ============ ============ ============= ============ ============== ============
   aten::addmm            93.31%       3.848s       95.66%        3.944s       13.148ms       300
   aten::clamp_min        3.43%        141.309ms    3.43%         141.309ms    706.545us      200
   aten::copy             2.33%        95.916ms     2.33%         95.916ms     319.720us      300
   mymodel_inference      0.67%        27.431ms     100.00%       4.123s       4.123s         1
   aten::linear           0.06%        2.471ms      95.83%        3.951s       13.170ms       300
   aten::t                0.05%        2.027ms      0.10%         4.243ms      14.143us       300
   aten::relu             0.05%        1.928ms      3.47%         143.237ms    716.185us      200
   ====================== ============ ============ ============= ============ ============== ============

**Self CPU time total:** 4.123s

The MKLDNN fast math mode yields approximately a **1.47x (4.123s vs 6.094s)** performance improvement for smaller batch dimensions. Although this improvement is noteworthy, the overall performance still leaves room for improvement: for smaller batch compute, the runtime overhead (weight reorders and kernel launch time) of the oneDNN and ACL backend outweighs the compute benefit of the ACL GEMM kernels.

Improve Inference Performance with OpenBLAS for Smaller Batch Dimensions
--------------------------------------------------------------------------

The inference performance for smaller batch dimensions can be improved by offloading the smaller shapes from the MKLDNN backend to OpenBLAS. We are working on making the backend selection automatic, with robust heuristics, in future releases. Until the heuristics are implemented, smaller shapes can be offloaded to OpenBLAS by increasing the threshold for MKLDNN backend selection. In the following example, we use ``64`` as the threshold, so that an input with a ``batch dimension of 32`` is not dispatched to MKLDNN. Instead, it is dispatched to OpenBLAS.

.. code-block:: bash

   $ export TORCH_MKLDNN_MATMUL_MIN_DIM=64

Here is the profiler output with the OpenBLAS backend:
.. table::
   :widths: auto

   ====================== ============ ============ ============= ============ ============== ============
   Name                   Self CPU %   Self CPU     CPU total %   CPU total    CPU time avg   # of Calls
   ====================== ============ ============ ============= ============ ============== ============
   aten::addmm            96.25%       1.958s       97.51%        1.984s       6.612ms        300
   aten::clamp_min        1.28%        26.124ms     1.28%         26.124ms     130.620us      200
   aten::copy             1.23%        24.951ms     1.23%         24.951ms     83.170us       300
   mymodel_inference      0.86%        17.423ms     100.00%       2.034s       2.034s         1
   aten::linear           0.08%        1.691ms      97.74%        1.988s       6.628ms        300
   aten::t                0.07%        1.520ms      0.14%         2.945ms      9.817us        300
   aten::relu             0.06%        1.258ms      1.35%         27.382ms     136.910us      200
   ====================== ============ ============ ============= ============ ============== ============

**Self CPU time total:** 2.034s

As you can see above, switching to the OpenBLAS backend doubled the performance **(2.034s vs 4.123s)** compared to the default MKLDNN backend configuration. The difference becomes even more significant for smaller batch dimensions, for example, for a batch dimension of 10:

.. code-block:: python

   X = torch.rand(10, 64, 64, device=device)

   with torch.set_grad_enabled(False):
       for _ in range(50):
           model(X)  # Warmup
       with profile(activities=[ProfilerActivity.CPU]) as prof:
           with record_function("mymodel_inference"):
               for _ in range(100):
                   model(X)

   print(prof.key_averages().table(sort_by="self_cpu_time_total"))

The following is the profiler output with the MKLDNN fast math mode:

.. table::
   :widths: auto

   ====================== ============ ============ ============= ============ ============== ============
   Name                   Self CPU %   Self CPU     CPU total %   CPU total    CPU time avg   # of Calls
   ====================== ============ ============ ============= ============ ============== ============
   aten::addmm            87.81%       3.613s       91.90%        3.781s       12.604ms       300
   aten::clamp_min        7.18%        295.437ms    7.18%         295.437ms    1.477ms        200
   aten::copy             4.07%        167.516ms    4.07%         167.516ms    558.387us      300
   mymodel_inference      0.67%        27.708ms     100.00%       4.115s       4.115s         1
   aten::linear           0.06%        2.499ms      92.06%        3.788s       12.627ms       300
   aten::t                0.05%        1.982ms      0.11%         4.385ms      14.617us       300
   aten::relu             0.05%        1.932ms      7.23%         297.369ms    1.487ms        200
   ====================== ============ ============ ============= ============ ============== ============

**Self CPU time total:** 4.115s

and the following is the profiler output with the OpenBLAS backend:

.. code-block:: bash

   $ export TORCH_MKLDNN_MATMUL_MIN_DIM=64

.. table::
   :widths: auto

   ====================== ============ ============ ============= ============ ============== ============
   Name                   Self CPU %   Self CPU     CPU total %   CPU total    CPU time avg   # of Calls
   ====================== ============ ============ ============= ============ ============== ============
   aten::addmm            92.66%       1.179s       95.23%        1.211s       4.038ms        300
   aten::clamp_min        2.83%        36.060ms     2.83%         36.060ms     180.300us      200
   aten::copy             2.52%        32.013ms     2.52%         32.013ms     106.710us      300
   mymodel_inference      1.38%        17.521ms     100.00%       1.272s       1.272s         1
   aten::linear           0.14%        1.750ms      95.60%        1.216s       4.054ms        300
   aten::t                0.12%        1.475ms      0.24%         3.033ms      10.110us       300
   aten::relu             0.10%        1.285ms      2.94%         37.345ms     186.725us      200
   ====================== ============ ============ ============= ============ ============== ============

**Self CPU time total:** 1.272s

Here we observed a **3.2x (1.272s vs 4.115s)** performance improvement by tuning the backend threshold appropriately.
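The best threshold depends on your model's layer shapes and the instance size. As one way to pick it empirically, the sketch below times the model at a few batch dimensions with ``torch.utils.benchmark.Timer``; it assumes the ``MyNeuralNetwork`` model and ``device`` defined earlier in this tutorial, and is meant to be run once per configuration (for example, once with the default settings and once with ``TORCH_MKLDNN_MATMUL_MIN_DIM`` exported) so the printed timings can be compared:

.. code-block:: python

   import torch
   import torch.utils.benchmark as benchmark

   # Assumes MyNeuralNetwork and device are defined as earlier in this tutorial.
   model = MyNeuralNetwork().to(device).eval()

   def run_inference(model, X):
       # inference_mode() disables autograd bookkeeping while timing
       with torch.inference_mode():
           return model(X)

   for batch in (10, 32, 64, 256):
       X = torch.rand(batch, 64, 64, device=device)
       timer = benchmark.Timer(
           stmt="run_inference(model, X)",
           globals={"run_inference": run_inference, "model": model, "X": X},
       )
       # blocked_autorange() repeats the statement and reports a timing summary
       print(f"batch dimension = {batch}")
       print(timer.blocked_autorange(min_run_time=2.0))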
Optimize Memory Allocation Overhead with Linux Transparent Huge Pages (THP)
-----------------------------------------------------------------------------

We also observed that for these larger networks, tensor memory allocations take a significant portion of the inference latency. This can be optimized by enabling Linux transparent huge page allocations from the PyTorch C10 memory allocator. The feature is currently not enabled by default because it increases the memory footprint marginally. Set the following environment variable to enable it:

.. code-block:: bash

   $ export THP_MEM_ALLOC_ENABLE=1

For the batch dimension of 256 and with the MKLDNN fast math mode:

.. code-block:: python

   X = torch.rand(256, 64, 64, device=device)

   with torch.set_grad_enabled(False):
       for _ in range(50):
           model(X)  # Warmup
       with profile(activities=[ProfilerActivity.CPU]) as prof:
           with record_function("mymodel_inference"):
               for _ in range(100):
                   model(X)

   print(prof.key_averages().table(sort_by="self_cpu_time_total"))

The following is the profiler output with THP memory allocations enabled:

.. table::
   :widths: auto

   ====================== ============ ============ ============= ============ ============== ============
   Name                   Self CPU %   Self CPU     CPU total %   CPU total    CPU time avg   # of Calls
   ====================== ============ ============ ============= ============ ============== ============
   aten::addmm            91.31%       6.115s       94.39%        6.321s       21.069ms       300
   aten::clamp_min        4.82%        322.568ms    4.82%         322.568ms    1.613ms        200
   aten::copy             3.06%        204.602ms    3.06%         204.602ms    682.007us      300
   mymodel_inference      0.61%        40.777ms     100.00%       6.697s       6.697s         1
   aten::linear           0.05%        3.082ms      94.51%        6.329s       21.097ms       300
   aten::relu             0.04%        2.547ms      4.85%         325.115ms    1.626ms        200
   ====================== ============ ============ ============= ============ ============== ============

**Self CPU time total:** 6.697s

This is an additional **1.08x or 8% (6.697s vs 7.262s)** improvement on top of the already optimized MKLDNN fast math mode measured above.

Conclusion
------------

In this tutorial, we covered PyTorch inference on AWS Graviton3 instances: basic usage, the speedup from the ``bfloat16`` fast math kernels, how the best backend choice depends on the batch dimension, and how to reduce tensor memory allocation latency with Linux transparent huge pages. The recommendation is to use the MKLDNN backend with Bfloat16 fast math mode and THP memory allocations for larger tensor shapes, and the OpenBLAS backend for smaller tensor shapes.
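As a quick recap, the environment variables discussed in this tutorial can be exported together before launching an inference script. This is only a convenience summary of the settings above; the script name is a placeholder, and the ``TORCH_MKLDNN_MATMUL_MIN_DIM`` line applies only when your workload is dominated by small batch dimensions:

.. code-block:: bash

   # bfloat16 fast math GEMM kernels (benefits larger tensor shapes)
   $ export DNNL_DEFAULT_FPMATH_MODE=BF16

   # transparent huge page allocations from the C10 memory allocator
   $ export THP_MEM_ALLOC_ENABLE=1

   # raise the MKLDNN dispatch threshold so smaller shapes go to OpenBLAS
   $ export TORCH_MKLDNN_MATMUL_MIN_DIM=64

   # placeholder name for a script containing the inference snippets above
   $ python3 inference.py

We hope that you will give it a try!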