Optimizing Vision Transformer Model for Deployment
Created On: Mar 15, 2021 | Last Updated: Jan 19, 2024 | Last Verified: Nov 05, 2024
Vision Transformer models apply attention-based transformer models, which were introduced in Natural Language Processing and have achieved a wide range of state-of-the-art (SOTA) results there, to Computer Vision tasks. Facebook's Data-efficient Image Transformers (DeiT) is a Vision Transformer model trained on ImageNet for image classification.
In this tutorial, we will first cover what DeiT is and how to use it, then go through the complete steps of scripting, quantizing, optimizing, and using the model in iOS and Android apps. Along the way, we will compare the performance of the quantized, optimized models with the non-quantized, non-optimized ones to show the benefits of applying quantization and optimization.
What is DeiT
Convolutional Neural Networks (CNNs) have been the main models for image classification since deep learning took off in 2012, but CNNs typically require hundreds of millions of images for training to achieve SOTA results. DeiT is a vision transformer model that requires far less data and computing resources for training to compete with the leading CNNs in image classification. This is made possible by two key components of DeiT:
Data augmentation that simulates training on a much larger dataset;
Native distillation that allows the transformer network to learn from a CNN’s output.
DeiT shows that Transformers can be successfully applied to computer vision tasks, with limited access to data and resources. For more details on DeiT, see the repo and paper.
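To make the distillation component more concrete, the sketch below shows one way a student transformer can learn from a CNN teacher's output. It assumes the hard-label variant with equal weighting of the two terms, as described in the DeiT paper; the function name, argument names, and toy tensors are hypothetical and are not taken from the DeiT repository.

```python
import torch
import torch.nn.functional as F


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    """Average of two cross-entropy terms (assumed equal weighting).

    cls_logits:     student output trained against the true labels
    dist_logits:    student output trained against the teacher's predictions
    teacher_logits: output of a frozen CNN teacher for the same batch
    targets:        ground-truth class indices
    """
    # The teacher's hard decision becomes the label for the distillation term.
    teacher_labels = teacher_logits.argmax(dim=1)
    loss_cls = F.cross_entropy(cls_logits, targets)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * loss_cls + 0.5 * loss_dist


# Toy batch: 2 samples, 4 classes (random values, for illustration only).
torch.manual_seed(0)
student_cls = torch.randn(2, 4)
student_dist = torch.randn(2, 4)
teacher = torch.randn(2, 4)
labels = torch.tensor([0, 1])
print(hard_distillation_loss(student_cls, student_dist, teacher, labels).item())
```

Using the teacher's argmax rather than its full probability distribution keeps the sketch short; a soft-label variant would replace the second term with a divergence between the two distributions.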
Classifying Images with DeiT
Follow the README.md at the DeiT repository for detailed information on how to classify images using DeiT, or for a quick test, first install the required packages:
pip install torch torchvision timm pandas requests
To run in Google Colab, install dependencies by running the following command:
!pip install timm pandas requests
then run the script below:
from PIL import Image
import torch
import timm
import requests
import torchvision.transforms as transforms
from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
print(torch.__version__)
# 1.8.0 when this tutorial was written; the run shown below printed 2.5.0+cu124
model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)
model.eval()
transform = transforms.Compose([
    transforms.Resize(256, interpolation=3),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD),
])
img = Image.open(requests.get("https://raw.githubusercontent.com/pytorch/ios-demo-app/master/HelloWorld/HelloWorld/HelloWorld/image.png", stream=True).raw)
img = transform(img)[None,]
out = model(img)
clsidx = torch.argmax(out)
print(clsidx.item())
2.5.0+cu124
Downloading: "https://github.com/facebookresearch/deit/zipball/main" to /var/lib/ci-user/.cache/torch/hub/main.zip
/usr/local/lib/python3.10/dist-packages/timm/models/registry.py:4: FutureWarning:
Importing from timm.models.registry is deprecated, please import via timm.models
/usr/local/lib/python3.10/dist-packages/timm/models/layers/__init__.py:48: FutureWarning:
Importing from timm.models.layers is deprecated, please import via timm.layers
/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:63: UserWarning:
Overwriting deit_tiny_patch16_224 in registry with models.deit_tiny_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:78: UserWarning:
Overwriting deit_small_patch16_224 in registry with models.deit_small_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:93: UserWarning:
Overwriting deit_base_patch16_224 in registry with models.deit_base_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:108: UserWarning:
Overwriting deit_tiny_distilled_patch16_224 in registry with models.deit_tiny_distilled_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:123: UserWarning:
Overwriting deit_small_distilled_patch16_224 in registry with models.deit_small_distilled_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:138: UserWarning:
Overwriting deit_base_distilled_patch16_224 in registry with models.deit_base_distilled_patch16_224. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:153: UserWarning:
Overwriting deit_base_patch16_384 in registry with models.deit_base_patch16_384. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
/var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main/models.py:168: UserWarning:
Overwriting deit_base_distilled_patch16_384 in registry with models.deit_base_distilled_patch16_384. This is because the name being registered conflicts with an existing name. Please check if this is not expected.
Downloading: "https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth" to /var/lib/ci-user/.cache/torch/hub/checkpoints/deit_base_patch16_224-b5f2ef4d.pth
100%|##########| 330M/330M [00:01<00:00, 191MB/s]
269
The output should be 269, which, according to the ImageNet list of class index to labels file, maps to timber wolf, grey wolf, gray wolf, Canis lupus.
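As a rough illustration of that index-to-label lookup, the sketch below turns a logits tensor into the top predictions with their probabilities. The `top_k_labels` helper and the partial `labels` dictionary are hypothetical; only the entry for index 269 comes from the text above, and the logits are synthetic rather than produced by DeiT.

```python
import torch

# Hypothetical, partial index-to-label mapping; a real one has 1000 entries.
labels = {269: "timber wolf, grey wolf, gray wolf, Canis lupus"}


def top_k_labels(logits, labels, k=5):
    """Return (class index, label, probability) for the k most likely classes."""
    probs = torch.softmax(logits, dim=1)[0]   # first (only) image in the batch
    top_probs, top_idx = probs.topk(k)
    return [
        (idx, labels.get(idx, "<unknown>"), prob)
        for idx, prob in zip(top_idx.tolist(), top_probs.tolist())
    ]


# Synthetic logits with a clear winner at index 269, standing in for model(img).
logits = torch.zeros(1, 1000)
logits[0, 269] = 10.0
for idx, label, prob in top_k_labels(logits, labels, k=3):
    print(idx, label, "{:.3f}".format(prob))
```

With the tutorial's model, `logits` would be the `out` tensor returned by `model(img)`.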
Now that we have verified that we can use the DeiT model to classify images, let’s see how to modify the model so it can run on iOS and Android apps.
Scripting DeiT
To use the model on mobile, we first need to script the model. See the Script and Optimize recipe for a quick overview. Run the code below to convert the DeiT model used in the previous step to the TorchScript format that can run on mobile.
model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)
model.eval()
scripted_model = torch.jit.script(model)
scripted_model.save("fbdeit_scripted.pt")
Using cache found in /var/lib/ci-user/.cache/torch/hub/facebookresearch_deit_main
The scripted model file fbdeit_scripted.pt, of size about 346MB, is generated.
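The same script, save, and reload round trip can be tried on a much smaller module before committing to the 346MB download. This is a minimal sketch on a hypothetical `TinyClassifier`; the class and the temporary file name are illustrative and unrelated to DeiT.

```python
import os
import tempfile

import torch


class TinyClassifier(torch.nn.Module):
    """Hypothetical stand-in for a real model: a single linear layer."""

    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x)


torch.manual_seed(0)
tiny = TinyClassifier().eval()

# Convert to TorchScript and write it to disk, as done for DeiT above.
scripted_tiny = torch.jit.script(tiny)
path = os.path.join(tempfile.mkdtemp(), "tiny_scripted.pt")
scripted_tiny.save(path)

# Reload the file and confirm that it reproduces the eager model's output.
reloaded = torch.jit.load(path)
x = torch.randn(2, 8)
same = torch.allclose(tiny(x), reloaded(x))
size_kb = os.path.getsize(path) / 1024
print("outputs match:", same, "| file size: {:.1f}KB".format(size_kb))
```

Checking the reloaded file against the eager model is a cheap way to catch scripting problems before the model reaches a mobile app.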
Quantizing DeiT
To reduce the trained model size significantly while keeping the inference accuracy about the same, quantization can be applied to the model. Because DeiT is built on a transformer, we can easily apply dynamic quantization to it: dynamic quantization works best for LSTM and transformer models (see here for more details).
Now run the code below:
# Use 'x86' for server inference (the old 'fbgemm' is still available but 'x86' is the recommended default) and ``qnnpack`` for mobile inference.
backend = "x86" # replacing this with ``qnnpack`` makes inference with the quantized model much slower on this notebook
model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend
quantized_model = torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_quantized_model = torch.jit.script(quantized_model)
scripted_quantized_model.save("fbdeit_scripted_quantized.pt")
/usr/local/lib/python3.10/dist-packages/torch/ao/quantization/observer.py:229: UserWarning:
Please use quant_min and quant_max to specify the range for observers. reduce_range will be deprecated in a future release of PyTorch.
This generates the scripted and quantized version of the model, fbdeit_scripted_quantized.pt, with a size of about 89MB, a 74% reduction from the non-quantized model size of 346MB!
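The size effect of dynamic quantization can be reproduced on a small scale. The sketch below assumes a hypothetical stack of `Linear` layers as a stand-in for DeiT and measures the saved TorchScript file before and after quantization; `saved_size_mb` and the layer sizes are illustrative choices, and the default quantized engine of the local PyTorch build is used.

```python
import os
import tempfile

import torch


def saved_size_mb(scripted_module):
    """Save a scripted module to a temporary file and return its size in MB."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "module.pt")
        scripted_module.save(path)
        return os.path.getsize(path) / 1e6


# Hypothetical stand-in for DeiT: the weights live almost entirely in Linear layers.
toy = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
).eval()

# Same call as in the tutorial: only Linear layers are quantized, to int8 weights.
toy_quantized = torch.quantization.quantize_dynamic(
    toy, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8
)

float_mb = saved_size_mb(torch.jit.script(toy))
quant_mb = saved_size_mb(torch.jit.script(toy_quantized))
reduction = (float_mb - quant_mb) / float_mb * 100
print("float: {:.2f}MB, quantized: {:.2f}MB, reduction: {:.0f}%".format(
    float_mb, quant_mb, reduction))

# The same formula applied to the file sizes quoted in this tutorial.
print("DeiT reduction: {:.0f}%".format((346 - 89) / 346 * 100))
```

Storing weights as 8-bit integers instead of 32-bit floats is what drives the reduction, so models dominated by `Linear` weights benefit the most.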
You can use the scripted_quantized_model to generate the same inference result:
269
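A similar before-and-after check can be sketched on a hypothetical toy classifier: run the float and the dynamically quantized versions on the same batch, then compare the raw outputs and the predicted classes. The model, batch size, and variable names below are illustrative assumptions, not part of the tutorial's code.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy classifier standing in for DeiT.
toy = torch.nn.Sequential(
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 10),
).eval()
toy_quantized = torch.quantization.quantize_dynamic(
    toy, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(8, 64)
with torch.no_grad():
    float_out = toy(x)
    quant_out = toy_quantized(x)

# Largest element-wise deviation introduced by quantization.
max_abs_diff = (float_out - quant_out).abs().max().item()
# Fraction of samples for which both models predict the same class.
agreement = (float_out.argmax(dim=1) == quant_out.argmax(dim=1)).float().mean().item()
print("max abs diff: {:.4f}, top-1 agreement: {:.0%}".format(max_abs_diff, agreement))
```

For a real model, the comparison is more meaningful on a held-out validation set than on a single image or a random batch.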
Optimizing DeiT
The final step before using the quantized and scripted model on mobile is to optimize it:
from torch.utils.mobile_optimizer import optimize_for_mobile
optimized_scripted_quantized_model = optimize_for_mobile(scripted_quantized_model)
optimized_scripted_quantized_model.save("fbdeit_optimized_scripted_quantized.pt")
The generated fbdeit_optimized_scripted_quantized.pt file has about the same size as the quantized, scripted, but non-optimized model. The inference result remains the same.
269
Using Lite Interpreter
To see how much model size reduction and inference speed-up the Lite Interpreter can provide, let’s create the lite version of the model.
optimized_scripted_quantized_model._save_for_lite_interpreter("fbdeit_optimized_scripted_quantized_lite.ptl")
ptl = torch.jit.load("fbdeit_optimized_scripted_quantized_lite.ptl")
Although the lite model size is comparable to the non-lite version, an inference speed-up is expected when running the lite version on mobile.
Comparing Inference Speed
To see how the inference speed differs for the five models (the original model, the scripted model, the quantized-and-scripted model, the optimized-quantized-and-scripted model, and the lite model), run the code below:
with torch.autograd.profiler.profile(use_cuda=False) as prof1:
    out = model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof2:
    out = scripted_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof3:
    out = scripted_quantized_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof4:
    out = optimized_scripted_quantized_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof5:
    out = ptl(img)
print("original model: {:.2f}ms".format(prof1.self_cpu_time_total/1000))
print("scripted model: {:.2f}ms".format(prof2.self_cpu_time_total/1000))
print("scripted & quantized model: {:.2f}ms".format(prof3.self_cpu_time_total/1000))
print("scripted & quantized & optimized model: {:.2f}ms".format(prof4.self_cpu_time_total/1000))
print("lite model: {:.2f}ms".format(prof5.self_cpu_time_total/1000))
original model: 99.87ms
scripted model: 109.17ms
scripted & quantized model: 124.67ms
scripted & quantized & optimized model: 137.27ms
lite model: 118.68ms
The results from a run on Google Colab are:
original model: 1236.69ms
scripted model: 1226.72ms
scripted & quantized model: 593.19ms
scripted & quantized & optimized model: 598.01ms
lite model: 600.72ms
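The two sets of numbers above differ considerably, and each figure comes from a single forward pass. One possible way to get steadier numbers, assuming a few untimed warm-up calls and a median over repeated runs are acceptable, is sketched below. `median_latency_ms` is a hypothetical helper built on the standard library, and the workload is a stand-in for a model call.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=10):
    """Return the median wall-clock time of ``fn()`` in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; with the tutorial's objects this could be
# ``lambda: scripted_quantized_model(img)``.
latency = median_latency_ms(lambda: sum(i * i for i in range(10_000)))
print("median latency: {:.2f}ms".format(latency))
```

The median is less sensitive than the mean to an occasional slow run caused by other activity on the machine.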
The following results summarize the inference time taken by each model and the percentage reduction of each model relative to the original model.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Model': ['original model','scripted model', 'scripted & quantized model', 'scripted & quantized & optimized model', 'lite model']})
df = pd.concat([df, pd.DataFrame([
    ["{:.2f}ms".format(prof1.self_cpu_time_total/1000), "0%"],
    ["{:.2f}ms".format(prof2.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof2.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof3.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof3.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof4.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof4.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof5.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof5.self_cpu_time_total)/prof1.self_cpu_time_total*100)]],
    columns=['Inference Time', 'Reduction'])], axis=1)
print(df)
"""
Model Inference Time Reduction
0 original model 1236.69ms 0%
1 scripted model 1226.72ms 0.81%
2 scripted & quantized model 593.19ms 52.03%
3 scripted & quantized & optimized model 598.01ms 51.64%
4 lite model 600.72ms 51.43%
"""
Model ... Reduction
0 original model ... 0%
1 scripted model ... -9.32%
2 scripted & quantized model ... -24.83%
3 scripted & quantized & optimized model ... -37.46%
4 lite model ... -18.84%
[5 rows x 3 columns]
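The Reduction column is the relative drop in inference time with respect to the original model, so a negative value means the model was slower. As a small worked example, the sketch below applies that formula to the Google Colab timings quoted in this tutorial; `reduction_pct` and the `colab_ms` dictionary are hypothetical names introduced for illustration.

```python
def reduction_pct(baseline_ms, model_ms):
    """Percentage by which model_ms is lower than baseline_ms (negative means slower)."""
    return (baseline_ms - model_ms) / baseline_ms * 100


# Inference times in milliseconds, copied from the Google Colab run quoted above.
colab_ms = {
    "original model": 1236.69,
    "scripted model": 1226.72,
    "scripted & quantized model": 593.19,
    "scripted & quantized & optimized model": 598.01,
    "lite model": 600.72,
}

baseline = colab_ms["original model"]
for name, ms in colab_ms.items():
    print("{:<40}{:>10.2f}ms{:>9.2f}%".format(name, ms, reduction_pct(baseline, ms)))
```

Keeping the formula in one helper avoids repeating the same expression for every row of the table.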