In [1]:
# Copyright 2019 NVIDIA Corporation. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================



# Torch-TensorRT Getting Started - CitriNet

## Overview

[Citrinet](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#citrinet) is an acoustic model for speech-to-text recognition. It is a version of [QuartzNet](https://arxiv.org/pdf/1910.10261.pdf) that extends [ContextNet](https://arxiv.org/pdf/2005.03191.pdf), using subword encoding (via WordPiece tokenization) and a Squeeze-and-Excitation (SE) mechanism; Citrinet models are therefore smaller than QuartzNet models.

CitriNet models take in audio segments and transcribe them to letter, byte pair, or word piece sequences. 
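The benchmarks later in this notebook feed the model random tensors of shape `(batch_size, 80, 1488)`. As a rough sketch of what that shape represents, the snippet below treats the second dimension as 80 mel-spectrogram features and the third as time frames. The 10 ms hop between frames is an assumption (it is not stated in this notebook), and the helper names are hypothetical.

```python
N_FEATURES = 80      # second tensor dimension (assumed: mel-spectrogram features)
N_FRAMES = 1488      # third tensor dimension (time frames)
HOP_SECONDS = 0.01   # ASSUMPTION: 10 ms between consecutive frames


def approx_audio_seconds(n_frames, hop_seconds=HOP_SECONDS):
    """Approximate audio duration covered by n_frames feature frames."""
    return n_frames * hop_seconds


def input_bytes(batch_size, bytes_per_element=4):
    """Size of one FP32 input batch of shape (batch_size, 80, 1488) in bytes."""
    return batch_size * N_FEATURES * N_FRAMES * bytes_per_element


print(f"~{approx_audio_seconds(N_FRAMES):.2f} s of audio per sample")
for bs in (1, 8, 32, 128):
    print(f"batch size {bs}: {input_bytes(bs) / 2**20:.3f} MiB of FP32 input")
```

Under these assumptions, each benchmark sample corresponds to roughly 15 seconds of audio.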


### Learning objectives

This notebook demonstrates the steps for optimizing a pretrained CitriNet model with Torch-TensorRT, and running it to test the speedup obtained.

## Content
1. [Requirements](#1)
1. [Download Citrinet model](#2)
1. [Create Torch-TensorRT modules](#3)
1. [Benchmark Torch-TensorRT models](#4)
1. [Conclusion](#5)


## 1. Requirements

Follow the steps in [README](README.md) to prepare a Docker container, within which you can run this notebook. 
This notebook assumes that you are in a Jupyter environment inside a Docker container with Torch-TensorRT installed, such as an NGC monthly release of `nvcr.io/nvidia/pytorch:yy.mm-py3` (where `yy` is the last two digits of the calendar year and `mm` is the two-digit month).
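To make the tag convention concrete, here is a minimal sketch that builds an image name from a year and a month. The helper `ngc_pytorch_image` is hypothetical and only formats a string; it does not check that a given monthly release actually exists on NGC.

```python
def ngc_pytorch_image(year, month):
    """Format an NGC PyTorch image name following the yy.mm-py3 convention."""
    if month not in range(1, 13):
        raise ValueError(f"month must be between 1 and 12, got {month}")
    yy = year % 100  # keep only the last two digits of the year
    return f"nvcr.io/nvidia/pytorch:{yy:02d}.{month:02d}-py3"


# Example: the tag for a release from April 2022
print(ngc_pytorch_image(2022, 4))
```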

Now that you are inside the Docker container, the next step is to install the required dependencies.

In [2]:
# Install dependencies
!pip install wget
!apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y libsndfile1 ffmpeg
!pip install Cython

## Install NeMo
!pip install nemo_toolkit[all]==1.5.1

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Hit:1 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree 
Reading state information... Done
libsndfile1 is already the newest version (1.0.28-7ubuntu0.1).
ffmpeg is already the newest version (7:4.2.4-1ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 22 not upgraded.
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
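Before moving on, it can help to confirm that the packages used in the rest of the notebook are importable. The following is a minimal sketch using only the standard library; the helper `missing_modules` is hypothetical, and the list of module names reflects the imports that appear in this notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level modules imported or installed in this notebook
required = ["torch", "torch_tensorrt", "nemo", "wget", "Cython"]
missing = missing_modules(required)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, rerun the install cell above or check that you are inside the intended container.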











## 2. Download Citrinet model

Next, we download a pretrained NeMo Citrinet model and convert it to a TorchScript module:

In [3]:
import nemo
import torch

import nemo.collections.asr as nemo_asr
from nemo.core import typecheck
typecheck.set_typecheck_enabled(False) 

In [4]:
variant = 'stt_en_citrinet_256'

print(f"Downloading and saving {variant}...")
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name=variant)
asr_model.export(f"{variant}.ts")

Downloading and saving stt_en_citrinet_256...
[NeMo I 2022-04-21 23:12:45 cloud:56] Found existing object /root/.cache/torch/NeMo/NeMo_1.5.1/stt_en_citrinet_256/91a9cc5850784b2065e8a0aa3d526fd9/stt_en_citrinet_256.nemo.
[NeMo I 2022-04-21 23:12:45 cloud:62] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.5.1/stt_en_citrinet_256/91a9cc5850784b2065e8a0aa3d526fd9/stt_en_citrinet_256.nemo
[NeMo I 2022-04-21 23:12:45 common:728] Instantiating model from pre-trained checkpoint
[NeMo I 2022-04-21 23:12:46 mixins:146] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2022-04-21 23:12:47 modelPT:130] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
 Train config : 
 manifest_filepath: null
 sample_rate: 16000
 batch_size: 32
 trim_silence: true
 max_duration: 16.7
 shuffle: true
 is_tarred: false
 tarred_audio_filepaths: null
 use_start_end_token: false
 
[NeMo W 2022-04-21 23:12:47 modelPT:137] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
 Validation config : 
 manifest_filepath: null
 sample_rate: 16000
 batch_size: 32
 shuffle: false
 use_start_end_token: false
 
[NeMo W 2022-04-21 23:12:47 modelPT:143] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s

[NeMo I 2022-04-21 23:12:47 features:265] PADDING: 16
[NeMo I 2022-04-21 23:12:47 features:282] STFT using torch


 librosa.filters.mel(sample_rate, self.n_fft, n_mels=nfilt, fmin=lowfreq, fmax=highfreq), dtype=torch.float
 


[NeMo I 2022-04-21 23:12:49 save_restore_connector:149] Model EncDecCTCModelBPE was successfully restored from /root/.cache/torch/NeMo/NeMo_1.5.1/stt_en_citrinet_256/91a9cc5850784b2065e8a0aa3d526fd9/stt_en_citrinet_256.nemo.


[NeMo W 2022-04-21 23:12:49 export_utils:198] Swapped 0 modules
[NeMo W 2022-04-21 23:12:49 conv_asr:73] Turned off 235 masked convolutions
[NeMo W 2022-04-21 23:12:49 export_utils:198] Swapped 0 modules
 


(['stt_en_citrinet_256.ts'],
 ['nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE exported to ONNX'])

### Benchmark utility

Let us define a helper benchmarking function, then benchmark the original PyTorch model.

In [5]:
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division

import argparse
import timeit
import numpy as np
import torch
import torch_tensorrt as trtorch
import torch.backends.cudnn as cudnn

def benchmark(model, input_tensor, num_loops, model_name, batch_size):
    def timeGraph(model, input_tensor, num_loops):
        print("Warm up ...")
        with torch.no_grad():
            for _ in range(20):
                features = model(input_tensor)

        torch.cuda.synchronize()
        print("Start timing ...")
        timings = []
        with torch.no_grad():
            for i in range(num_loops):
                start_time = timeit.default_timer()
                features = model(input_tensor)
                torch.cuda.synchronize()
                end_time = timeit.default_timer()
                timings.append(end_time - start_time)
                # print("Iteration {}: {:.6f} s".format(i, end_time - start_time))
        return timings

    def printStats(graphName, timings, batch_size):
        times = np.array(timings)
        steps = len(times)
        speeds = batch_size / times
        time_mean = np.mean(times)
        time_med = np.median(times)
        time_99th = np.percentile(times, 99)
        time_std = np.std(times, ddof=0)
        speed_mean = np.mean(speeds)
        speed_med = np.median(speeds)
        msg = ("\n%s =================================\n"
               "batch size=%d, num iterations=%d\n"
               " Median samples/s: %.1f, mean: %.1f\n"
               " Median latency (s): %.6f, mean: %.6f, 99th_p: %.6f, std_dev: %.6f\n"
               ) % (graphName,
                    batch_size, steps,
                    speed_med, speed_mean,
                    time_med, time_mean, time_99th, time_std)
        print(msg)

    timings = timeGraph(model, input_tensor, num_loops)
    printStats(model_name, timings, batch_size)

precisions_str = 'fp32' # Precision (default=fp32, fp16)
variant = 'stt_en_citrinet_256' # Nemo Citrinet variant
batch_sizes = [1, 8, 32, 128] # Batch sizes (default=1,8,32,128)
trt = False # If True, infer with Torch-TensorRT engine. Else, infer with Pytorch model.
precision = torch.float32 if precisions_str =='fp32' else torch.float16

for batch_size in batch_sizes:
    if trt:
        model_name = f"{variant}_bs{batch_size}_{precision}.torch-tensorrt"
    else:
        model_name = f"{variant}.ts"

    print(f"Loading model: {model_name}")
    # Load traced model to CPU first
    model = torch.jit.load(model_name).cuda()
    cudnn.benchmark = True
    # Create random input tensor of certain size
    torch.manual_seed(12345)
    input_shape=(batch_size, 80, 1488)
    input_tensor = torch.randn(input_shape).cuda()

    # Timing graph inference
    benchmark(model, input_tensor, 50, model_name, batch_size)

Loading model: stt_en_citrinet_256.ts
Warm up ...
Start timing ...

batch size=1, num iterations=50
 Median samples/s: 102.0, mean: 102.0
 Median latency (s): 0.009802, mean: 0.009803, 99th_p: 0.009836, std_dev: 0.000014

Loading model: stt_en_citrinet_256.ts
Warm up ...
Start timing ...

batch size=8, num iterations=50
 Median samples/s: 429.1, mean: 429.1
 Median latency (s): 0.018642, mean: 0.018643, 99th_p: 0.018670, std_dev: 0.000014

Loading model: stt_en_citrinet_256.ts
Warm up ...
Start timing ...

batch size=32, num iterations=50
 Median samples/s: 551.3, mean: 551.2
 Median latency (s): 0.058047, mean: 0.058053, 99th_p: 0.058375, std_dev: 0.000106

Loading model: stt_en_citrinet_256.ts
Warm up ...
Start timing ...

batch size=128, num iterations=50
 Median samples/s: 594.1, mean: 594.1
 Median latency (s): 0.215434, mean: 0.215446, 99th_p: 0.215806, std_dev: 0.000116
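The pattern used by the `benchmark` helper above (untimed warm-up calls, then timed calls with a synchronization point before each clock read, then summary statistics) can be sketched without PyTorch or NumPy. This is an illustrative standard-library version under stated assumptions, not the notebook's implementation: `time_callable`, `percentile` and `summarize` are hypothetical names, and the `sync` argument stands in for `torch.cuda.synchronize`.

```python
import math
import statistics
import timeit


def time_callable(fn, num_loops=50, warmup=20, sync=lambda: None):
    """Call fn() `warmup` times untimed, then return `num_loops` wall-clock timings.

    `sync` runs before each clock read; for GPU work it would be something like
    torch.cuda.synchronize so that asynchronous kernels are included in the timing.
    """
    for _ in range(warmup):
        fn()
    sync()
    timings = []
    for _ in range(num_loops):
        start = timeit.default_timer()
        fn()
        sync()
        timings.append(timeit.default_timer() - start)
    return timings


def percentile(values, q):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def summarize(timings, batch_size):
    """Latency and throughput statistics in the spirit of printStats."""
    speeds = [batch_size / t for t in timings]  # samples per second, per iteration
    return {
        "median_samples_per_s": statistics.median(speeds),
        "mean_samples_per_s": statistics.fmean(speeds),
        "median_latency_s": statistics.median(timings),
        "mean_latency_s": statistics.fmean(timings),
        "p99_latency_s": percentile(timings, 99),
        "std_latency_s": statistics.pstdev(timings),  # population std (ddof=0)
    }


# Usage with a CPU-only stand-in workload
stats = summarize(time_callable(lambda: sum(range(10_000))), batch_size=1)
for key, value in stats.items():
    print(f"{key}: {value:.6f}")
```

Throughput is computed per iteration as `batch_size / latency`, which is why the reported median samples/s equals the batch size divided by the median latency.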



Confirming the GPU we are using here:

In [6]:
!nvidia-smi

Thu Apr 21 23:13:32 2022 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
| 0 NVIDIA TITAN V On | 00000000:17:00.0 Off | N/A |
| 38% 55C P2 42W / 250W | 2462MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA TITAN V On | 00000000:65:00.0 Off | N/A |
| 28% 39C P8 26W / 250W | 112MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
| 0 N/A N/A 3909 G 4MiB |
|


## 3. Create Torch-TensorRT modules

In this step, we optimize the Citrinet TorchScript module with Torch-TensorRT for several precisions and batch sizes.

In [10]:
import torch
import torch.nn as nn
import torch_tensorrt as torchtrt
import argparse

variant = "stt_en_citrinet_256"
precisions = [torch.float, torch.half]
batch_sizes = [1,8,32,128]

model = torch.jit.load(f"{variant}.ts")

for precision in precisions:
    for batch_size in batch_sizes:
        compile_settings = {
            "inputs": [torchtrt.Input(shape=[batch_size, 80, 1488])],
            "enabled_precisions": {precision},
            "workspace_size": 2000000000,
            "truncate_long_and_double": True,
        }
        print(f"Generating Torchscript-TensorRT module for batchsize {batch_size} precision {precision}")
        trt_ts_module = torchtrt.compile(model, **compile_settings)
        torch.jit.save(trt_ts_module, f"{variant}_bs{batch_size}_{precision}.torch-tensorrt")

Generating Torchscript-TensorRT module for batchsize 1 precision torch.float32
Generating Torchscript-TensorRT module for batchsize 8 precision torch.float32
Generating Torchscript-TensorRT module for batchsize 32 precision torch.float32
Generating Torchscript-TensorRT module for batchsize 128 precision torch.float32
Generating Torchscript-TensorRT module for batchsize 1 precision torch.float16
Generating Torchscript-TensorRT module for batchsize 8 precision torch.float16
Generating Torchscript-TensorRT module for batchsize 32 precision torch.float16
Generating Torchscript-TensorRT module for batchsize 128 precision torch.float16
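The loop above saves one module per (precision, batch size) pair, and the benchmark cells in the next section load them by name. As a small sketch of the naming scheme, the snippet below lists the eight expected filenames. Plain strings stand in for the dtypes so that it runs without PyTorch (the dtypes format as `torch.float32` and `torch.float16`, as the log lines above show), and `trt_module_filename` is a hypothetical helper.

```python
import itertools

variant = "stt_en_citrinet_256"
# String forms of the two dtypes used in the compile loop
precision_names = ["torch.float32", "torch.float16"]
batch_sizes = [1, 8, 32, 128]


def trt_module_filename(variant, batch_size, precision_name):
    """Mirror the f-string used when saving the compiled modules."""
    return f"{variant}_bs{batch_size}_{precision_name}.torch-tensorrt"


expected_files = [
    trt_module_filename(variant, bs, name)
    for name, bs in itertools.product(precision_names, batch_sizes)
]
for filename in expected_files:
    print(filename)
```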



## 4. Benchmark Torch-TensorRT models

Finally, we are ready to benchmark the Torch-TensorRT optimized Citrinet models.

### FP32 (single precision)

In [13]:
precisions_str = 'fp32' # Precision (default=fp32, fp16)
batch_sizes = [1, 8, 32, 128] # Batch sizes (default=1,8,32,128)
precision = torch.float32 if precisions_str =='fp32' else torch.float16
trt = True

for batch_size in batch_sizes:
    if trt:
        model_name = f"{variant}_bs{batch_size}_{precision}.torch-tensorrt"
    else:
        model_name = f"{variant}.ts"

    print(f"Loading model: {model_name}")
    # Load traced model to CPU first
    model = torch.jit.load(model_name).cuda()
    cudnn.benchmark = True
    # Create random input tensor of certain size
    torch.manual_seed(12345)
    input_shape=(batch_size, 80, 1488)
    input_tensor = torch.randn(input_shape).cuda()

    # Timing graph inference
    benchmark(model, input_tensor, 50, model_name, batch_size)

Loading model: stt_en_citrinet_256_bs1_torch.float32.torch-tensorrt
Warm up ...
Start timing ...

batch size=1, num iterations=50
 Median samples/s: 242.2, mean: 218.0
 Median latency (s): 0.004128, mean: 0.004825, 99th_p: 0.008071, std_dev: 0.001270

Loading model: stt_en_citrinet_256_bs8_torch.float32.torch-tensorrt
Warm up ...
Start timing ...

batch size=8, num iterations=50
 Median samples/s: 729.9, mean: 709.0
 Median latency (s): 0.010961, mean: 0.011388, 99th_p: 0.016114, std_dev: 0.001256

Loading model: stt_en_citrinet_256_bs32_torch.float32.torch-tensorrt
Warm up ...
Start timing ...

batch size=32, num iterations=50
 Median samples/s: 955.6, mean: 953.4
 Median latency (s): 0.033488, mean: 0.033572, 99th_p: 0.035722, std_dev: 0.000545

Loading model: stt_en_citrinet_256_bs128_torch.float32.torch-tensorrt
Warm up ...
Start timing ...

batch size=128, num iterations=50
 Median samples/s: 1065.8, mean: 1069.4
 Median latency (s): 0.120097, mean: 0.119708, 99th_p: 0.121618, std
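The Torch-TensorRT FP32 runs above show a larger gap between mean and median latency than the baseline runs did. One way to quantify this is the coefficient of variation (standard deviation divided by mean). The sketch below uses the mean and `std_dev` values copied from the recorded outputs; batch size 128 is left out because its `std_dev` is cut off in the Torch-TensorRT output. The helper name is hypothetical.

```python
# (mean latency in s, std_dev in s) copied from the recorded outputs
baseline = {1: (0.009803, 0.000014), 8: (0.018643, 0.000014), 32: (0.058053, 0.000106)}
trt_fp32 = {1: (0.004825, 0.001270), 8: (0.011388, 0.001256), 32: (0.033572, 0.000545)}


def coefficient_of_variation(mean, std):
    """Relative spread of the latency measurements."""
    return std / mean


for bs in baseline:
    cv_base = coefficient_of_variation(*baseline[bs])
    cv_trt = coefficient_of_variation(*trt_fp32[bs])
    print(f"batch size {bs:3d}: baseline {cv_base:.2%}, Torch-TensorRT FP32 {cv_trt:.2%}")
```

With only 50 iterations per configuration, the median is likely the more robust figure to compare across runs.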

### FP16 (half precision)

In [14]:
precisions_str = 'fp16' # Precision (default=fp32, fp16)
batch_sizes = [1, 8, 32, 128] # Batch sizes (default=1,8,32,128)
precision = torch.float32 if precisions_str =='fp32' else torch.float16
trt = True # Set explicitly so this cell does not depend on the FP32 cell having run

for batch_size in batch_sizes:
    if trt:
        model_name = f"{variant}_bs{batch_size}_{precision}.torch-tensorrt"
    else:
        model_name = f"{variant}.ts"

    print(f"Loading model: {model_name}")
    # Load traced model to CPU first
    model = torch.jit.load(model_name).cuda()
    cudnn.benchmark = True
    # Create random input tensor of certain size
    torch.manual_seed(12345)
    input_shape=(batch_size, 80, 1488)
    input_tensor = torch.randn(input_shape).cuda()

    # Timing graph inference
    benchmark(model, input_tensor, 50, model_name, batch_size)

Loading model: stt_en_citrinet_256_bs1_torch.float16.torch-tensorrt
Warm up ...
Start timing ...

batch size=1, num iterations=50
 Median samples/s: 288.9, mean: 272.9
 Median latency (s): 0.003462, mean: 0.003774, 99th_p: 0.006846, std_dev: 0.000820

Loading model: stt_en_citrinet_256_bs8_torch.float16.torch-tensorrt
Warm up ...
Start timing ...

batch size=8, num iterations=50
 Median samples/s: 1201.0, mean: 1190.9
 Median latency (s): 0.006661, mean: 0.006733, 99th_p: 0.008453, std_dev: 0.000368

Loading model: stt_en_citrinet_256_bs32_torch.float16.torch-tensorrt
Warm up ...
Start timing ...

batch size=32, num iterations=50
 Median samples/s: 1538.2, mean: 1516.4
 Median latency (s): 0.020804, mean: 0.021143, 99th_p: 0.024492, std_dev: 0.000973

Loading model: stt_en_citrinet_256_bs128_torch.float16.torch-tensorrt
Warm up ...
Start timing ...

batch size=128, num iterations=50
 Median samples/s: 1792.0, mean: 1777.0
 Median latency (s): 0.071428, mean: 0.072057, 99th_p: 0.076796,


## 5. Conclusion

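In this notebook, we have walked through the complete process of optimizing the Citrinet model with Torch-TensorRT. On an A100 GPU, with Torch-TensorRT, we observe a speedup of ~**2.4X** with FP32 and ~**2.9X** with FP16 at a batch size of 128. The outputs recorded in this notebook come from a TITAN V (see the `nvidia-smi` output above), where the median throughput at batch size 128 rises from 594.1 samples/s with the TorchScript model to 1065.8 samples/s with FP32 (~1.8X) and 1792.0 samples/s with FP16 (~3.0X).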
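A speedup figure is the ratio of optimized throughput to baseline throughput. As a minimal sketch, the snippet below computes it for every batch size from the median samples/s values copied from the outputs recorded in this notebook; the `speedups` helper is hypothetical.

```python
# Median samples/s per batch size, copied from the recorded outputs
baseline = {1: 102.0, 8: 429.1, 32: 551.3, 128: 594.1}   # TorchScript model
trt_fp32 = {1: 242.2, 8: 729.9, 32: 955.6, 128: 1065.8}  # Torch-TensorRT, FP32
trt_fp16 = {1: 288.9, 8: 1201.0, 32: 1538.2, 128: 1792.0}  # Torch-TensorRT, FP16


def speedups(optimized, reference):
    """Throughput ratio optimized / reference for each batch size."""
    return {bs: optimized[bs] / reference[bs] for bs in reference}


fp32 = speedups(trt_fp32, baseline)
fp16 = speedups(trt_fp16, baseline)
for bs in baseline:
    print(f"batch size {bs:3d}: FP32 {fp32[bs]:.2f}X, FP16 {fp16[bs]:.2f}X")
```

Rerunning the benchmark cells on different hardware and substituting the new numbers gives the corresponding speedups for that GPU.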

### What's next
Now it's time to try Torch-TensorRT on your own model. If you run into problems, file an issue at https://github.com/NVIDIA/Torch-TensorRT. Your involvement will help future development of Torch-TensorRT.
