Torch-TensorRT Dynamo Backend

This guide presents Torch-TensorRT dynamo backend which optimizes Pytorch models using TensorRT in an Ahead-Of-Time fashion.

Using the Dynamo backend

Pytorch 2.1 introduced torch.export APIs which can export graphs from Pytorch programs into ExportedProgram objects. Torch-TensorRT dynamo backend compiles these ExportedProgram objects and optimizes them using TensorRT. Here’s a simple usage of the dynamo backend

import torch
import torch_tensorrt

model = MyModel().eval().cuda()
inputs = [torch.randn((1, 3, 224, 224), dtype=torch.float32).cuda()]
exp_program = torch.export.export(model, tuple(inputs))
trt_gm = torch_tensorrt.dynamo.compile(exp_program, inputs) # Output is a torch.fx.GraphModule
trt_gm(*inputs)

Note

torch_tensorrt.dynamo.compile is the main API for users to interact with Torch-TensorRT dynamo backend. The input type of the model should be ExportedProgram (ideally the output of torch.export.export or torch_tensorrt.dynamo.trace (discussed in the section below)) and output type is a torch.fx.GraphModule object.

Customizeable Settings

There are lot of options for users to customize their settings for optimizing with TensorRT. Some of the frequently used options are as follows:

inputs - For static shapes, this can be a list of torch tensors or torch_tensorrt.Input objects. For dynamic shapes, this should be a list of torch_tensorrt.Input objects.
enabled_precisions - Set of precisions that TensorRT builder can use during optimization.
truncate_long_and_double - Truncates long and double values to int and floats respectively.
torch_executed_ops - Operators which are forced to be executed by Torch.
min_block_size - Minimum number of consecutive operators required to be executed as a TensorRT segment.

The complete list of options can be found here

Note

We do not support INT precision currently in Dynamo. Support for this currently exists in

our Torchscript IR. We plan to implement similar support for dynamo in our next release.

Under the hood

Under the hood, torch_tensorrt.dynamo.compile performs the following on the graph.

Lowering - Applies lowering passes to add/remove operators for optimal conversion.
Partitioning - Partitions the graph into Pytorch and TensorRT segments based on the min_block_size and torch_executed_ops field.
Conversion - Pytorch ops get converted into TensorRT ops in this phase.
Optimization - Post conversion, we build the TensorRT engine and embed this inside the pytorch graph.

Tracing

torch_tensorrt.dynamo.trace can be used to trace a Pytorch graphs and produce ExportedProgram. This internally performs some decompositions of operators for downstream optimization. The ExportedProgram can then be used with torch_tensorrt.dynamo.compile API. If you have dynamic input shapes in your model, you can use this torch_tensorrt.dynamo.trace to export the model with dynamic shapes. Alternatively, you can use torch.export with constraints directly as well.

import torch
import torch_tensorrt

inputs = [torch_tensorrt.Input(min_shape=(1, 3, 224, 224),
                              opt_shape=(4, 3, 224, 224),
                              max_shape=(8, 3, 224, 224),
                              dtype=torch.float32)]
model = MyModel().eval()
exp_program = torch_tensorrt.dynamo.trace(model, inputs)

Torch-TensorRT Dynamo Backend

Using the Dynamo backend

Customizeable Settings

Under the hood

Tracing

Docs

Tutorials

Resources