torch_tensorrt.dynamo

Functions

torch_tensorrt.dynamo.compile(exported_program: ExportedProgram, inputs: Optional[Sequence[Sequence[Any]]] = None, *, arg_inputs: Optional[Sequence[Sequence[Any]]] = None, kwarg_inputs: Optional[dict[Any, Any]] = None, device: Optional[Union[Device, device, str]] = None, disable_tf32: bool = False, assume_dynamic_shape_support: bool = False, sparse_weights: bool = False, enabled_precisions: Union[Set[Union[dtype, dtype]], Tuple[Union[dtype, dtype]]] = {dtype.f32}, engine_capability: EngineCapability = EngineCapability.STANDARD, debug: bool = False, num_avg_timing_iters: int = 1, workspace_size: int = 0, dla_sram_size: int = 1048576, dla_local_dram_size: int = 1073741824, dla_global_dram_size: int = 536870912, truncate_double: bool = False, require_full_compilation: bool = False, min_block_size: int = 5, torch_executed_ops: Optional[Collection[Union[Callable[[...], Any], str]]] = None, torch_executed_modules: Optional[List[str]] = None, pass_through_build_failures: bool = False, max_aux_streams: Optional[int] = None, version_compatible: bool = False, optimization_level: Optional[int] = None, use_python_runtime: bool = False, use_fast_partitioner: bool = True, enable_experimental_decompositions: bool = False, dryrun: bool = False, hardware_compatible: bool = False, timing_cache_path: str = '/tmp/torch_tensorrt_engine_cache/timing_cache.bin', lazy_engine_init: bool = False, cache_built_engines: bool = False, reuse_cached_engines: bool = False, engine_cache_dir: str = '/tmp/torch_tensorrt_engine_cache', engine_cache_size: int = 5368709120, custom_engine_cache: Optional[BaseEngineCache] = None, use_explicit_typing: bool = False, use_fp32_acc: bool = False, refit_identical_engine_weights: bool = False, strip_engine_weights: bool = False, immutable_weights: bool = True, enable_weight_streaming: bool = False, tiling_optimization_level: str = 'none', l2_limit_for_tiling: int = - 1, **kwargs: Any) → GraphModule[source]

Compile an ExportedProgram module for NVIDIA GPUs using TensorRT

Takes an existing ExportedProgram and a set of settings to configure the compiler, and converts it into a graph module which calls equivalent TensorRT engines

Converts specifically the forward method of the source module

Parameters

exported_program (torch.export.ExportedProgram) – Source module, obtained by running torch.export on a torch.nn.Module

inputs (Tuple[Any, ...]) –

List of specifications of input shape, dtype and memory layout for inputs to the module. This argument is required. Input sizes can be specified as torch sizes, tuples or lists. Data types can be specified using torch datatypes or torch_tensorrt datatypes, and you can use either torch devices or the torch_tensorrt device type enum to select the device type.

inputs=[
    torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
    torch_tensorrt.Input(
        min_shape=(1, 224, 224, 3),
        opt_shape=(1, 512, 512, 3),
        max_shape=(1, 1024, 1024, 3),
        dtype=torch.int32,
        format=torch.channels_last,
    ), # Dynamic input shape for input #2
    torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
]

Keyword Arguments

arg_inputs (Tuple[Any, ...]) – Same as inputs. Alias for better understanding with kwarg_inputs.
kwarg_inputs (dict[Any, ...]) – Optional, kwarg inputs to the module forward function.
device (Union(Device, torch.device, dict)) –
Target device for TensorRT engines to run on
```
device=torch_tensorrt.Device("dla:1", allow_gpu_fallback=True)
```
disable_tf32 (bool) – Force FP32 layers to use the traditional FP32 format instead of the default TF32 behavior, which rounds the inputs to 10-bit mantissas before multiplying but accumulates the sum using 23-bit mantissas
assume_dynamic_shape_support (bool) – Setting this to True enables the converters to work for both dynamic and static shapes. Default: False
sparse_weights (bool) – Enable sparsity for convolution and fully connected layers.
enabled_precisions (Set(Union(torch.dtype, torch_tensorrt.dtype))) – The set of datatypes that TensorRT can use when selecting kernels
debug (bool) – Enable debuggable engine
engine_capability (EngineCapability) – Restrict kernel selection to safe GPU kernels or safe DLA kernels
num_avg_timing_iters (int) – Number of averaging timing iterations used to select kernels
workspace_size (int) – Maximum size of workspace given to TensorRT
dla_sram_size (int) – Fast software managed RAM used by DLA to communicate within a layer.
dla_local_dram_size (int) – Host RAM used by DLA to share intermediate tensor data across operations
dla_global_dram_size (int) – Host RAM used by DLA to store weights and metadata for execution
truncate_double (bool) – Truncate weights provided in double (float64) to float32
calibrator (Union(torch_tensorrt._C.IInt8Calibrator, tensorrt.IInt8Calibrator)) – Calibrator object which will provide data to the PTQ system for INT8 Calibration
require_full_compilation (bool) – Require modules to be compiled end to end or return an error as opposed to returning a hybrid graph where operations that cannot be run in TensorRT are run in PyTorch
min_block_size (int) – The minimum number of contiguous TensorRT-convertible operations required to run a set of operations in TensorRT
torch_executed_ops (Collection[Target]) – Set of aten operators that must be run in PyTorch. An error will be thrown if this set is not empty but require_full_compilation is True
torch_executed_modules (List[str]) – List of modules that must be run in PyTorch. An error will be thrown if this list is not empty but require_full_compilation is True
pass_through_build_failures (bool) – Error out if there are issues during compilation (only applicable to torch.compile workflows)
max_aux_streams (Optional[int]) – Maximum number of auxiliary streams in the engine
version_compatible (bool) – Build the TensorRT engines compatible with future versions of TensorRT (Restrict to lean runtime operators to provide version forward compatibility for the engines)
optimization_level (Optional[int]) – Setting a higher optimization level allows TensorRT to spend more engine-building time searching for more optimization options. The resulting engine may perform better than an engine built with a lower optimization level. The default optimization level is 3. Valid values are integers from 0 to the maximum optimization level, which is currently 5. Setting it higher than the maximum level results in identical behavior to the maximum level.
use_python_runtime (bool) – Return a graph using a pure Python runtime; this reduces the options for serialization
use_fast_partitioner (bool) – Use the adjacency-based partitioning scheme instead of the global partitioner. Adjacency partitioning is faster but may not be optimal. Use the global partitioner (False) for best performance
enable_experimental_decompositions (bool) – Use the full set of operator decompositions. These decompositions may not be tested but serve to make the graph easier to convert to TensorRT, potentially increasing the amount of graphs run in TensorRT.
dryrun (bool) – Toggle for “Dryrun” mode, running everything except conversion to TRT and logging outputs
hardware_compatible (bool) – Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
timing_cache_path (str) – Path to the timing cache if it exists (or) where it will be saved after compilation
lazy_engine_init (bool) – Defer setting up engines until the compilation of all engines is complete. Can allow larger models with multiple graph breaks to compile but can lead to oversubscription of GPU memory at runtime.
cache_built_engines (bool) – Whether to save the compiled TRT engines to storage
reuse_cached_engines (bool) – Whether to load the compiled TRT engines from storage
engine_cache_dir (Optional[str]) – Directory to store the cached TRT engines
engine_cache_size (Optional[int]) – Maximum hard-disk space (bytes) to use for the engine cache; the default is 5 GiB (5368709120 bytes). If the cache exceeds this size, the oldest engines will be removed by default
custom_engine_cache (Optional[BaseEngineCache]) – Engine cache instance to use for saving and loading engines. Users can provide their own engine cache by inheriting from BaseEngineCache. If used, engine_cache_dir and engine_cache_size will be ignored.
use_explicit_typing (bool) – This flag enables strong typing in TensorRT compilation, which respects the precisions set in the PyTorch model. This is useful when users have mixed precision graphs.
use_fp32_acc (bool) – This option inserts cast to FP32 nodes around matmul layers and TensorRT ensures the accumulation of matmul happens in FP32. Use this only when FP16 precision is configured in enabled_precisions.
refit_identical_engine_weights (bool) – Refit engines with identical weights. This is useful when the same model is compiled multiple times with different inputs and the weights are the same. This will save time by reusing the same engine for different inputs.
strip_engine_weights (bool) – Strip engine weights from the serialized engine. This is useful when the engine is to be deployed in an environment where the weights are not required.
immutable_weights (bool) – Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, strip_engine_weights and refit_identical_engine_weights will be ignored.
enable_weight_streaming (bool) – Enable weight streaming.
tiling_optimization_level (str) – The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for better tiling strategy. We currently support [“none”, “fast”, “moderate”, “full”].
l2_limit_for_tiling (int) – The target L2 cache usage limit (in bytes) for tiling optimization (default is -1, which means no limit).
**kwargs – Any
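
To make min_block_size concrete, here is a simplified sketch. It is not the actual Torch-TensorRT partitioner: it treats the model as a linear sequence of operations, groups contiguous convertible operations into runs, and sends a run to TensorRT only if it has at least min_block_size operations. The function name plan_blocks and the operation names are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def plan_blocks(
    ops: Sequence[Tuple[str, bool]], min_block_size: int = 5
) -> List[Tuple[str, List[str]]]:
    """Assign each contiguous run of ops to "tensorrt" or "pytorch".

    ops is a list of (op_name, is_convertible) pairs in execution order.
    """
    blocks: List[Tuple[str, List[str]]] = []
    for convertible, run in groupby(ops, key=lambda op: op[1]):
        names = [name for name, _ in run]
        # A convertible run goes to TensorRT only if it is long enough.
        long_enough = len(names) >= min_block_size
        backend = "tensorrt" if (convertible and long_enough) else "pytorch"
        # Merge neighbouring PyTorch runs so the plan stays compact.
        if blocks and blocks[-1][0] == "pytorch" and backend == "pytorch":
            blocks[-1][1].extend(names)
        else:
            blocks.append((backend, names))
    return blocks


ops = [
    ("conv", True), ("relu", True), ("custom_op", False),
    ("conv", True), ("bn", True), ("relu", True),
]
plan = plan_blocks(ops, min_block_size=3)
for backend, names in plan:
    print(backend, names)
```

With min_block_size=3 the leading two-operation run is too short and stays in PyTorch together with the unsupported operation, while the trailing three-operation run becomes a TensorRT block. Lowering min_block_size to 2 would turn the leading run into a second TensorRT block.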

Returns

Compiled FX Module, when run it will execute via TensorRT

Return type

torch.fx.GraphModule
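
A minimal usage sketch for compile, assuming torch and torch_tensorrt are installed and a CUDA GPU is available. The model TinyNet, the helper name compile_tiny_model and the chosen keyword values are illustrative assumptions, not recommendations. The heavy imports are deferred into the function so the block can be loaded on a machine without the GPU stack.

```python
from typing import Any, Dict

# Keyword arguments taken from the parameter list above; the values are
# illustrative choices only.
COMPILE_KWARGS: Dict[str, Any] = {
    "min_block_size": 1,
    "require_full_compilation": False,
    "optimization_level": 3,
    "use_python_runtime": False,
}


def compile_tiny_model() -> Any:
    """Export a small module and compile it with torch_tensorrt.dynamo.compile."""
    # Imported lazily: both packages and a CUDA GPU are assumed to be available.
    import torch
    import torch_tensorrt

    class TinyNet(torch.nn.Module):  # hypothetical example model
        def __init__(self):
            super().__init__()
            self.conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)

        def forward(self, x):
            return torch.relu(self.conv(x))

    model = TinyNet().eval().cuda()
    example = torch.randn(1, 3, 224, 224, device="cuda")

    # torch.export produces the ExportedProgram that compile() expects.
    exported_program = torch.export.export(model, (example,))
    trt_module = torch_tensorrt.dynamo.compile(
        exported_program,
        inputs=[example],
        enabled_precisions={torch.float32},
        **COMPILE_KWARGS,
    )
    # The result is a torch.fx.GraphModule, called like the original module.
    return trt_module(example)
```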

torch_tensorrt.dynamo.trace(mod: torch.nn.modules.module.Module | torch.fx.graph_module.GraphModule, inputs: Optional[Tuple[Any, ...]] = None, *, arg_inputs: Optional[Tuple[Any, ...]] = None, kwarg_inputs: Optional[dict[Any, Any]] = None, **kwargs: Any) → ExportedProgram[source]

Exports a torch.export.ExportedProgram from a torch.nn.Module or torch.fx.GraphModule, specifically for compilation with Torch-TensorRT

Exports a torch.export.ExportedProgram from either a torch.nn.Module or torch.fx.GraphModule. Runs specific operator decompositions geared towards compilation by Torch-TensorRT’s dynamo frontend.

Parameters

mod (torch.nn.Module | torch.fx.GraphModule) – Source module to later be compiled by Torch-TensorRT’s dynamo frontend

inputs (Tuple[Any, ...]) –

List of specifications of input shape, dtype and memory layout for inputs to the module. This argument is required. Input sizes can be specified as torch sizes, tuples or lists. Data types can be specified using torch datatypes or torch_tensorrt datatypes, and you can use either torch devices or the torch_tensorrt device type enum to select the device type.

inputs=[
    torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
    torch_tensorrt.Input(
        min_shape=(1, 224, 224, 3),
        opt_shape=(1, 512, 512, 3),
        max_shape=(1, 1024, 1024, 3),
        dtype=torch.int32,
        format=torch.channels_last,
    ), # Dynamic input shape for input #2
    torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
]

Keyword Arguments

arg_inputs (Tuple[Any, ...]) – Same as inputs. Alias for better understanding with kwarg_inputs.
kwarg_inputs (dict[Any, ...]) – Optional, kwarg inputs to the module forward function.
device (Union(torch.device, dict)) –
Target device for TensorRT engines to run on
```
device=torch.device("cuda:0")
```
debug (bool) – Enable debuggable engine
enable_experimental_decompositions (bool) – Use the full set of operator decompositions. These decompositions may not be tested but serve to make the graph easier to convert to TensorRT, potentially increasing the amount of graphs run in TensorRT.
**kwargs – Any

Returns

Exported program, ready to be compiled by Torch-TensorRT’s dynamo frontend

Return type

torch.export.ExportedProgram
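
A sketch of how trace can feed compile with a dynamic input shape, assuming torch and torch_tensorrt are installed and a CUDA GPU is available. The helper names check_shape_range and trace_then_compile are hypothetical. The range check is a convenience added here: it reflects the expectation that min_shape, opt_shape and max_shape have the same rank and are ordered in every dimension.

```python
from typing import Any, Tuple

Shape = Tuple[int, ...]


def check_shape_range(min_shape: Shape, opt_shape: Shape, max_shape: Shape) -> None:
    """Raise ValueError unless min <= opt <= max holds in every dimension."""
    if not (len(min_shape) == len(opt_shape) == len(max_shape)):
        raise ValueError("min, opt and max shapes must have the same rank")
    for lo, opt, hi in zip(min_shape, opt_shape, max_shape):
        if not lo <= opt <= hi:
            raise ValueError(f"expected {lo} <= {opt} <= {hi}")


def trace_then_compile(
    model: Any, min_shape: Shape, opt_shape: Shape, max_shape: Shape
) -> Any:
    """Trace a module for Torch-TensorRT, then compile the exported program."""
    check_shape_range(min_shape, opt_shape, max_shape)

    # Imported lazily: both packages and a CUDA GPU are assumed to be available.
    import torch
    import torch_tensorrt

    dynamic_input = torch_tensorrt.Input(
        min_shape=min_shape,
        opt_shape=opt_shape,
        max_shape=max_shape,
        dtype=torch.float32,
    )
    # trace() returns a torch.export.ExportedProgram ...
    exported_program = torch_tensorrt.dynamo.trace(model, arg_inputs=(dynamic_input,))
    # ... which compile() turns into a TensorRT-backed torch.fx.GraphModule.
    return torch_tensorrt.dynamo.compile(exported_program, arg_inputs=(dynamic_input,))
```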

torch_tensorrt.dynamo.export(gm: GraphModule, cross_compile_flag: Optional[bool] = False) → ExportedProgram[source]

Export the result of TensorRT compilation into the desired output format.

Parameters

gm (torch.fx.GraphModule) – Compiled Torch-TensorRT module, generated by torch_tensorrt.dynamo.compile
inputs (torch.Tensor) – Torch input tensors
cross_compile_flag (bool) – Flag indicating whether cross-compilation is enabled

torch_tensorrt.dynamo.convert_exported_program_to_serialized_trt_engine(exported_program: ExportedProgram, inputs: Optional[Sequence[Sequence[Any]]] = None, *, arg_inputs: Optional[Sequence[Sequence[Any]]] = None, kwarg_inputs: Optional[dict[Any, Any]] = None, enabled_precisions: Union[Set[Union[dtype, dtype]], Tuple[Union[dtype, dtype]]] = {dtype.f32}, debug: bool = False, assume_dynamic_shape_support: bool = False, workspace_size: int = 0, min_block_size: int = 5, torch_executed_ops: Optional[Set[str]] = None, pass_through_build_failures: bool = False, max_aux_streams: Optional[int] = None, version_compatible: bool = False, optimization_level: Optional[int] = None, use_python_runtime: Optional[bool] = False, truncate_double: bool = False, use_fast_partitioner: bool = True, enable_experimental_decompositions: bool = False, device: Device = Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation: bool = False, disable_tf32: bool = False, sparse_weights: bool = False, engine_capability: EngineCapability = EngineCapability.STANDARD, num_avg_timing_iters: int = 1, dla_sram_size: int = 1048576, dla_local_dram_size: int = 1073741824, dla_global_dram_size: int = 536870912, calibrator: object = None, allow_shape_tensors: bool = False, timing_cache_path: str = '/tmp/torch_tensorrt_engine_cache/timing_cache.bin', use_explicit_typing: bool = False, use_fp32_acc: bool = False, refit_identical_engine_weights: bool = False, strip_engine_weights: bool = False, immutable_weights: bool = True, enable_weight_streaming: bool = False, tiling_optimization_level: str = 'none', l2_limit_for_tiling: int = - 1, **kwargs: Any) → bytes[source]

Convert an ExportedProgram to a serialized TensorRT engine

Converts an ExportedProgram to a serialized TensorRT engine given a dictionary of conversion settings

Parameters

exported_program (torch.export.ExportedProgram) – Source module

Keyword Arguments

inputs (Optional[Sequence[Input | torch.Tensor]]) –

Required. List of specifications of input shape, dtype and memory layout for inputs to the module. Input sizes can be specified as torch sizes, tuples or lists. Data types can be specified using torch datatypes or torch_tensorrt datatypes, and you can use either torch devices or the torch_tensorrt device type enum to select the device type.

inputs=[
      torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
      torch_tensorrt.Input(
          min_shape=(1, 224, 224, 3),
          opt_shape=(1, 512, 512, 3),
          max_shape=(1, 1024, 1024, 3),
          dtype=torch.int32,
          format=torch.channels_last,
      ), # Dynamic input shape for input #2
      torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
  ]

enabled_precisions (Optional[Set[torch.dtype | _enums.dtype]]) – The set of datatypes that TensorRT can use
debug (bool) – Whether to print out verbose debugging information
workspace_size (int) – Workspace TRT is allowed to use for the module (0 is default)
min_block_size (int) – Minimum number of operators per TRT-Engine Block
torch_executed_ops (Set[str]) – Set of operations to run in Torch, regardless of converter coverage
pass_through_build_failures (bool) – Whether to fail on TRT engine build errors (True) or not (False)
max_aux_streams (Optional[int]) – Maximum number of allowed auxiliary TRT streams for each engine
version_compatible (bool) – Provide version forward-compatibility for engine plan files
optimization_level (Optional[int]) – Builder optimization 0-5, higher levels imply longer build time, searching for more optimization options. TRT defaults to 3
use_python_runtime (Optional[bool]) – Whether to strictly use Python runtime or C++ runtime. To auto-select a runtime based on C++ dependency presence (preferentially choosing C++ runtime if available), leave the argument as None
truncate_double (bool) – Whether to truncate float64 TRT engine inputs or weights to float32
use_fast_partitioner (bool) – Whether to use the fast or global graph partitioning system
enable_experimental_decompositions (bool) – Whether to enable all core aten decompositions or only a selected subset of them
device (Device) – GPU to compile the model on
require_full_compilation (bool) – Whether to require the graph is fully compiled in TensorRT. Only applicable for ir=”dynamo”; has no effect for torch.compile path
disable_tf32 (bool) – Whether to disable TF32 computation for TRT layers
sparse_weights (bool) – Whether to allow the builder to use sparse weights
engine_capability (trt.EngineCapability) – Restrict kernel selection to safe GPU kernels or safe DLA kernels
num_avg_timing_iters (int) – Number of averaging timing iterations used to select kernels
dla_sram_size (int) – Fast software managed RAM used by DLA to communicate within a layer.
dla_local_dram_size (int) – Host RAM used by DLA to share intermediate tensor data across operations
dla_global_dram_size (int) – Host RAM used by DLA to store weights and metadata for execution
calibrator (Union(torch_tensorrt._C.IInt8Calibrator, tensorrt.IInt8Calibrator)) – Calibrator object which will provide data to the PTQ system for INT8 Calibration
allow_shape_tensors – (Experimental) Allow aten::size to output shape tensors using IShapeLayer in TensorRT
timing_cache_path (str) – Path to the timing cache if it exists (or) where it will be saved after compilation
use_explicit_typing (bool) – This flag enables strong typing in TensorRT compilation, which respects the precisions set in the PyTorch model. This is useful when users have mixed precision graphs.
use_fp32_acc (bool) – This option inserts cast to FP32 nodes around matmul layers and TensorRT ensures the accumulation of matmul happens in FP32. Use this only when FP16 precision is configured in enabled_precisions.
refit_identical_engine_weights (bool) – Refit engines with identical weights. This is useful when the same model is compiled multiple times with different inputs and the weights are the same. This will save time by reusing the same engine for different inputs.
strip_engine_weights (bool) – Strip engine weights from the serialized engine. This is useful when the engine is to be deployed in an environment where the weights are not required.
immutable_weights (bool) – Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, strip_engine_weights and refit_identical_engine_weights will be ignored.
enable_weight_streaming (bool) – Enable weight streaming.
tiling_optimization_level (str) – The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for better tiling strategy. We currently support [“none”, “fast”, “moderate”, “full”].
l2_limit_for_tiling (int) – The target L2 cache usage limit (in bytes) for tiling optimization (default is -1, which means no limit).

Returns

Serialized TensorRT engine, can either be saved to a file or deserialized via TensorRT APIs

Return type

bytes
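
Because the function returns plain bytes, saving the engine is ordinary file I/O. The sketch below assumes torch_tensorrt is installed with a CUDA GPU for the build step; the helper names write_engine and build_and_save_engine are hypothetical, and the demonstration at the end uses placeholder bytes rather than a real engine.

```python
import tempfile
from pathlib import Path
from typing import Any, Sequence


def write_engine(engine_bytes: bytes, path: Path) -> int:
    """Write a serialized engine to disk and return the number of bytes written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.write_bytes(engine_bytes)


def build_and_save_engine(
    exported_program: Any, example_inputs: Sequence[Any], path: Path
) -> int:
    """Convert an ExportedProgram to a serialized TensorRT engine and save it."""
    # Imported lazily: torch_tensorrt and a CUDA GPU are assumed to be available.
    import torch_tensorrt

    engine_bytes = torch_tensorrt.dynamo.convert_exported_program_to_serialized_trt_engine(
        exported_program,
        inputs=list(example_inputs),
    )
    return write_engine(engine_bytes, path)


# Demonstration of the file handling only, with placeholder bytes.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "engines" / "model.engine"
    written = write_engine(b"placeholder-bytes", target)
    round_trip = target.read_bytes()
```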

torch_tensorrt.dynamo.cross_compile_for_windows(exported_program: ExportedProgram, inputs: Optional[Sequence[Sequence[Any]]] = None, *, arg_inputs: Optional[Sequence[Sequence[Any]]] = None, kwarg_inputs: Optional[dict[Any, Any]] = None, device: Optional[Union[Device, device, str]] = None, disable_tf32: bool = False, assume_dynamic_shape_support: bool = False, sparse_weights: bool = False, enabled_precisions: Union[Set[Union[dtype, dtype]], Tuple[Union[dtype, dtype]]] = {dtype.f32}, engine_capability: EngineCapability = EngineCapability.STANDARD, debug: bool = False, num_avg_timing_iters: int = 1, workspace_size: int = 0, dla_sram_size: int = 1048576, dla_local_dram_size: int = 1073741824, dla_global_dram_size: int = 536870912, truncate_double: bool = False, require_full_compilation: bool = False, min_block_size: int = 5, torch_executed_ops: Optional[Collection[Union[Callable[[...], Any], str]]] = None, torch_executed_modules: Optional[List[str]] = None, pass_through_build_failures: bool = False, max_aux_streams: Optional[int] = None, version_compatible: bool = False, optimization_level: Optional[int] = None, use_python_runtime: bool = False, use_fast_partitioner: bool = True, enable_experimental_decompositions: bool = False, dryrun: bool = False, hardware_compatible: bool = False, timing_cache_path: str = '/tmp/torch_tensorrt_engine_cache/timing_cache.bin', lazy_engine_init: bool = False, cache_built_engines: bool = False, reuse_cached_engines: bool = False, engine_cache_dir: str = '/tmp/torch_tensorrt_engine_cache', engine_cache_size: int = 5368709120, custom_engine_cache: Optional[BaseEngineCache] = None, use_explicit_typing: bool = False, use_fp32_acc: bool = False, refit_identical_engine_weights: bool = False, strip_engine_weights: bool = False, immutable_weights: bool = True, enable_weight_streaming: bool = False, tiling_optimization_level: str = 'none', l2_limit_for_tiling: int = - 1, **kwargs: Any) → GraphModule[source]

Compile an ExportedProgram module using TensorRT on Linux for inference on Windows

Takes an exported program and a set of settings to configure the compiler, and converts methods to AOT graphs which call equivalent TensorRT engines

Parameters

exported_program (torch.export.ExportedProgram) – Source module, obtained by running torch.export on a torch.nn.Module

inputs (Tuple[Any, ...]) –

List of specifications of input shape, dtype and memory layout for inputs to the module. This argument is required. Input sizes can be specified as torch sizes, tuples or lists. Data types can be specified using torch datatypes or torch_tensorrt datatypes, and you can use either torch devices or the torch_tensorrt device type enum to select the device type.

inputs=[
    torch_tensorrt.Input((1, 3, 224, 224)), # Static NCHW input shape for input #1
    torch_tensorrt.Input(
        min_shape=(1, 224, 224, 3),
        opt_shape=(1, 512, 512, 3),
        max_shape=(1, 1024, 1024, 3),
        dtype=torch.int32,
        format=torch.channels_last,
    ), # Dynamic input shape for input #2
    torch.randn((1, 3, 224, 244)) # Use an example tensor and let torch_tensorrt infer settings
]

Keyword Arguments

arg_inputs (Tuple[Any, ...]) – Same as inputs. Alias for better understanding with kwarg_inputs.
kwarg_inputs (dict[Any, ...]) – Optional, kwarg inputs to the module forward function.
device (Union(Device, torch.device, dict)) –
Target device for TensorRT engines to run on
```
device=torch_tensorrt.Device("dla:1", allow_gpu_fallback=True)
```
disable_tf32 (bool) – Force FP32 layers to use the traditional FP32 format instead of the default TF32 behavior, which rounds the inputs to 10-bit mantissas before multiplying but accumulates the sum using 23-bit mantissas
assume_dynamic_shape_support (bool) – Setting this to True enables the converters to work for both dynamic and static shapes. Default: False
sparse_weights (bool) – Enable sparsity for convolution and fully connected layers.
enabled_precisions (Set(Union(torch.dtype, torch_tensorrt.dtype))) – The set of datatypes that TensorRT can use when selecting kernels
debug (bool) – Enable debuggable engine
engine_capability (EngineCapability) – Restrict kernel selection to safe GPU kernels or safe DLA kernels
num_avg_timing_iters (int) – Number of averaging timing iterations used to select kernels
workspace_size (int) – Maximum size of workspace given to TensorRT
dla_sram_size (int) – Fast software managed RAM used by DLA to communicate within a layer.
dla_local_dram_size (int) – Host RAM used by DLA to share intermediate tensor data across operations
dla_global_dram_size (int) – Host RAM used by DLA to store weights and metadata for execution
truncate_double (bool) – Truncate weights provided in double (float64) to float32
calibrator (Union(torch_tensorrt._C.IInt8Calibrator, tensorrt.IInt8Calibrator)) – Calibrator object which will provide data to the PTQ system for INT8 Calibration
require_full_compilation (bool) – Require modules to be compiled end to end or return an error as opposed to returning a hybrid graph where operations that cannot be run in TensorRT are run in PyTorch
min_block_size (int) – The minimum number of contiguous TensorRT-convertible operations required to run a set of operations in TensorRT
torch_executed_ops (Collection[Target]) – Set of aten operators that must be run in PyTorch. An error will be thrown if this set is not empty but require_full_compilation is True
torch_executed_modules (List[str]) – List of modules that must be run in PyTorch. An error will be thrown if this list is not empty but require_full_compilation is True
pass_through_build_failures (bool) – Error out if there are issues during compilation (only applicable to torch.compile workflows)
max_aux_streams (Optional[int]) – Maximum number of auxiliary streams in the engine
version_compatible (bool) – Build the TensorRT engines compatible with future versions of TensorRT (Restrict to lean runtime operators to provide version forward compatibility for the engines)
optimization_level (Optional[int]) – Setting a higher optimization level allows TensorRT to spend more engine-building time searching for more optimization options. The resulting engine may perform better than an engine built with a lower optimization level. The default optimization level is 3. Valid values are integers from 0 to the maximum optimization level, which is currently 5. Setting it higher than the maximum level results in identical behavior to the maximum level.
use_python_runtime (bool) – Return a graph using a pure Python runtime; this reduces the options for serialization
use_fast_partitioner (bool) – Use the adjacency-based partitioning scheme instead of the global partitioner. Adjacency partitioning is faster but may not be optimal. Use the global partitioner (False) for best performance
enable_experimental_decompositions (bool) – Use the full set of operator decompositions. These decompositions may not be tested but serve to make the graph easier to convert to TensorRT, potentially increasing the amount of graphs run in TensorRT.
dryrun (bool) – Toggle for “Dryrun” mode, running everything except conversion to TRT and logging outputs
hardware_compatible (bool) – Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
timing_cache_path (str) – Path to the timing cache if it exists (or) where it will be saved after compilation
lazy_engine_init (bool) – Defer setting up engines until the compilation of all engines is complete. Can allow larger models with multiple graph breaks to compile but can lead to oversubscription of GPU memory at runtime.
cache_built_engines (bool) – Whether to save the compiled TRT engines to storage
reuse_cached_engines (bool) – Whether to load the compiled TRT engines from storage
engine_cache_dir (Optional[str]) – Directory to store the cached TRT engines
engine_cache_size (Optional[int]) – Maximum hard-disk space (bytes) to use for the engine cache; the default is 5 GiB (5368709120 bytes). If the cache exceeds this size, the oldest engines will be removed by default
custom_engine_cache (Optional[BaseEngineCache]) – Engine cache instance to use for saving and loading engines. Users can provide their own engine cache by inheriting from BaseEngineCache. If used, engine_cache_dir and engine_cache_size will be ignored.
use_explicit_typing (bool) – This flag enables strong typing in TensorRT compilation, which respects the precisions set in the PyTorch model. This is useful when users have mixed precision graphs.
use_fp32_acc (bool) – This option inserts cast to FP32 nodes around matmul layers and TensorRT ensures the accumulation of matmul happens in FP32. Use this only when FP16 precision is configured in enabled_precisions.
refit_identical_engine_weights (bool) – Refit engines with identical weights. This is useful when the same model is compiled multiple times with different inputs and the weights are the same. This will save time by reusing the same engine for different inputs.
strip_engine_weights (bool) – Strip engine weights from the serialized engine. This is useful when the engine is to be deployed in an environment where the weights are not required.
immutable_weights (bool) – Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, strip_engine_weights and refit_identical_engine_weights will be ignored.
enable_weight_streaming (bool) – Enable weight streaming.
tiling_optimization_level (str) – The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for better tiling strategy. We currently support [“none”, “fast”, “moderate”, “full”].
l2_limit_for_tiling (int) – The target L2 cache usage limit (in bytes) for tiling optimization (default is -1, which means no limit).
**kwargs – Any

Returns

Compiled FX Module, when run it will execute via TensorRT

Return type

torch.fx.GraphModule
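
The size-valued defaults in the signature above are raw byte counts. As a worked check of what they mean, the sketch below converts them to binary units; the helper name human_size is hypothetical and the values are copied from the signature.

```python
# Defaults copied from the function signature (all values are in bytes).
DEFAULT_SIZES = {
    "dla_sram_size": 1048576,
    "dla_local_dram_size": 1073741824,
    "dla_global_dram_size": 536870912,
    "engine_cache_size": 5368709120,
}


def human_size(num_bytes: int) -> str:
    """Format a byte count using binary units (1 KiB = 1024 B)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024
    raise AssertionError("unreachable")


for name, size in DEFAULT_SIZES.items():
    print(f"{name}: {human_size(size)}")
```

This gives 1 MiB for dla_sram_size, 1 GiB for dla_local_dram_size, 512 MiB for dla_global_dram_size and 5 GiB for engine_cache_size.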

torch_tensorrt.dynamo.save_cross_compiled_exported_program(gm: GraphModule, file_path: str) → None[source]

Save cross compiled exported program to disk.

Parameters

gm (torch.fx.GraphModule) – Cross compiled Torch-TensorRT module
file_path (str) – the file path where the exported program will be saved to disk

torch_tensorrt.dynamo.load_cross_compiled_exported_program(file_path: str = '') → Any[source]

Load an ExportedProgram file in Windows which was previously cross compiled in Linux

Parameters: file_path (str) – Path to file on the disk
Raises: ValueError – If the API is not called on Windows, the file does not exist, or the file is not a valid ExportedProgram file
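
The three cross-compilation functions above form a two-machine workflow: compile and save on Linux, then load on Windows. The sketch below assumes torch_tensorrt is installed on both machines; the helper names are hypothetical, and the explicit platform guard is an addition for clarity rather than part of the library.

```python
import platform
from typing import Any, Sequence


def require_platform(expected: str) -> None:
    """Fail early if a step runs on the wrong operating system."""
    actual = platform.system()
    if actual != expected:
        raise RuntimeError(f"this step must run on {expected}, not {actual}")


def cross_compile_on_linux(
    exported_program: Any, example_inputs: Sequence[Any], file_path: str
) -> None:
    """Step 1 (Linux): build Windows-targeted engines and save them to disk."""
    require_platform("Linux")
    import torch_tensorrt  # assumed to be installed with a CUDA GPU available

    trt_gm = torch_tensorrt.dynamo.cross_compile_for_windows(
        exported_program, inputs=list(example_inputs)
    )
    torch_tensorrt.dynamo.save_cross_compiled_exported_program(trt_gm, file_path)


def load_on_windows(file_path: str) -> Any:
    """Step 2 (Windows): load the file produced by step 1."""
    require_platform("Windows")
    import torch_tensorrt  # assumed to be installed on the Windows machine

    return torch_tensorrt.dynamo.load_cross_compiled_exported_program(file_path)
```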

torch_tensorrt.dynamo.refit_module_weights(compiled_module: torch.fx.graph_module.GraphModule | torch.export.exported_program.ExportedProgram, new_weight_module: ExportedProgram, arg_inputs: Optional[Tuple[Any, ...]] = None, kwarg_inputs: Optional[dict[str, Any]] = None, verify_output: bool = False, use_weight_map_cache: bool = True, in_place: bool = False) → GraphModule[source]

Refit a compiled graph module with ExportedProgram. This performs weight updates in compiled_module without recompiling the engine.

Parameters

compiled_module – Compiled TensorRT module that needs to be refitted. It should have been compiled by torch_tensorrt.dynamo.compile or loaded from disk using trt.load.
new_weight_module – exported program with the updated weights. This one should have the same model architecture as the compiled module.
arg_inputs – Sample arg inputs. Optional; needed if the output is verified
kwarg_inputs – Sample kwarg inputs. Optional; needed if the output is verified
verify_output – whether to verify output of refitted module

Returns

A new compiled TensorRT module that has the updated weights.
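
A refit sketch, assuming torch and torch_tensorrt are installed with a CUDA GPU. The helper name refit_with_new_weights is hypothetical. Since immutable_weights=True builds non-refittable engines, the sketch assumes the original module was compiled with immutable_weights=False.

```python
from typing import Any, Dict, Sequence

# Setting assumed for the original compile() call so the engine is refittable.
REFIT_COMPILE_KWARGS: Dict[str, Any] = {"immutable_weights": False}


def refit_with_new_weights(
    compiled_module: Any, new_model: Any, example_inputs: Sequence[Any]
) -> Any:
    """Update the weights of a compiled module without rebuilding the engine."""
    # Imported lazily: both packages and a CUDA GPU are assumed to be available.
    import torch
    import torch_tensorrt

    # The new weights are supplied as an ExportedProgram with the same architecture.
    new_weight_module = torch.export.export(new_model, tuple(example_inputs))
    return torch_tensorrt.dynamo.refit_module_weights(
        compiled_module=compiled_module,
        new_weight_module=new_weight_module,
        arg_inputs=tuple(example_inputs),  # needed because verify_output=True
        verify_output=True,
    )
```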

Classes

class torch_tensorrt.dynamo.CompilationSettings(enabled_precisions: ~typing.Set[~torch_tensorrt._enums.dtype] = <factory>, debug: bool = False, workspace_size: int = 0, min_block_size: int = 5, torch_executed_ops: ~typing.Collection[~typing.Union[~typing.Callable[[...], ~typing.Any], str]] = <factory>, pass_through_build_failures: bool = False, max_aux_streams: ~typing.Optional[int] = None, version_compatible: bool = False, optimization_level: ~typing.Optional[int] = None, use_python_runtime: ~typing.Optional[bool] = False, truncate_double: bool = False, use_fast_partitioner: bool = True, enable_experimental_decompositions: bool = False, device: ~torch_tensorrt._Device.Device = <factory>, require_full_compilation: bool = False, disable_tf32: bool = False, assume_dynamic_shape_support: bool = False, sparse_weights: bool = False, engine_capability: ~torch_tensorrt._enums.EngineCapability = <factory>, num_avg_timing_iters: int = 1, dla_sram_size: int = 1048576, dla_local_dram_size: int = 1073741824, dla_global_dram_size: int = 536870912, dryrun: ~typing.Union[bool, str] = False, hardware_compatible: bool = False, timing_cache_path: str = '/tmp/torch_tensorrt_engine_cache/timing_cache.bin', lazy_engine_init: bool = False, cache_built_engines: bool = False, reuse_cached_engines: bool = False, use_explicit_typing: bool = False, use_fp32_acc: bool = False, refit_identical_engine_weights: bool = False, strip_engine_weights: bool = False, immutable_weights: bool = True, enable_weight_streaming: bool = False, enable_cross_compile_for_windows: bool = False, tiling_optimization_level: str = 'none', l2_limit_for_tiling: int = -1, use_distributed_mode_trace: bool = False)[source]

Compilation settings for Torch-TensorRT Dynamo Paths

Parameters

enabled_precisions (Set[dtype]) – Available kernel dtype precisions
debug (bool) – Whether to print out verbose debugging information
workspace_size (int) – Workspace TRT is allowed to use for the module (0 is default)
min_block_size (int) – Minimum number of operators per TRT-Engine Block
torch_executed_ops (Collection[Target]) – Collection of operations to run in Torch, regardless of converter coverage
pass_through_build_failures (bool) – Whether to fail on TRT engine build errors (True) or not (False)
max_aux_streams (Optional[int]) – Maximum number of allowed auxiliary TRT streams for each engine
version_compatible (bool) – Provide version forward-compatibility for engine plan files
optimization_level (Optional[int]) – Builder optimization level (0-5). Higher levels imply longer build times because the builder searches more optimization options. TRT defaults to 3
use_python_runtime (Optional[bool]) – Whether to strictly use Python runtime or C++ runtime. To auto-select a runtime based on C++ dependency presence (preferentially choosing C++ runtime if available), leave the argument as None
truncate_double (bool) – Whether to truncate float64 TRT engine inputs or weights to float32
use_fast_partitioner (bool) – Whether to use the fast or global graph partitioning system
enable_experimental_decompositions (bool) – Whether to enable all core aten decompositions or only a selected subset of them
device (Device) – GPU to compile the model on
require_full_compilation (bool) – Whether to require that the graph be fully compiled in TensorRT. Only applicable for ir="dynamo"; has no effect for the torch.compile path
assume_dynamic_shape_support (bool) – Setting this to True enables the converters to work for both dynamic and static shapes. Default: False
disable_tf32 (bool) – Whether to disable TF32 computation for TRT layers
sparse_weights (bool) – Whether to allow the builder to use sparse weights
engine_capability (trt.EngineCapability) – Restrict kernel selection to safe GPU kernels or safe DLA kernels
num_avg_timing_iters (int) – Number of averaging timing iterations used to select kernels
dla_sram_size (int) – Fast software managed RAM used by DLA to communicate within a layer.
dla_local_dram_size (int) – Host RAM used by DLA to share intermediate tensor data across operations
dla_global_dram_size (int) – Host RAM used by DLA to store weights and metadata for execution
dryrun (Union[bool, str]) – Toggle “Dryrun” mode, which runs everything through partitioning, short of conversion to TRT Engines. Prints detailed logs of the graph structure and nature of partitioning. Optionally saves the output to a file if a string path is specified
hardware_compatible (bool) – Build the TensorRT engines compatible with GPU architectures other than that of the GPU on which the engine was built (currently works for NVIDIA Ampere and newer)
timing_cache_path (str) – Path to the timing cache if it exists (or) where it will be saved after compilation
cache_built_engines (bool) – Whether to save the compiled TRT engines to storage
reuse_cached_engines (bool) – Whether to load the compiled TRT engines from storage
use_explicit_typing (bool) – Enables strong typing in TensorRT compilation, which respects the precisions set in the PyTorch model. This is useful for mixed-precision graphs.
use_fp32_acc (bool) – Inserts casts to FP32 around matmul layers so that TensorRT accumulates matmul results in FP32. Use this only when FP16 precision is configured in enabled_precisions.
refit_identical_engine_weights (bool) – Whether to refit the engine with identical weights
strip_engine_weights (bool) – Whether to strip the engine weights
immutable_weights (bool) – Build non-refittable engines. This is useful for some layers that are not refittable. If this argument is set to true, strip_engine_weights and refit_identical_engine_weights will be ignored
enable_weight_streaming (bool) – Enable weight streaming.
enable_cross_compile_for_windows (bool) – Defaults to False, meaning TensorRT engines can only be executed on the same platform where they were built. Setting it to True enables cross-platform compatibility, which allows an engine built on Linux to run on Windows
tiling_optimization_level (str) – The optimization level of tiling strategies. A higher level allows TensorRT to spend more time searching for a better tiling strategy. Supported values are "none", "fast", "moderate", and "full".
l2_limit_for_tiling (int) – The target L2 cache usage limit (in bytes) for tiling optimization (default is -1, which means no limit).
use_distributed_mode_trace (bool) – Whether to use aot_autograd to trace the graph. This is enabled when DTensors or distributed tensors are present in a distributed model
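
Several of the parameters above constrain one another or accept only a fixed set of values. The sketch below collects a few of those rules into a pre-flight check over a plain dictionary of keyword arguments. The function `check_settings` and the constant `TILING_LEVELS` are hypothetical names introduced for illustration; they are not part of torch_tensorrt, and the library may perform its own validation differently.

```python
from typing import Any, Dict, List

# Allowed values as listed in the tiling_optimization_level description.
TILING_LEVELS = ("none", "fast", "moderate", "full")


def check_settings(settings: Dict[str, Any]) -> List[str]:
    """Hypothetical pre-flight check; returns a list of human-readable problems."""
    problems: List[str] = []

    # immutable_weights defaults to True and overrides the two refit-related flags.
    if settings.get("immutable_weights", True):
        for flag in ("strip_engine_weights", "refit_identical_engine_weights"):
            if settings.get(flag, False):
                problems.append(f"{flag} is ignored while immutable_weights=True")

    # optimization_level is documented as 0-5; None leaves the TensorRT default.
    level = settings.get("optimization_level")
    if level is not None and not 0 <= level <= 5:
        problems.append(f"optimization_level must be in 0-5, got {level}")

    tiling = settings.get("tiling_optimization_level", "none")
    if tiling not in TILING_LEVELS:
        problems.append(f"unknown tiling_optimization_level: {tiling!r}")

    return problems


candidate = {
    "min_block_size": 5,
    "immutable_weights": True,
    "strip_engine_weights": True,
    "optimization_level": 7,
    "tiling_optimization_level": "fast",
}
for problem in check_settings(candidate):
    print(problem)
```

Because every key in `candidate` is a field named in the class signature, a dictionary that passes the check could plausibly be unpacked as `CompilationSettings(**candidate)`; that step is left out here so the sketch runs without TensorRT installed.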
