Lowering a Model as a Delegate

Audience: ML engineers who want to use delegates to accelerate their programs at runtime.

Backend delegation is the entry point through which backends process and execute PyTorch programs. It lets a program benefit from the performance and efficiency of specialized backends and hardware, while still providing PyTorch users with an experience close to that of the PyTorch runtime. The backend delegate is usually provided either by ExecuTorch or by a vendor. To use delegation in your program, call the standard entry point to_backend.

Frontend Interfaces

There are three flows for delegating a program to a backend:

Lower the whole module to a backend. This is good for testing backends and the preprocessing stage.
Lower the whole module to a backend and compose it with another module. This is good for reusing lowered modules exported from other flows.
Lower parts of a module according to a partitioner. This is good for lowering models that include both lowerable and non-lowerable nodes, and is the most streamlined process.

Flow 1: Lowering the whole module

This flow starts from a traced graph module in Edge Dialect representation. To lower it, we call the following function, which returns a LoweredBackendModule (more documentation on this function can be found in the Export API reference):

# defined in backend_api.py
def to_backend(
    backend_id: str,
    edge_program: ExportedProgram,
    compile_spec: List[CompileSpec],
) -> LoweredBackendModule:

Within this function, the backend’s preprocess() function is called. It produces a compiled blob that will be emitted to the flatbuffer binary. The lowered module can be captured directly, or placed back in a parent module to be captured. Eventually the captured module is serialized into the flatbuffer model that the runtime can load.
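
The hand-off described above can be sketched in plain Python. This is only an illustration of the shape of the process, not the ExecuTorch interface: ToyCompileSpec, ToyPreprocessResult, ToyBackend, and toy_to_backend are all hypothetical names, and the "program" is reduced to a list of op names so that the sketch runs without any dependencies.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class ToyCompileSpec:
    # Hypothetical stand-in for a compile spec: a key plus a raw value.
    key: str
    value: bytes


@dataclass
class ToyPreprocessResult:
    # Hypothetical stand-in for the compiled blob a backend produces.
    processed_bytes: bytes


class ToyBackend:
    """Illustrative backend only; not the real ExecuTorch backend interface."""

    @staticmethod
    def preprocess(
        op_names: List[str], compile_specs: List[ToyCompileSpec]
    ) -> ToyPreprocessResult:
        # "Compile" the program by encoding the specs and the op names.
        header = ";".join(f"{s.key}={s.value.decode()}" for s in compile_specs)
        body = "\n".join(op_names)
        return ToyPreprocessResult(f"{header}#{body}".encode("utf-8"))


def toy_to_backend(
    backend: type, op_names: List[str], compile_specs: List[ToyCompileSpec]
) -> Dict[str, Union[str, bytes]]:
    # Ahead of time: call preprocess() once and keep the resulting blob
    # together with the id of the backend that must execute it later.
    result = backend.preprocess(op_names, compile_specs)
    return {"backend_id": backend.__name__, "blob": result.processed_bytes}


lowered = toy_to_backend(
    ToyBackend, ["aten.sin.default"], [ToyCompileSpec("mode", b"fast")]
)
```

The point of the sketch is the division of labor: the backend alone decides what goes into the blob, while the caller only stores it next to a backend id, so that a runtime could later hand the same bytes back to the matching backend.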

The following is an example of this flow:

from executorch.exir.backend.backend_api import to_backend
import executorch.exir as exir
import torch
from torch.export import export
from executorch.exir import to_edge

# The submodule runs in a specific backend. In this example, the `BackendWithCompilerDemo` backend
class LowerableSubModel(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return torch.sin(x)

# Convert the lowerable module to Edge IR Representation
to_be_lowered = LowerableSubModel()
example_input = (torch.ones(1), )
to_be_lowered_exir_submodule = to_edge(export(to_be_lowered, example_input))

# Import the backend implementation
from executorch.exir.backend.test.backend_with_compiler_demo import (
    BackendWithCompilerDemo,
)
lowered_module = to_backend('BackendWithCompilerDemo', to_be_lowered_exir_submodule.exported_program(), [])

We can serialize the program to a flatbuffer format by directly running:

# Save the flatbuffer to a local file
save_path = "delegate.pte"
with open(save_path, "wb") as f:
    f.write(lowered_module.buffer())

Flow 2: Lowering the whole module and composing it

Alternatively, after flow 1, we can compose this lowered module with another module:

# This submodule runs in the executor runtime
class NonLowerableSubModel(torch.nn.Module):
    def __init__(self, bias):
        super().__init__()
        self.bias = bias

    def forward(self, a, b):
        return torch.add(torch.add(a, b), self.bias)


# The composite module, including the lowered part and the non-lowered part
class CompositeModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.non_lowerable = NonLowerableSubModel(torch.ones(1) * 0.3)
        self.lowerable = lowered_module

    def forward(self, x):
        a = self.lowerable(x)
        b = self.lowerable(a)
        ret = self.non_lowerable(a, b)
        return a, b, ret

composite_model = CompositeModel()
model_inputs = (torch.ones(1), )
exec_prog = to_edge(export(composite_model, model_inputs)).to_executorch()

# Save the flatbuffer to a local file
save_path = "delegate.pte"
with open(save_path, "wb") as f:
    f.write(exec_prog.buffer)

Flow 3: Partitioning

The third flow also starts from a traced graph module in Edge Dialect representation. To lower certain nodes in this graph module, we can use the overloaded to_backend function:

def to_backend(
    edge_program: ExportedProgram,
    partitioner: Partitioner,
) -> ExportedProgram:

This function takes a Partitioner, which adds a tag to every node that is meant to be lowered. The partitioner returns a partition_tags dictionary mapping tags to backend names and module compile specs. The tagged nodes are then partitioned and lowered to their mapped backends using Flow 1’s process, and the resulting lowered modules are inserted into the top-level module and serialized. Available helper partitioners are documented here.
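
To make the tagging step concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions rather than the ExecuTorch Partitioner API: the graph is flattened to a list of op names, SUPPORTED_OPS, tag_nodes, and group_by_tag are hypothetical, and consecutive supported ops are assumed to share one tag. The op sequence mirrors the arithmetic in the example that follows.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: the toy backend can execute only these two ops.
SUPPORTED_OPS = {"add", "mul"}

# tag -> (backend name, compile specs); the specs are left empty here.
PartitionTags = Dict[str, Tuple[str, list]]


def tag_nodes(ops: List[str]) -> Tuple[List[Optional[str]], PartitionTags]:
    """Tag each supported op; consecutive supported ops share one tag."""
    tags: List[Optional[str]] = []
    partition_tags: PartitionTags = {}
    current: Optional[str] = None
    for op in ops:
        if op in SUPPORTED_OPS:
            if current is None:
                # Start a new partition and record which backend owns it.
                current = f"tag{len(partition_tags)}"
                partition_tags[current] = ("ToyBackend", [])
            tags.append(current)
        else:
            # An unsupported op ends the current partition and stays untagged.
            current = None
            tags.append(None)
    return tags, partition_tags


def group_by_tag(ops: List[str], tags: List[Optional[str]]) -> Dict[str, List[str]]:
    """Collect the ops of each partition so each can be lowered as a unit."""
    groups: Dict[str, List[str]] = {}
    for op, tag in zip(ops, tags):
        if tag is not None:
            groups.setdefault(tag, []).append(op)
    return groups


ops = ["add", "mul", "sub", "div", "mul", "add"]
tags, partition_tags = tag_nodes(ops)
groups = group_by_tag(ops, tags)
```

In this toy run the leading add and mul form one partition, the trailing mul and add form a second, and sub and div stay untagged for the default runtime. A real graph is a DAG rather than a list, so an actual partitioner has to reason about data dependencies as well as op support.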

The following is an example of the flow:

import executorch.exir as exir
from executorch.exir.backend.backend_api import to_backend
from executorch.exir.backend.test.op_partitioner_demo import AddMulPartitionerDemo
from executorch.exir.program import (
    EdgeProgramManager,
    to_edge,
)
from torch.export import export
import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y):
        x = x + y
        x = x * y
        x = x - y
        x = x / y
        x = x * y
        x = x + y
        return x

model = Model()
model_inputs = (torch.randn(1, 3), torch.randn(1, 3))

core_aten_ep = export(model, model_inputs)
edge: EdgeProgramManager = to_edge(core_aten_ep)
edge = edge.to_backend(AddMulPartitionerDemo())
exec_prog = edge.to_executorch()

# Save the flatbuffer to a local file
save_path = "delegate.pte"
with open(save_path, "wb") as f:
    f.write(exec_prog.buffer)

Runtime

Once the program contains delegates, the backend must be registered before the model can run on it. Depending on the delegate implementation, the backend is registered either during global variable initialization or explicitly inside the main function.

If the backend is registered during global variable initialization, it will be registered as long as its library is statically linked. Users only need to include the library as a dependency.
If the vendor provides an API to register the backend, users need to include the library as a dependency and call the vendor’s API from the main function to register the backend explicitly.
