The world of AI is expanding beyond the cloud, reaching devices that fit in the palm of your hand. Running PyTorch models on these tiny systems, where memory is measured in kilobytes, requires a new way of thinking. That’s where ExecuTorch, the lightweight runtime for edge inference, bridges the gap between familiar PyTorch workflows and low-power Arm-based microcontrollers, using optimizations such as quantization and graph compilation to make models efficient enough for the edge.
I recently built a Tiny Rock-Paper-Scissors (RPS) demo using PyTorch and ExecuTorch on the Arm Corstone-320 platform. The goal: take a small Convolutional Neural Network (CNN) trained in PyTorch and deploy it all the way to a simulated Arm microcontroller with an Arm Ethos-U NPU (via the Arm Fixed Virtual Platform (FVP)). Here’s what that journey looks like, and why it matters for anyone building at the edge.
Why PyTorch at the Edge?
PyTorch makes model experimentation fast and intuitive, but moving from the flexibility of dynamic graphs to the rigid constraints of embedded hardware isn’t trivial. Most microcontrollers have less than 1 MB of RAM and no operating system, so traditional Python inference is off the table.
ExecuTorch solves this by compiling PyTorch models into a compact, portable format (`.pte`) that runs on devices with minimal compute, power, and memory. During this process, weights and activations are quantized from floating-point to lower-precision integer formats (typically int8), dramatically reducing both memory footprint and compute costs while maintaining model accuracy. The computation graph is also flattened, fused, and optimized, removing redundant operations and enabling smooth execution at the edge. It extends the PyTorch ecosystem all the way down to the smallest Arm Cortex-M and Ethos-U-based systems.
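To make the int8 idea concrete, here is a minimal sketch of affine (scale and zero-point) quantization, the general scheme behind int8 weights and activations. The numbers are purely illustrative and not what ExecuTorch actually emits:

```python
def quantize(x: float, scale: float, zero_point: int) -> int:
    """Map a float to int8 using an affine scale/zero-point scheme."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the int8 range


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover an approximate float from the int8 value."""
    return (q - zero_point) * scale


# Illustrative example: represent values in roughly [-1, 1]
scale, zp = 2.0 / 255, 0
q = quantize(0.5, scale, zp)
x = dequantize(q, scale, zp)
print(q, round(x, 3))  # prints "64 0.502" -- the round trip loses a little precision
```

Each float costs 4 bytes; its int8 stand-in costs 1, which is where the roughly 4x memory saving comes from, at the price of the small rounding error shown above.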
From PyTorch to the Micro-Edge
The good news: I have built a detailed Learning Path that guides you through this end-to-end TinyML EdgeAI pipeline.
The Tiny RPS Game
The course’s centerpiece is the Tiny RPS game. It’s a fun and approachable way to learn about TinyML, while showing that PyTorch workflows can scale down just as easily as they scale up. It is a minimal but complete AI workflow that:
- Generates its own dataset.
- Trains a CNN in PyTorch.
- Exports it via ExecuTorch.
- Deploys it to the FVP, no need for physical hardware.
All you need is an x86 Linux host machine or VM running Ubuntu 22.04 or later.
The Pipeline
- Model Training in PyTorch
We define and train a compact CNN to classify synthetic images of “rock,” “paper,” and “scissors.” Each class is rendered as a noisy 28×28 grayscale image of its first letter (“R”, “P”, or “S”) to simulate data variation. (See Learning Path for detailed script)
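The Learning Path has the full renderer; as a rough illustration of the idea, here is a sketch that upsamples a hand-made binary stencil (standing in for a rendered letter) to 28×28 and adds Gaussian pixel noise. The stencil and function names are assumptions, not the course’s actual code:

```python
import numpy as np

# A crude 7x7 stencil standing in for a rendered letter "R" (illustrative only).
R_STENCIL = np.array([
    [1, 1, 1, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0],
], dtype=np.float32)


def make_sample(stencil, noise=0.2, rng=None):
    """Upsample a 7x7 stencil to 28x28 and add Gaussian pixel noise."""
    if rng is None:
        rng = np.random.default_rng()
    img = np.kron(stencil, np.ones((4, 4), dtype=np.float32))  # 7x7 -> 28x28
    img += rng.normal(0.0, noise, size=img.shape).astype(np.float32)
    return np.clip(img, 0.0, 1.0)  # keep pixels in [0, 1]


sample = make_sample(R_STENCIL)
print(sample.shape)  # prints "(28, 28)"
```

Because every sample is perturbed independently, the network sees endless cheap variation without any real-world data collection.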
```python
import torch
import torch.nn as nn


class TinyRPS(nn.Module):
    """
    Simple ConvNet:
    [B,1,28,28] -> Conv3x3(16) -> ReLU -> Conv3x3(32) -> ReLU -> MaxPool2d(2)
    -> Conv3x3(64) -> ReLU -> MaxPool2d(2) -> flatten
    -> Linear(128) -> ReLU -> Linear(3)
    """

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 3),
        )

    def forward(self, x):
        x = self.body(x)
        x = self.head(x)
        return x
```
This architecture is compact and Ethos-friendly, ideal for deployment to the micro-edge. Training uses Adam with a small synthetic dataset and achieves over 95% validation accuracy after a few epochs.
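A minimal training loop under those assumptions (Adam, cross-entropy, and a synthetic dataset small enough to hold in memory) might look like the sketch below; the function and variable names are illustrative, not the course’s actual script. For simplicity it takes one full-batch step per epoch, where a real script would iterate over mini-batches:

```python
import torch
import torch.nn as nn


def train(model: nn.Module, images: torch.Tensor, labels: torch.Tensor,
          epochs: int = 5, lr: float = 1e-3) -> None:
    """Train on an in-memory synthetic dataset with Adam + cross-entropy."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(images), labels)  # logits vs. class indices 0..2
        loss.backward()
        opt.step()
        print(f"epoch {epoch}: loss={loss.item():.4f}")
```

On a dataset this small and synthetic, a few epochs are enough for the loss to collapse and validation accuracy to clear 95%.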
- Exporting to ExecuTorch
Once trained, the model is exported to an ExecuTorch `.pte` program. This format is optimized for execution without Python on devices running tiny embedded runtimes. (See Learning Path for detailed script)
```python
import torch
import torch.nn as nn

from executorch import exir
from torch.export import export


def export_to_pte(model: nn.Module, out_path: str, img_size: int) -> None:
    model.eval()
    example = torch.zeros(1, 1, img_size, img_size, dtype=torch.float32)

    # Export with PyTorch’s exporter, lower to the Edge dialect,
    # then serialize as an ExecuTorch program.
    exported = export(model, (example,))
    edge = exir.to_edge(exported)
    prog = edge.to_executorch()

    with open(out_path, "wb") as f:
        f.write(prog.buffer)
    print(f"[export] wrote {out_path}")
```
This step effectively converts your PyTorch computation graph into a static, memory-efficient graph that can run on microcontrollers with minimal overhead.
- Deployment on Arm Corstone-320 FVP
The `.pte` file is deployed on the Arm Corstone-320 FVP, a software simulation of a Cortex-M CPU paired with an Ethos-U microNPU. This allows developers to run and validate their model locally before flashing it to real hardware. The RPS game lets you play interactively in the terminal, demonstrating real-time on-device inference.
Lessons Learned
Working on this demo revealed that PyTorch’s flexibility doesn’t have to stop at the data center. ExecuTorch makes it possible to bring the same familiar PyTorch workflow to IoT sensors, wearables, and embedded devices, enabling privacy-preserving, low-power AI anywhere.
Edge AI may be small in size, but it’s huge in potential.
Try It Yourself
Learning Path: Edge AI with PyTorch & ExecuTorch – Tiny RPS on Arm

Target Audience: ML developers and embedded engineers with basic PyTorch experience.

Prerequisite: Introduction to TinyML on Arm
Acknowledgements
This learning path was a collaborative effort, and I owe a special thanks to the team that helped bring this course to life, including the valuable contributions of Annie Tallund, Zingo Andersen, George Gekov, Gemma Paris, Adrian Lundell, Madeline Underwood, Mary Bennion, and Fredrik Knutsson.