OpenReg: A Self-Contained PyTorch Out-of-Tree Backend Implementation Using "PrivateUse1" Mechanism

by Zhenbin Lin (Huawei)

OpenReg is a self-contained demonstration of a PyTorch out-of-tree backend implementation utilizing the core framework’s “PrivateUse1” mechanism. This implementation serves two primary purposes:

Reference Implementation: Provides a practical template for third-party device vendors integrating with PyTorch through PrivateUse1.
CI Testing Infrastructure: Enables device-agnostic testing capabilities for continuous integration pipelines.

Usage

Module Installation

cd {project}/test/cpp_extensions/open_registration_extension
python setup.py install

Use Case

import torch
import pytorch_openreg

if __name__ == "__main__":
    print(torch.ones(1, 2, device='openreg'))

Architectural Overview

Process Management

OpenReg implements virtual device isolation by spawning N independent subprocesses, each maintaining dedicated request/response queues for inter-process communication. The parent process driver encapsulates device operations into command packets that are:

Dispatched to target devices via request queues
Processed asynchronously with results returned through response queues

Figure: Parent-Subprocess Communication Flow
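
The round trip described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not OpenReg's actual code: `worker_loop`, `ToyDriver`, and the `add`/`echo` commands are hypothetical names, and threads with `queue.Queue` stand in for the subprocesses and inter-process queues so that the sketch runs in any environment.

```python
import queue
import threading


def worker_loop(req_q, resp_q):
    """Serve command packets until a None sentinel arrives."""
    handlers = {
        "add": lambda a, b: a + b,
        "echo": lambda x: x,
    }
    while True:
        packet = req_q.get()
        if packet is None:  # shutdown sentinel
            break
        cmd, args = packet
        try:
            resp_q.put(("ok", handlers[cmd](*args)))
        except Exception as exc:  # report the failure instead of killing the worker
            resp_q.put(("error", repr(exc)))


class ToyDriver:
    """Parent-side object that owns one request/response queue pair per device."""

    def __init__(self, num_devices):
        self.channels = []
        for _ in range(num_devices):
            req_q, resp_q = queue.Queue(), queue.Queue()
            worker = threading.Thread(
                target=worker_loop, args=(req_q, resp_q), daemon=True
            )
            worker.start()
            self.channels.append((req_q, resp_q, worker))

    def run(self, device, cmd, *args):
        req_q, resp_q, _ = self.channels[device]
        req_q.put((cmd, args))          # dispatch the command packet
        status, payload = resp_q.get()  # wait for the device's answer
        if status == "error":
            raise RuntimeError(payload)
        return payload

    def shutdown(self):
        for req_q, _, worker in self.channels:
            req_q.put(None)
            worker.join()
```

Each device gets its own queue pair, so a command sent to device 0 can never be answered by device 1. A real multi-process version would presumably swap the thread for a subprocess and the queues for their inter-process equivalents.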

Memory Management

Device memory allocations occur within individual subprocesses to ensure:

Strict memory isolation between devices
Realistic simulation of physical device constraints
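
A toy model may help show what per-device isolation means in practice. The `ToyDeviceAllocator` class below is hypothetical and keeps everything in one process; in OpenReg the equivalent bookkeeping would live inside each subprocess. The point it illustrates is that a pointer is only meaningful on the device that produced it.

```python
import itertools


class ToyDeviceAllocator:
    """Per-device allocator: pointers are only valid on the owning device."""

    def __init__(self, device_index):
        self.device_index = device_index
        self._next_ptr = itertools.count(start=1)
        self._blocks = {}  # ptr -> bytearray backing store

    def malloc(self, size):
        ptr = next(self._next_ptr)
        self._blocks[ptr] = bytearray(size)
        return ptr

    def free(self, ptr):
        # Returns False instead of raising when the pointer is unknown here.
        return self._blocks.pop(ptr, None) is not None

    def view(self, ptr):
        if ptr not in self._blocks:
            raise KeyError(
                f"pointer {ptr} was not allocated on device {self.device_index}"
            )
        return self._blocks[ptr]
```

Because each allocator numbers its own blocks, the same pointer value on two devices refers to two unrelated buffers, which mirrors how separate physical devices have separate address spaces.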

Component Breakdown

_aten_impl.py

This module handles dual responsibilities:

Hook Registration:
- Utilizes _IMPL_REGISTRY to bind C++ backend hooks (e.g., getDevice, getStream) to device driver implementations
Fallback Mechanism:
- Defines a new torch.Library that registers a fallback, which is called whenever a backend kernel for PrivateUse1 is invoked. The fallback contains the logic to handle all kinds of native functions: it computes the output metadata, allocates the output, and calls into the device daemon only to perform the computation

_device_daemon.py

Core Subsystems

Allocators:
- HostAllocator: Manages pinned memory in parent process
- DeviceAllocator: Handles device memory with tensor reconstruction capabilities
Driver (Parent Process):
- Maintains device context (active device/streams)
- Implements device control operations:
  - setDevice/getDevice
  - deviceCount
  - exchangeStream
- Orchestrates command execution through queue-based IPC
Executor (Subprocess):
- Processes command types:
  - Memory operations (malloc/free)
  - Tensor computations (run_op)
  - Data transfers (send_data/recv_data)
  - Stream/event management (mostly no-ops, because operators execute synchronously on the CPU)
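
The device control operations listed for the driver amount to a small piece of state tracking. The sketch below is a hypothetical `ToyDeviceContext`, not the OpenReg driver; it assumes integer stream ids and follows the usual exchange semantics of making a stream current and returning the one it replaced.

```python
class ToyDeviceContext:
    """Tracks the active device and one active stream id per device."""

    def __init__(self, num_devices):
        self.num_devices = num_devices
        self.current_device = 0
        self.current_streams = [0] * num_devices  # stream 0 is the default

    def deviceCount(self):
        return self.num_devices

    def getDevice(self):
        return self.current_device

    def setDevice(self, index):
        if not 0 <= index < self.num_devices:
            raise ValueError(f"invalid device index {index}")
        self.current_device = index

    def exchangeStream(self, device, stream_id):
        """Make `stream_id` current on `device` and return the previous one."""
        previous = self.current_streams[device]
        self.current_streams[device] = stream_id
        return previous
```

Returning the previous stream lets a caller restore it later, which is the pattern a stream guard relies on.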

_meta_parser.py

Key Features:

Implements serialization utilities for cross-process object transfer
The OpenRegTensorMeta class encapsulates the complete tensor metadata needed for:
- Output tensor reconstruction
- Device-side computation preparation
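
To make the idea of shipping tensor metadata across processes concrete, here is a minimal sketch. `ToyTensorMeta`, its fields, and the helper functions are assumptions chosen for illustration; the real OpenRegTensorMeta may carry different or additional fields. `pickle` stands in for whatever serialization the queues apply.

```python
import math
import pickle
from dataclasses import dataclass

ITEM_SIZE = {"float32": 4, "float64": 8, "int64": 8}  # illustrative subset


@dataclass(frozen=True)
class ToyTensorMeta:
    """Enough information to rebuild a dense tensor view on the other side."""

    data_ptr: int        # device-side pointer, opaque to the parent process
    shape: tuple
    stride: tuple        # in elements, not bytes
    storage_offset: int
    dtype: str

    @property
    def nbytes(self):
        return math.prod(self.shape) * ITEM_SIZE[self.dtype]


def contiguous_stride(shape):
    """Row-major strides, in elements, for a dense tensor of `shape`."""
    stride, step = [], 1
    for dim in reversed(shape):
        stride.append(step)
        step *= dim
    return tuple(reversed(stride))


def describe(data_ptr, shape, dtype):
    """Build metadata for a contiguous tensor starting at offset 0."""
    return ToyTensorMeta(data_ptr, tuple(shape), contiguous_stride(shape), 0, dtype)


def roundtrip(meta):
    """Serialize and deserialize, as a queue between processes would."""
    return pickle.loads(pickle.dumps(meta))
```

Only the description travels between processes; the tensor data itself stays wherever `data_ptr` points.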

Design Considerations

Execution Characteristics

Synchronous Computation: Operators execute on the CPU, so each computation is processed synchronously
Stream/Event Semantics: Largely implemented as no-ops because of the synchronous execution model
Memory Isolation: Strict per-device memory boundaries enforced through subprocess allocation

This architecture enables realistic simulation of device integration while maintaining PyTorch compatibility through standard backend interfaces.