by Zhenbin Lin (Huawei)

OpenReg is a self-contained demonstration of an out-of-tree PyTorch backend built on the core framework’s “PrivateUse1” mechanism. It serves two purposes:

  1. Reference Implementation: Provides a practical template for third-party device vendors integrating with PyTorch through PrivateUse1.
  2. CI Testing Infrastructure: Enables device-agnostic testing capabilities for continuous integration pipelines.

Usage

Module Installation

cd {project}/test/cpp_extensions/open_registration_extension
python setup.py install

Use Case

import torch
import pytorch_openreg

if __name__ == "__main__":
    print(torch.ones(1, 2, device='openreg'))

Architectural Overview

Process Management

OpenReg implements virtual device isolation by spawning N independent subprocesses, each maintaining dedicated request/response queues for inter-process communication. The parent process driver encapsulates device operations into command packets that are:

  1. Dispatched to target devices via request queues
  2. Processed asynchronously with results returned through response queues
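
The request/response pattern above can be sketched with the standard library alone. This is a minimal illustration, not OpenReg's actual code: the `Command` packet, the `device_loop` and `call` helpers, and the `ping`/`add` handlers are all hypothetical names. A thread stands in for the subprocess so the sketch runs anywhere, and the parent-side helper blocks on each reply for simplicity; the same loop could be handed to `multiprocessing.Process` together with `multiprocessing.Queue` objects.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class Command:
    """Hypothetical command packet: an operation name plus its arguments."""
    name: str
    args: Tuple[Any, ...] = ()


SHUTDOWN = None  # sentinel telling a device loop to exit


def device_loop(device_id, requests, responses):
    """Serve commands from `requests` until the shutdown sentinel arrives."""
    handlers = {
        "ping": lambda: f"device {device_id} alive",
        "add": lambda a, b: a + b,
    }
    while True:
        cmd = requests.get()
        if cmd is SHUTDOWN:
            break
        try:
            responses.put(("ok", handlers[cmd.name](*cmd.args)))
        except Exception as exc:  # report the failure instead of killing the loop
            responses.put(("error", repr(exc)))


def call(requests, responses, name, *args):
    """Parent-side helper: send one command and block until its reply arrives."""
    requests.put(Command(name, args))
    status, payload = responses.get()
    if status != "ok":
        raise RuntimeError(payload)
    return payload


# A thread stands in for the subprocess so the sketch runs anywhere.
req, resp = queue.Queue(), queue.Queue()
worker = threading.Thread(target=device_loop, args=(0, req, resp), daemon=True)
worker.start()
result = call(req, resp, "add", 2, 3)
greeting = call(req, resp, "ping")
req.put(SHUTDOWN)
worker.join()
```

Giving each device its own pair of queues keeps replies from different devices from interleaving, which is one plausible reason for the per-device queue layout described above.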

Figure: Parent-Subprocess Communication Flow

Memory Management

Device memory allocations occur within individual subprocesses to ensure:

  1. Strict memory isolation between devices
  2. Realistic simulation of physical device constraints
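
To make the isolation property concrete, here is a small sketch of an allocation table that would live inside one device subprocess. Everything in it is an assumption made for illustration: the `DeviceMemory` class, the integer handles, and the fixed `capacity_bytes` limit are not taken from OpenReg's source. The point it shows is that a handle issued by one device means nothing to another, and that a device can run out of memory independently of the host.

```python
class DeviceMemory:
    """Hypothetical per-device allocation table living inside one subprocess."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._buffers = {}     # handle -> bytearray backing the allocation
        self._next_handle = 1  # 0 is reserved as a null handle

    def malloc(self, size):
        """Reserve `size` bytes and return an opaque handle for them."""
        if self.used_bytes + size > self.capacity_bytes:
            raise MemoryError(f"out of device memory: requested {size} bytes")
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(size)
        self.used_bytes += size
        return handle

    def free(self, handle):
        """Release an allocation; unknown handles raise KeyError."""
        buf = self._buffers.pop(handle)
        self.used_bytes -= len(buf)

    def write(self, handle, data):
        """Copy `data` to the start of the allocation."""
        buf = self._buffers[handle]
        if len(data) > len(buf):
            raise ValueError("write exceeds allocation size")
        buf[:len(data)] = data

    def read(self, handle, size):
        """Return the first `size` bytes of the allocation."""
        return bytes(self._buffers[handle][:size])


# Two simulated devices, each with its own table and 64-byte budget.
dev0 = DeviceMemory(capacity_bytes=64)
dev1 = DeviceMemory(capacity_bytes=64)
h = dev0.malloc(16)
dev0.write(h, b"abcd")
```

Here `dev1.read(h, 4)` fails with `KeyError` because the handle was issued by `dev0`, and `dev0.malloc(64)` fails with `MemoryError` because only 48 bytes remain.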

Component Breakdown

_aten_impl.py

This module has two responsibilities:

  1. Hook Registration:
    • Utilizes _IMPL_REGISTRY to bind C++ backend hooks (e.g., getDevice, getStream) to device driver implementations
  2. Fallback Mechanism:
    • Defines a new torch.Library that registers a fallback, which is called whenever a backend kernel for PrivateUse1 is invoked. The fallback contains the logic to handle all kinds of native functions: it computes the output metadata, allocates the output, and calls into the device daemon only to perform the computation
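
The three steps the fallback performs (compute output metadata, allocate, compute on the device) can be sketched without PyTorch. This is a simplified model under stated assumptions, not the real `_aten_impl.py`: `Meta`, `FakeDaemon`, `infer_output_meta`, and the two toy operators are hypothetical, and real shape inference covers far more than `add` and `mm`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Meta:
    """Hypothetical stand-in for tensor metadata: shape and dtype only."""
    shape: Tuple[int, ...]
    dtype: str


class FakeDaemon:
    """Stand-in for the device daemon; it only records what it was asked to do."""

    def __init__(self):
        self.log = []

    def malloc(self, nbytes):
        self.log.append(("malloc", nbytes))
        return len(self.log)  # fake handle

    def run_op(self, op, inputs, output):
        self.log.append(("run_op", op))


ITEMSIZE = {"float32": 4, "int64": 8}


def infer_output_meta(op, inputs):
    """Toy shape inference for two operators (an assumption for illustration)."""
    a, b = inputs
    if a.dtype != b.dtype:
        raise ValueError("dtype mismatch")
    if op == "add":
        if a.shape != b.shape:
            raise ValueError("shape mismatch")
        return Meta(a.shape, a.dtype)
    if op == "mm":
        if len(a.shape) != 2 or len(b.shape) != 2 or a.shape[1] != b.shape[0]:
            raise ValueError("incompatible matrix shapes")
        return Meta((a.shape[0], b.shape[1]), a.dtype)
    raise NotImplementedError(op)


def fallback(daemon, op, *inputs):
    """Handle any operator the same way: metadata, allocation, then compute."""
    out_meta = infer_output_meta(op, inputs)                   # 1. output metadata
    numel = 1
    for dim in out_meta.shape:
        numel *= dim
    handle = daemon.malloc(numel * ITEMSIZE[out_meta.dtype])   # 2. allocate output
    daemon.run_op(op, inputs, (handle, out_meta))              # 3. compute on device
    return handle, out_meta


daemon = FakeDaemon()
handle, out = fallback(daemon, "mm", Meta((2, 3), "float32"), Meta((3, 4), "float32"))
```

Keeping metadata computation and allocation in the parent means the daemon only has to execute an operator on buffers that already exist, which matches the division of labor described above.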

_device_daemon.py

Core Subsystems

  1. Allocators:
    • HostAllocator: Manages pinned memory in parent process
    • DeviceAllocator: Handles device memory with tensor reconstruction capabilities
  2. Driver (Parent Process):
    • Maintains device context (active device/streams)
    • Implements device control operations:
      • setDevice/getDevice
      • deviceCount
      • exchangeStream
    • Orchestrates command execution through queue-based IPC
  3. Executor (Subprocess):
    • Processes command types:
      • Memory operations (malloc/free)
      • Tensor computations (run_op)
      • Data transfers (send_data/recv_data)
      • Stream/event management (mostly no-ops, because the CPU-backed operators run synchronously)
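
The device context kept by the Driver can be pictured as a small piece of parent-side state. The sketch below is an assumption-laden illustration rather than the real Driver: the `DriverContext` class and its method signatures are hypothetical, and streams are modeled as plain integers with 0 as the default stream. The `exchangeStream` behavior follows the usual convention of installing a new current stream and returning the previous one.

```python
class DriverContext:
    """Hypothetical parent-side record of the active device and streams."""

    def __init__(self, num_devices):
        if num_devices < 1:
            raise ValueError("at least one device is required")
        self._num_devices = num_devices
        self._current_device = 0
        self._current_streams = [0] * num_devices  # stream 0 is the default

    def deviceCount(self):
        return self._num_devices

    def getDevice(self):
        return self._current_device

    def setDevice(self, index):
        if not 0 <= index < self._num_devices:
            raise ValueError(f"invalid device index {index}")
        self._current_device = index

    def exchangeStream(self, device_index, stream_id):
        """Make `stream_id` current on the device and return the previous stream."""
        previous = self._current_streams[device_index]
        self._current_streams[device_index] = stream_id
        return previous


ctx = DriverContext(num_devices=2)
ctx.setDevice(1)
old_stream = ctx.exchangeStream(1, 7)       # default stream 0 was current
restored_from = ctx.exchangeStream(1, old_stream)
```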

_meta_parser.py

Key Features:

  • Implements serialization utilities for cross-process object transfer
  • OpenRegTensorMeta class encapsulates complete tensor metadata for:
    • Output tensor reconstruction
    • Device-side computation preparation
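
As an illustration of what "complete tensor metadata" might contain, the sketch below defines a metadata record and round-trips it through `pickle` as a plain dictionary. The class name `TensorMetaSketch`, its fields, and the helper functions are assumptions chosen for the example; they are not the actual definition of `OpenRegTensorMeta`.

```python
import pickle
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class TensorMetaSketch:
    """Hypothetical metadata record; field names are illustrative only."""
    data_ptr: int                 # opaque device-side handle or address
    shape: Tuple[int, ...]
    stride: Tuple[int, ...]       # in elements, not bytes
    storage_offset: int
    dtype: str
    itemsize: int

    def nbytes_spanned(self):
        """Bytes from the storage start needed to reach every addressed element."""
        if any(dim == 0 for dim in self.shape):
            return 0
        last = self.storage_offset + sum(
            (dim - 1) * step for dim, step in zip(self.shape, self.stride)
        )
        return (last + 1) * self.itemsize


def contiguous_stride(shape):
    """Row-major strides for `shape`, in elements."""
    stride, step = [], 1
    for dim in reversed(shape):
        stride.append(step)
        step *= max(dim, 1)
    return tuple(reversed(stride))


def to_wire(meta):
    """Serialize to bytes using only built-in types, safe to send across processes."""
    return pickle.dumps(asdict(meta))


def from_wire(payload):
    """Rebuild the metadata record on the receiving side."""
    return TensorMetaSketch(**pickle.loads(payload))


meta = TensorMetaSketch(
    data_ptr=1, shape=(2, 3), stride=contiguous_stride((2, 3)),
    storage_offset=0, dtype="float32", itemsize=4,
)
received = from_wire(to_wire(meta))
```

Sending only metadata and a device-side handle, rather than tensor contents, lets the receiving process rebuild a view over memory it already owns.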

Design Considerations

Execution Characteristics

  • Synchronous Computation: Operators execute on the CPU, so each computation runs synchronously
  • Stream/Event Semantics: Implemented as no-ops because of the synchronous execution model
  • Memory Isolation: Strict per-device memory boundaries enforced through subprocess allocation

This architecture enables realistic simulation of device integration while maintaining PyTorch compatibility through standard backend interfaces.