OpenReg is a self-contained demonstration of a PyTorch out-of-tree backend implementation utilizing the core framework’s “PrivateUse1” mechanism. This implementation serves two primary purposes:
- Reference Implementation: Provides a practical template for third-party device vendors integrating with PyTorch through PrivateUse1.
- CI Testing Infrastructure: Enables device-agnostic testing capabilities for continuous integration pipelines.
Usage
Module Installation
cd {project}/test/cpp_extensions/open_registration_extension
python setup.py install
Use Case
import torch
import pytorch_openreg
if __name__ == "__main__":
    print(torch.ones(1, 2, device='openreg'))
Architectural Overview
Process Management
OpenReg implements virtual device isolation by spawning N independent subprocesses, each maintaining dedicated request/response queues for inter-process communication. The parent process driver encapsulates device operations into command packets that are:
- Dispatched to target devices via request queues
- Processed asynchronously with results returned through response queues
Figure: Parent-Subprocess Communication Flow
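The request/response flow described above can be sketched with the standard library alone. This is a minimal illustration, not OpenReg's actual driver: the names `device_worker` and `Driver`, the `"add"` command, and the packet layout `(cmd, args)` are all assumptions made for the example, and the `fork` start method assumes a POSIX system.

```python
import multiprocessing as mp


def device_worker(req_queue, resp_queue):
    """Runs inside a subprocess: serve command packets until shutdown."""
    while True:
        cmd, args = req_queue.get()
        if cmd == "shutdown":
            resp_queue.put(("ok", None))
            return
        if cmd == "add":  # hypothetical command, for illustration only
            resp_queue.put(("ok", args[0] + args[1]))
        else:
            resp_queue.put(("error", f"unknown command: {cmd}"))


class Driver:
    """Hypothetical parent-side driver: one subprocess and queue pair per device."""

    def __init__(self, num_devices):
        ctx = mp.get_context("fork")  # assumption: POSIX host
        self.devices = []
        for _ in range(num_devices):
            req, resp = ctx.Queue(), ctx.Queue()
            proc = ctx.Process(target=device_worker, args=(req, resp), daemon=True)
            proc.start()
            self.devices.append((proc, req, resp))

    def run(self, device_idx, cmd, *args):
        """Send one command packet to a device and wait for its response."""
        _, req, resp = self.devices[device_idx]
        req.put((cmd, args))
        status, value = resp.get(timeout=30)
        if status != "ok":
            raise RuntimeError(value)
        return value

    def shutdown(self):
        for idx, (proc, _, _) in enumerate(self.devices):
            self.run(idx, "shutdown")
            proc.join()


if __name__ == "__main__":
    driver = Driver(num_devices=2)
    print(driver.run(0, "add", 1, 2))
    print(driver.run(1, "add", 10, 5))
    driver.shutdown()
```

In this sketch the parent blocks on the response queue after every request, which keeps the example short; the subprocess itself runs independently of the parent between packets.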
Memory Management
Device memory allocations occur within individual subprocesses to ensure:
- Strict memory isolation between devices
- Realistic simulation of physical device constraints
Component Breakdown
_aten_impl.py
This module handles dual responsibilities:
- Hook Registration: Uses _IMPL_REGISTRY to bind C++ backend hooks (e.g., getDevice, getStream) to device driver implementations.
- Fallback Mechanism: Defines a new torch.Library that registers a fallback, which is called whenever a backend kernel for PrivateUse1 is invoked. The fallback contains the logic to handle all kinds of native functions: it computes the output metadata, allocates the output, and calls into the device daemon only to perform the computation.
_device_daemon.py
Core Subsystems
- Allocators:
  - HostAllocator: Manages pinned memory in parent process
  - DeviceAllocator: Handles device memory with tensor reconstruction capabilities
- Driver (Parent Process):
  - Maintains device context (active device/streams)
  - Implements device control operations:
    - setDevice/getDevice
    - deviceCount
    - exchangeStream
  - Orchestrates command execution through queue-based IPC
- Executor (Subprocess):
  - Processes command types:
    - Memory operations (malloc/free)
    - Tensor computations (run_op)
    - Data transfers (send_data/recv_data)
    - Stream/event management (primarily no-op due to CPU sync nature)
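The executor's command handling can be sketched as a small class that owns its allocations and dispatches by command name. This is a simplified, single-process illustration under stated assumptions: the command names come from the list above, but the signatures, the integer pointer handles, the byte-buffer storage, and the `"increment"` operator are invented for the example. The transfer direction (`recv_data` copies host bytes into device memory, `send_data` returns device bytes to the host) is also an assumption.

```python
class Executor:
    """Hypothetical device-side executor that owns all device allocations."""

    _COMMANDS = ("malloc", "free", "send_data", "recv_data", "run_op")

    def __init__(self):
        self._allocations = {}  # pointer handle -> bytearray
        self._next_ptr = 1
        # Hypothetical operator table; each op maps one byte to one byte.
        self._ops = {"increment": lambda b: (b + 1) % 256}

    def handle(self, cmd, *args):
        """Dispatch one command packet to the matching method."""
        if cmd not in self._COMMANDS:
            raise ValueError(f"unknown command: {cmd}")
        return getattr(self, cmd)(*args)

    def malloc(self, nbytes):
        ptr = self._next_ptr
        self._next_ptr += 1
        self._allocations[ptr] = bytearray(nbytes)
        return ptr

    def free(self, ptr):
        # Report whether the pointer was actually live.
        return self._allocations.pop(ptr, None) is not None

    def recv_data(self, ptr, payload):
        """Copy host bytes into an existing device buffer."""
        buf = self._allocations[ptr]
        if len(payload) != len(buf):
            raise ValueError("payload size does not match allocation")
        buf[:] = payload

    def send_data(self, ptr):
        """Return a copy of the device buffer for the host."""
        return bytes(self._allocations[ptr])

    def run_op(self, op_name, ptr):
        """Apply an elementwise op in place on a device buffer."""
        op = self._ops[op_name]
        buf = self._allocations[ptr]
        for i, value in enumerate(buf):
            buf[i] = op(value)


if __name__ == "__main__":
    ex = Executor()
    ptr = ex.handle("malloc", 4)
    ex.handle("recv_data", ptr, b"\x01\x02\x03\x04")
    ex.handle("run_op", "increment", ptr)
    print(ex.handle("send_data", ptr))
    ex.handle("free", ptr)
```

Because every buffer lives in the executor's own table, a pointer handle issued by one executor has no meaning in another, which mirrors the per-device memory isolation described earlier.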
_meta_parser.py
Key Features:
- Implements serialization utilities for cross-process object transfer
- OpenRegTensorMeta class encapsulates complete tensor metadata for:
  - Output tensor reconstruction
  - Device-side computation preparation
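To make the idea of shipping tensor metadata across processes concrete, here is a minimal sketch of a metadata record that round-trips through a byte string. It is not the OpenRegTensorMeta class: the `TensorMeta` name, its fields, and the JSON wire format are assumptions chosen to keep the example dependency-free.

```python
import json
from dataclasses import asdict, dataclass
from math import prod


def contiguous_strides(size):
    """Row-major strides (in elements) for a tensor of the given shape."""
    strides = []
    step = 1
    for dim in reversed(size):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


@dataclass(frozen=True)
class TensorMeta:
    """Hypothetical description of a tensor; carries no tensor data itself."""

    data_ptr: int        # device-side pointer handle
    size: tuple
    stride: tuple
    storage_offset: int
    dtype: str
    itemsize: int        # bytes per element

    def nbytes(self):
        return prod(self.size) * self.itemsize

    def to_wire(self):
        """Serialize to bytes suitable for sending through a queue or pipe."""
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_wire(cls, raw):
        fields = json.loads(raw.decode("utf-8"))
        # JSON has no tuple type, so restore tuples explicitly.
        fields["size"] = tuple(fields["size"])
        fields["stride"] = tuple(fields["stride"])
        return cls(**fields)


if __name__ == "__main__":
    meta = TensorMeta(
        data_ptr=1,
        size=(2, 3),
        stride=contiguous_strides((2, 3)),
        storage_offset=0,
        dtype="float32",
        itemsize=4,
    )
    restored = TensorMeta.from_wire(meta.to_wire())
    print(restored == meta, restored.nbytes())
```

The receiving side can rebuild a view over its own memory from the pointer handle, shape, strides, and offset alone, so only this small record has to cross the process boundary for each operand.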
Design Considerations
Execution Characteristics
- Synchronous Computation: Operators execute on the CPU inside each subprocess, so computation is processed synchronously
- Stream/Event Semantics: Implemented as no-ops due to synchronous execution model
- Memory Isolation: Strict per-device memory boundaries enforced through subprocess allocation
This architecture enables realistic simulation of device integration while maintaining PyTorch compatibility through standard backend interfaces.