Distributed RPC Framework

The distributed RPC framework supports multi-machine model training in two layers: a set of primitives for remote communication, and a higher-level API that automatically differentiates models split across several machines.
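As a rough illustration of the remote-communication primitives, the sketch below runs a blocking and a non-blocking call through torch.distributed.rpc. It assumes a single process acting as the only worker, so the "remote" call targets the calling process itself. The worker name worker0, the loopback address, and the helper pick_free_port are illustrative assumptions, not part of the framework.

```python
import os
import socket

import torch
import torch.distributed.rpc as rpc


def pick_free_port():
    # Hypothetical helper: ask the OS for an unused local TCP port.
    with socket.socket() as sock:
        sock.bind(("127.0.0.1", 0))
        return str(sock.getsockname()[1])


def run_rpc_demo():
    # Rendezvous settings read by init_rpc; the loopback address is an
    # assumption that only makes sense for this one-process demo.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", pick_free_port())

    # A world of one worker: rank 0 is both the caller and the callee.
    rpc.init_rpc("worker0", rank=0, world_size=1)
    try:
        # Blocking call: run torch.add on the named worker, return its result.
        sync_result = rpc.rpc_sync("worker0", torch.add, args=(torch.ones(2), 3))
        # Non-blocking call: returns a future; wait() blocks until it is done.
        future = rpc.rpc_async("worker0", torch.add, args=(torch.ones(2), 3))
        async_result = future.wait()
    finally:
        # Tear down the RPC agent once all outstanding work has finished.
        rpc.shutdown()
    return sync_result, async_result


sync_result, async_result = run_rpc_demo()
print(sync_result, async_result)
```

In a real multi-machine job, each process would call init_rpc with its own name and rank, a shared world_size, and the same MASTER_ADDR and MASTER_PORT, and the first argument to rpc_sync or rpc_async would name a different worker.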

Design Notes

The distributed autograd design note covers the design of the RPC-based distributed autograd framework, which is useful for applications such as model-parallel training.

The RRef design note covers the design of the RRef (Remote REFerence) protocol, which the framework uses to refer to values on remote workers.
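The following sketch shows how an RRef appears in user code: rpc.remote starts a computation on a worker and immediately returns a reference to the eventual result, and to_here() fetches the value when it is needed. This is a single-process demo under assumed settings (worker0, the loopback address, and the helper pick_free_port are hypothetical), so the owner of the RRef is the calling process itself.

```python
import os
import socket

import torch
import torch.distributed.rpc as rpc


def pick_free_port():
    # Hypothetical helper: ask the OS for an unused local TCP port.
    with socket.socket() as sock:
        sock.bind(("127.0.0.1", 0))
        return str(sock.getsockname()[1])


def run_rref_demo():
    # Loopback rendezvous settings, assumed for a one-process demo only.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", pick_free_port())

    rpc.init_rpc("worker0", rank=0, world_size=1)
    try:
        # remote() returns at once with a reference to the result,
        # which lives on the owner worker.
        rref = rpc.remote("worker0", torch.add, args=(torch.ones(2), 1))
        owner_name = rref.owner_name()
        # to_here() blocks until the value is ready and returns it to the caller.
        value = rref.to_here()
        # Drop the local handle before shutting down.
        del rref
    finally:
        rpc.shutdown()
    return owner_name, value


owner_name, value = run_rref_demo()
print(owner_name, value)
```

With more than one worker, the RRef itself can be passed as an argument to further RPCs, so the value can stay on its owner until some worker actually needs it.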


The RPC tutorial introduces users to the RPC framework and provides two example applications using torch.distributed.rpc APIs.


