.. _distributed-rpc-framework:

Distributed RPC Framework
=========================

The distributed RPC framework provides mechanisms for multi-machine model
training through a set of primitives to allow for remote communication, and a
higher-level API to automatically differentiate models split across several
machines.

.. warning::
  APIs in the RPC package are stable. There are multiple ongoing work items
  to improve performance and error handling, which will ship in future releases.


Basics
------

The distributed RPC framework makes it easy to run functions remotely, supports
referencing remote objects without copying the real data around, and provides
autograd and optimizer APIs to transparently run backward and update parameters
across RPC boundaries. These features can be categorized into four sets of APIs.

1) **Remote Procedure Call (RPC)** supports running a function on the specified
   destination worker with the given arguments and getting the return value back
   or creating a reference to the return value. There are three main RPC APIs:
   :meth:`~torch.distributed.rpc.rpc_sync` (synchronous),
   :meth:`~torch.distributed.rpc.rpc_async` (asynchronous), and
   :meth:`~torch.distributed.rpc.remote` (asynchronous and returns a reference
   to the remote return value). Use the synchronous API if the user code cannot
   proceed without the return value. Otherwise, use the asynchronous API to get
   a future, and wait on the future when the return value is needed on the
   caller. The :meth:`~torch.distributed.rpc.remote` API is useful when
   something needs to be created remotely but the caller never needs to fetch
   it. Consider the case where a driver process sets up a parameter server and
   a trainer. The driver can create an embedding table on the parameter server
   and then share a reference to the embedding table with the trainer, but it
   never uses the embedding table locally itself. In this case,
   :meth:`~torch.distributed.rpc.rpc_sync` and
   :meth:`~torch.distributed.rpc.rpc_async` are not appropriate, as they always
   imply that the return value will be transferred to the caller, either
   immediately or in the future.
2) **Remote Reference (RRef)** serves as a distributed shared pointer to a local
   or remote object. It can be shared with other workers and reference counting
   will be handled transparently. Each RRef only has one owner and the object
   only lives on that owner. Non-owner workers holding RRefs can get copies of
   the object from the owner by explicitly requesting it. This is useful when
   a worker needs to access some data object but is itself neither the creator
   (the caller of :meth:`~torch.distributed.rpc.remote`) nor the owner of the
   object. The distributed optimizer, as we will discuss below, is one example
   of such a use case.
3) **Distributed Autograd** stitches together the local autograd engines on all
   the workers involved in the forward pass, and automatically reaches out to
   them during the backward pass to compute gradients. This is especially
   helpful when the forward pass spans multiple machines, e.g., in distributed
   model parallel training, parameter-server training, etc. With this feature,
   user code no longer needs to worry about how to send gradients across RPC
   boundaries or in which order the local autograd engines should be launched,
   which can become quite complicated when there are nested and inter-dependent
   RPC calls in the forward pass.
4) **Distributed Optimizer**'s constructor takes a
   :meth:`~torch.optim.Optimizer` (e.g., :meth:`~torch.optim.SGD`,
   :meth:`~torch.optim.Adagrad`, etc.) and a list of parameter RRefs, creates an
   :meth:`~torch.optim.Optimizer` instance on each distinct RRef owner, and
   updates parameters accordingly when running ``step()``. With distributed
   forward and backward passes, parameters and gradients are scattered across
   multiple workers, so an optimizer is required on each of the involved
   workers. The Distributed Optimizer wraps all those local optimizers into
   one, and provides a concise constructor and ``step()`` API. The sketch after
   this list shows how these four API sets fit together.
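
The minimal end-to-end sketch below shows how the four API sets fit together
in a toy two-process setup with one trainer and one parameter server. The
worker names (``"trainer"``, ``"ps"``), the rendezvous settings, and the single
remote tensor standing in for model parameters are illustrative assumptions,
not part of the API.

.. code:: python

    import os

    import torch
    import torch.distributed.autograd as dist_autograd
    import torch.distributed.rpc as rpc
    import torch.multiprocessing as mp
    import torch.optim as optim
    from torch.distributed.optim import DistributedOptimizer


    def run_trainer():
        # RRef: create a toy "parameter" on the parameter server without
        # keeping a local copy on the trainer.
        param_rref = rpc.remote(
            "ps", torch.ones, args=(2,), kwargs={"requires_grad": True}
        )

        with dist_autograd.context() as context_id:
            # RPC: run a function on the parameter server and get the result back.
            bias = rpc.rpc_sync("ps", torch.add, args=(torch.zeros(2), 1))

            # Distributed autograd: the loss depends on the remote parameter,
            # so the backward pass crosses the RPC boundary automatically.
            loss = (param_rref.to_here() * bias).sum()
            dist_autograd.backward(context_id, [loss])

            # Distributed optimizer: creates an optim.SGD on the RRef owner
            # ("ps") and updates the remote parameter there.
            dist_optim = DistributedOptimizer(optim.SGD, [param_rref], lr=0.05)
            dist_optim.step(context_id)


    def run_worker(rank, world_size):
        # Rendezvous settings for a single-machine test; adjust for a real cluster.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        name = "trainer" if rank == 0 else "ps"
        rpc.init_rpc(name, rank=rank, world_size=world_size)
        if rank == 0:
            run_trainer()
        # The parameter server passively serves requests until shutdown.
        rpc.shutdown()


    if __name__ == "__main__":
        mp.spawn(run_worker, args=(2,), nprocs=2, join=True)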


.. _rpc:

RPC
---

Before using RPC and distributed autograd primitives, initialization must take
place. To initialize the framework, call
:meth:`~torch.distributed.rpc.init_rpc`, which sets up RPC, the RRef framework,
and distributed autograd. By default, this also initializes the
``ProcessGroup`` (:meth:`~torch.distributed.init_process_group`) backend for
RPC communication, which internally uses Gloo for communication.
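
For example, a minimal initialization in a two-process setup could look like
the sketch below; the worker names, ranks, and rendezvous address are
illustrative assumptions.

.. code:: python

    import os

    import torch.distributed.rpc as rpc

    # Rendezvous information consumed by the default ProcessGroup backend;
    # these values are placeholders for a single-machine test.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"

    # On the process with rank 0. A second process would call
    # rpc.init_rpc("worker1", rank=1, world_size=2) to join the same group.
    rpc.init_rpc("worker0", rank=0, world_size=2)

    # ... issue RPCs, create RRefs, run distributed autograd ...

    # Block until all outstanding work is done, then tear down the RPC framework.
    rpc.shutdown()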

.. automodule:: torch.distributed.rpc
.. autofunction:: init_rpc

The following APIs allow users to remotely execute functions as well as create
references (RRefs) to remote data objects. In these APIs, when passing a
``Tensor`` as an argument or a return value, the destination worker will try to
create a ``Tensor`` with the same metadata (i.e., shape, stride, etc.). We
intentionally disallow transmitting CUDA tensors because doing so might crash
if the device lists on the source and destination workers do not match. In such
cases, applications can always explicitly move the input tensors to CPU on the
caller and move them to the desired devices on the callee if necessary.
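
For example, a caller holding a CUDA tensor might follow a pattern like the
one below; the worker name and the helper function are hypothetical, and the
function must be importable on both workers.

.. code:: python

    import torch
    import torch.distributed.rpc as rpc


    def scale_on_gpu(x):
        # Hypothetical callee-side helper: move the received CPU tensor to
        # whatever device is available on this worker, compute, and return a
        # CPU tensor.
        device = "cuda:0" if torch.cuda.is_available() else "cpu"
        return (x.to(device) * 2).cpu()


    # Caller side: move the CUDA tensor to CPU before sending it over RPC.
    x = torch.ones(2, device="cuda:0") if torch.cuda.is_available() else torch.ones(2)
    ret = rpc.rpc_sync("worker1", scale_on_gpu, args=(x.cpu(),))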

.. warning::
  TorchScript support in RPC is experimental and subject to change.

.. autofunction:: rpc_sync
.. autofunction:: rpc_async
.. autofunction:: remote
.. autofunction:: get_worker_info
.. autofunction:: shutdown
.. autoclass:: WorkerInfo
    :members:
.. autoclass:: ProcessGroupRpcBackendOptions
    :members:
    :inherited-members:

.. _rref:


RRef
----

An ``RRef`` (Remote REFerence) is a reference to a value of some type ``T``
(e.g. ``Tensor``) on a remote worker. This handle keeps the referenced remote
value alive on the owner, but there is no implication that the value will be
transferred to the local worker in the future. RRefs can be used in
multi-machine training by holding references to `nn.Modules
<https://pytorch.org/docs/stable/nn.html#torch.nn.Module>`_ that exist on
other workers, and calling the appropriate functions to retrieve or modify their
parameters during training. See :ref:`remote-reference-protocol` for more
details.
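
For example, a trainer might hold an RRef to a module that lives on another
worker and fetch its parameters on demand. The worker name, the module, and
the helper function below are illustrative assumptions, and the RPC framework
is assumed to be initialized already.

.. code:: python

    import torch.distributed.rpc as rpc
    from torch import nn


    def get_parameters(module_rref):
        # Hypothetical helper executed on the owner: dereference the RRef and
        # return the parameters of the locally stored module.
        return list(module_rref.local_value().parameters())


    # Create an nn.Linear on "worker1" and keep only a reference to it locally.
    module_rref = rpc.remote("worker1", nn.Linear, args=(8, 4))

    # Ask the owner for (copies of) the module's parameters.
    params = rpc.rpc_sync(module_rref.owner(), get_parameters, args=(module_rref,))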

.. autoclass:: RRef
    :members:


Distributed Autograd Framework
------------------------------

This module provides an RPC-based distributed autograd framework that can be
used for applications such as model parallel training. In short, applications
may send and receive gradient-recording tensors over RPC. In the forward pass,
the framework records when gradient-recording tensors are sent over RPC, and
during the backward pass it uses this information to perform a distributed
backward pass over RPC. For more details see :ref:`distributed-autograd-design`.
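
A minimal sketch of this flow is shown below, assuming the RPC framework has
already been initialized on two workers and this snippet runs on ``"worker0"``;
the worker names are illustrative assumptions.

.. code:: python

    import torch
    import torch.distributed.autograd as dist_autograd
    import torch.distributed.rpc as rpc

    with dist_autograd.context() as context_id:
        t1 = torch.rand((3, 3), requires_grad=True)

        # The forward pass crosses an RPC boundary; the send/recv pair is
        # recorded in the current distributed autograd context.
        t2 = rpc.rpc_sync("worker1", torch.add, args=(t1, t1))
        loss = t2.sum()

        # Kick off the distributed backward pass from the given roots.
        dist_autograd.backward(context_id, [loss])

        # Gradients are accumulated per context rather than in .grad fields.
        grads = dist_autograd.get_gradients(context_id)
        print(grads[t1])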

.. automodule:: torch.distributed.autograd
    :members: context, backward, get_gradients

Distributed Optimizer
---------------------

.. automodule:: torch.distributed.optim
    :members: DistributedOptimizer