Quickstart

To launch a fault-tolerant job, run the following on all nodes.

torchrun \
    --nnodes=NUM_NODES \
    --nproc_per_node=TRAINERS_PER_NODE \
    --rdzv_id=JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
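
For example, on a hypothetical cluster of 4 nodes with 8 workers each (the node count, worker count, job id, endpoint, script name, and script arguments below are all placeholder values), every node would run:

torchrun \
    --nnodes=4 \
    --nproc_per_node=8 \
    --rdzv_id=456 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=node1.example.com:29400 \
    train.py --batch-size 32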

To launch an elastic job, run the following on at least MIN_SIZE nodes and at most MAX_SIZE nodes.

torchrun \
    --nnodes=MIN_SIZE:MAX_SIZE \
    --nproc_per_node=TRAINERS_PER_NODE \
    --rdzv_id=JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=HOST_NODE_ADDR \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
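
For example, a hypothetical elastic job that tolerates membership changes between 1 and 4 nodes (again, all concrete values below are placeholders) would be launched on each participating node as:

torchrun \
    --nnodes=1:4 \
    --nproc_per_node=8 \
    --rdzv_id=456 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=node1.example.com:29400 \
    train.py --batch-size 32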

HOST_NODE_ADDR, in the form <host>[:<port>] (e.g. node1.example.com:29400), specifies the node and port on which the C10d rendezvous backend should be instantiated and hosted. It can be any node in your training cluster, but ideally you should pick a node with high bandwidth.

Note

If no port number is specified in HOST_NODE_ADDR, the port defaults to 29400.
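
In other words, the following two endpoint values are equivalent:

--rdzv_endpoint=node1.example.com
--rdzv_endpoint=node1.example.com:29400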

Note

The --standalone option can be passed to launch a single-node job with a sidecar rendezvous backend. You don't have to pass --rdzv_id, --rdzv_endpoint, and --rdzv_backend when the --standalone option is used.
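
A minimal standalone invocation therefore reduces to:

torchrun \
    --standalone \
    --nproc_per_node=TRAINERS_PER_NODE \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)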

Note

Learn more about writing your distributed training script in the torchrun documentation.
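
As a rough illustration, below is a minimal sketch of a script that torchrun could launch. It relies on the environment variables torchrun sets for each worker (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK); the toy linear model, random data, and hyperparameters are stand-ins, not part of any real recipe:

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, so the
    # default env:// initialization needs no explicit arguments.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU clusters

    # LOCAL_RANK is this worker's index on its own node (e.g. which GPU to use).
    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} "
          f"(local rank {local_rank}) initialized")

    model = DDP(torch.nn.Linear(10, 10))  # toy model as a stand-in
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):  # toy training loop over random data
        optimizer.zero_grad()
        loss = model(torch.randn(20, 10)).sum()
        loss.backward()  # DDP synchronizes gradients across workers here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()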

If torchrun does not meet your requirements, you may use our APIs directly for more powerful customization. Start by taking a look at the elastic agent API.
