Quickstart
pip install torchelastic
# start a single-node etcd server on ONE host
etcd --enable-v2 \
     --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
     --advertise-client-urls PUBLIC_HOSTNAME:2379
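To sanity-check that the etcd server is reachable before launching a job, you can query its HTTP version endpoint (a generic etcd smoke test, not something torchelastic requires):

curl http://ETCD_HOST:2379/version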
To launch a fault-tolerant job, run the following on all nodes.
python -m torchelastic.distributed.launch \
       --nnodes=NUM_NODES \
       --nproc_per_node=TRAINERS_PER_NODE \
       --rdzv_id=JOB_ID \
       --rdzv_backend=etcd \
       --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
       YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
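The launcher spawns TRAINERS_PER_NODE worker processes per node and passes the rendezvous results to each worker through environment variables such as LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. A minimal training script, sketched under the assumption that those variables are set, can therefore initialize the process group with the env:// defaults:

# YOUR_TRAINING_SCRIPT.py -- minimal sketch; assumes the launcher has
# exported LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
import os

import torch.distributed as dist

def main():
    # init_method defaults to env://, which reads the variables above
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU hosts
    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} (local_rank={local_rank})")
    # ... build your model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()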
To launch an elastic job, run the following on at least MIN_SIZE nodes and at most MAX_SIZE nodes.
python -m torchelastic.distributed.launch \
       --nnodes=MIN_SIZE:MAX_SIZE \
       --nproc_per_node=TRAINERS_PER_NODE \
       --rdzv_id=JOB_ID \
       --rdzv_backend=etcd \
       --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
       YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
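When node membership changes in an elastic job, all workers are restarted, so the training script should periodically save its state and resume from the latest checkpoint on every (re)start. The helpers below are a minimal sketch, assuming a path on shared storage; the path and function names are illustrative and not part of torchelastic:

import os

import torch

CHECKPOINT_PATH = "/mnt/shared/checkpoint.pt"  # assumed shared filesystem

def save_checkpoint(model, optimizer, epoch):
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CHECKPOINT_PATH,
    )

def load_checkpoint(model, optimizer):
    # returns the epoch to resume from (0 on a fresh start)
    if not os.path.exists(CHECKPOINT_PATH):
        return 0
    state = torch.load(CHECKPOINT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1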
Note
The --standalone option can be passed to launch a single-node job with a sidecar rendezvous server. You don't have to pass --rdzv_id, --rdzv_endpoint, and --rdzv_backend when the --standalone option is used.
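For example, a single-node job could then be launched as follows (a sketch; the rdzv flags are dropped as described above):

python -m torchelastic.distributed.launch \
       --standalone \
       --nnodes=1 \
       --nproc_per_node=TRAINERS_PER_NODE \
       YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)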
Note
Learn more about writing your distributed training script here.
If torchelastic.distributed.launch does not meet your requirements, you may use our APIs directly for more powerful customization. Start by taking a look at the elastic agent API.
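For instance, the launcher's core behavior can be reproduced by constructing a worker spec and running a local elastic agent yourself. The sketch below is illustrative only: the exact WorkerSpec fields and the rendezvous-handler construction vary by version, so treat every name here as an assumption and consult the elastic agent API reference.

# Rough sketch of driving the elastic agent directly. Field names are
# assumptions modeled on the elastic agent API docs; check the reference
# for exact signatures before relying on this.
from torchelastic.agent.server.api import WorkerSpec
from torchelastic.agent.server.local_elastic_agent import LocalElasticAgent

def trainer(arg):
    # the function each worker process runs
    print(f"do train: {arg}")

def main(rdzv_handler):
    # rdzv_handler: a rendezvous handler (e.g. etcd-backed) built via the
    # rendezvous APIs -- construction elided here.
    spec = WorkerSpec(
        role="trainer",
        local_world_size=4,   # trainers per node
        entrypoint=trainer,
        args=("foobar",),
        rdzv_handler=rdzv_handler,
        max_restarts=3,
    )
    agent = LocalElasticAgent(spec, start_method="spawn")
    result = agent.run()
    if result.is_failed():
        print("trainer failed")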