Quickstart
    pip install torchelastic
    # start a single-node etcd server on ONE host
    etcd --enable-v2 \
         --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
         --advertise-client-urls PUBLIC_HOSTNAME:2379
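Before launching jobs, it can help to confirm that every node can reach the etcd server. A minimal connectivity check in Python (ETCD_HOST:2379 is the same placeholder endpoint used below; /version is etcd's standard HTTP version endpoint):

    # Quick reachability check for the etcd server from a worker node.
    # ETCD_HOST:2379 is a placeholder; substitute your actual endpoint.
    import json
    import urllib.request

    ETCD_ENDPOINT = "http://ETCD_HOST:2379"

    # etcd serves /version over plain HTTP and reports server/cluster versions.
    with urllib.request.urlopen(ETCD_ENDPOINT + "/version") as resp:
        print(json.load(resp))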
To launch a fault-tolerant job, run the following on all nodes.
    python -m torchelastic.distributed.launch \
        --nnodes=NUM_NODES \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
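For reference, here is a minimal sketch of what YOUR_TRAINING_SCRIPT.py could look like. It assumes the launcher exports LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT into each worker's environment, so the process group can be initialized with the env:// method (see the train script docs for details):

    # Minimal skeleton for a script run by torchelastic.distributed.launch.
    import os

    import torch.distributed as dist

    def main():
        # The launcher supplies MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE,
        # so env:// initialization needs no extra arguments.
        dist.init_process_group(backend="gloo", init_method="env://")

        # Local rank comes from the environment (use backend="nccl" above and
        # torch.cuda.set_device(local_rank) for GPU training).
        local_rank = int(os.environ["LOCAL_RANK"])
        print(f"worker {dist.get_rank()}/{dist.get_world_size()} "
              f"(local rank {local_rank}) initialized")

        # ... build the model and optimizer, then run the training loop ...

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()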
To launch an elastic job, run the following on at least MIN_SIZE nodes and at most MAX_SIZE nodes.
    python -m torchelastic.distributed.launch \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
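An elastic job restarts its workers whenever nodes join or leave, so the training script should checkpoint its state regularly and resume from the latest checkpoint on startup. A sketch of that pattern follows; the path and state layout are illustrative choices, not a torchelastic API:

    # Illustrative checkpoint/resume pattern for surviving elastic restarts.
    # CHECKPOINT_PATH and the state layout are hypothetical choices.
    import os

    import torch

    CHECKPOINT_PATH = "/shared/checkpoint.pt"  # must be visible to all nodes

    def load_checkpoint(model, optimizer):
        """Resume from the latest checkpoint if one exists, else start fresh."""
        start_epoch = 0
        if os.path.exists(CHECKPOINT_PATH):
            state = torch.load(CHECKPOINT_PATH, map_location="cpu")
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optimizer"])
            start_epoch = state["epoch"] + 1
        return start_epoch

    def save_checkpoint(model, optimizer, epoch):
        """Write the checkpoint atomically: temp file first, then rename."""
        tmp = CHECKPOINT_PATH + ".tmp"
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            tmp,
        )
        os.replace(tmp, CHECKPOINT_PATH)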
Note
Learn more about writing your distributed training script here: train_script.html.
If torchelastic.distributed.launch does not meet your requirements, you may use our APIs directly for more powerful customization. Start by taking a look at the elastic agent API.
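As a rough illustration only: the module paths, WorkerSpec fields, and rendezvous construction below are assumptions based on torchelastic 0.2 and may differ in your version; consult the elastic agent API docs before relying on them.

    # Hypothetical sketch of running a local elastic agent in-process.
    from torchelastic.agent.server.api import WorkerSpec
    from torchelastic.agent.server.local_elastic_agent import LocalElasticAgent

    def trainer(args):
        # user-defined training function executed by each worker
        pass

    def main():
        # Construct an etcd-backed rendezvous handler here; see
        # torchelastic.rendezvous for the factory available in your version.
        rdzv_handler = ...

        spec = WorkerSpec(
            role="trainer",
            local_world_size=8,        # trainers per node
            fn=trainer,
            args=(),
            rdzv_handler=rdzv_handler,
            max_restarts=3,
            monitor_interval=1,
        )
        agent = LocalElasticAgent(spec, start_method="spawn")
        agent.run(spec.role)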