Quickstart
pip install torchelastic
# start a single-node etcd server on ONE host
etcd --enable-v2 \
     --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
     --advertise-client-urls PUBLIC_HOSTNAME:2379
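To sanity-check that the etcd server is reachable before launching a job, you can query its HTTP version endpoint (a generic etcd smoke test, not something torchelastic requires):

curl http://ETCD_HOST:2379/version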
To launch a fault-tolerant job, run the following on all nodes.
python -m torchelastic.distributed.launch \
       --nnodes=NUM_NODES \
       --nproc_per_node=TRAINERS_PER_NODE \
       --rdzv_id=JOB_ID \
       --rdzv_backend=etcd \
       --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
       YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
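The launcher spawns TRAINERS_PER_NODE worker processes per node and passes the rendezvous results to each worker through environment variables such as LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT. A minimal training script, sketched under the assumption that those variables are set, can therefore initialize the process group with the env:// defaults:

# YOUR_TRAINING_SCRIPT.py -- minimal sketch; assumes the launcher has
# exported LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
import os

import torch.distributed as dist

def main():
    # init_method defaults to env://, which reads the variables above
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU hosts
    local_rank = int(os.environ["LOCAL_RANK"])
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} (local_rank={local_rank})")
    # ... build your model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()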
To launch an elastic job, run the following on at least MIN_SIZE nodes and at most MAX_SIZE nodes.
python -m torchelastic.distributed.launch \
       --nnodes=MIN_SIZE:MAX_SIZE \
       --nproc_per_node=TRAINERS_PER_NODE \
       --rdzv_id=JOB_ID \
       --rdzv_backend=etcd \
       --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
       YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
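When node membership changes in an elastic job, all workers are restarted, so the training script should periodically save its state and resume from the latest checkpoint on every (re)start. The helpers below are a minimal sketch, assuming a path on shared storage; the path and function names are illustrative and not part of torchelastic:

import os

import torch

CHECKPOINT_PATH = "/mnt/shared/checkpoint.pt"  # assumed shared filesystem

def save_checkpoint(model, optimizer, epoch):
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CHECKPOINT_PATH,
    )

def load_checkpoint(model, optimizer):
    # returns the epoch to resume from (0 on a fresh start)
    if not os.path.exists(CHECKPOINT_PATH):
        return 0
    state = torch.load(CHECKPOINT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1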
Note
The --standalone option can be passed to launch a single-node job with a sidecar rendezvous server. You don't have to pass --rdzv_id, --rdzv_endpoint, and --rdzv_backend when the --standalone option is used.
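For example, a single-node job could then be launched as follows (a sketch; the rdzv flags are dropped as described above):

python -m torchelastic.distributed.launch \
       --standalone \
       --nnodes=1 \
       --nproc_per_node=TRAINERS_PER_NODE \
       YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)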
Note
Learn more about writing your distributed training script here.
If torchelastic.distributed.launch does not meet your requirements, you may use our APIs directly for more powerful customization. Start by taking a look at the elastic agent API.
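For instance, the launcher's core behavior can be reproduced by constructing a worker spec and running a local elastic agent yourself. The sketch below is illustrative only: the exact WorkerSpec fields and the rendezvous-handler construction vary by version, so treat every name here as an assumption and consult the elastic agent API reference.

# Rough sketch of driving the elastic agent directly. Field names are
# assumptions modeled on the elastic agent API docs; check the reference
# for exact signatures before relying on this.
from torchelastic.agent.server.api import WorkerSpec
from torchelastic.agent.server.local_elastic_agent import LocalElasticAgent

def trainer(arg):
    # the function each worker process runs
    print(f"do train: {arg}")

def main(rdzv_handler):
    # rdzv_handler: a rendezvous handler (e.g. etcd-backed) built via the
    # rendezvous APIs -- construction elided here.
    spec = WorkerSpec(
        role="trainer",
        local_world_size=4,   # trainers per node
        entrypoint=trainer,
        args=("foobar",),
        rdzv_handler=rdzv_handler,
        max_restarts=3,
    )
    agent = LocalElasticAgent(spec, start_method="spawn")
    result = agent.run()
    if result.is_failed():
        print("trainer failed")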