Quickstart
    pip install torchelastic
    # start a single-node etcd server on ONE host
    etcd --enable-v2 \
         --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
         --advertise-client-urls PUBLIC_HOSTNAME:2379
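Before launching jobs, it can help to confirm that every node can reach the etcd server. A minimal connectivity check in Python (ETCD_HOST:2379 is the same placeholder endpoint used below; /version is etcd's standard HTTP version endpoint):

    # Quick reachability check for the etcd server from a worker node.
    # ETCD_HOST:2379 is a placeholder; substitute your actual endpoint.
    import json
    import urllib.request

    ETCD_ENDPOINT = "http://ETCD_HOST:2379"

    # etcd serves /version over plain HTTP and reports server/cluster versions.
    with urllib.request.urlopen(ETCD_ENDPOINT + "/version") as resp:
        print(json.load(resp))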
To launch a fault-tolerant job, run the following on all nodes.
    python -m torchelastic.distributed.launch \
        --nnodes=NUM_NODES \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
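For reference, here is a minimal sketch of what YOUR_TRAINING_SCRIPT.py could look like. It assumes the launcher exports LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT into each worker's environment, so the process group can be initialized with the env:// method (see the train script docs for details):

    # Minimal skeleton for a script run by torchelastic.distributed.launch.
    import os

    import torch.distributed as dist

    def main():
        # The launcher supplies MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE,
        # so env:// initialization needs no extra arguments.
        dist.init_process_group(backend="gloo", init_method="env://")

        # Local rank comes from the environment (use backend="nccl" above and
        # torch.cuda.set_device(local_rank) for GPU training).
        local_rank = int(os.environ["LOCAL_RANK"])
        print(f"worker {dist.get_rank()}/{dist.get_world_size()} "
              f"(local rank {local_rank}) initialized")

        # ... build the model and optimizer, then run the training loop ...

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()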
To launch an elastic job, run the following on at least MIN_SIZE nodes and at most MAX_SIZE nodes.
    python -m torchelastic.distributed.launch \
        --nnodes=MIN_SIZE:MAX_SIZE \
        --nproc_per_node=TRAINERS_PER_NODE \
        --rdzv_id=JOB_ID \
        --rdzv_backend=etcd \
        --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
        YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
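An elastic job restarts its workers whenever nodes join or leave, so the training script should checkpoint its state regularly and resume from the latest checkpoint on startup. A sketch of that pattern follows; the path and state layout are illustrative choices, not a torchelastic API:

    # Illustrative checkpoint/resume pattern for surviving elastic restarts.
    # CHECKPOINT_PATH and the state layout are hypothetical choices.
    import os

    import torch

    CHECKPOINT_PATH = "/shared/checkpoint.pt"  # must be visible to all nodes

    def load_checkpoint(model, optimizer):
        """Resume from the latest checkpoint if one exists, else start fresh."""
        start_epoch = 0
        if os.path.exists(CHECKPOINT_PATH):
            state = torch.load(CHECKPOINT_PATH, map_location="cpu")
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optimizer"])
            start_epoch = state["epoch"] + 1
        return start_epoch

    def save_checkpoint(model, optimizer, epoch):
        """Write the checkpoint atomically: temp file first, then rename."""
        tmp = CHECKPOINT_PATH + ".tmp"
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "epoch": epoch},
            tmp,
        )
        os.replace(tmp, CHECKPOINT_PATH)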
Note
Learn more about writing your distributed training script here: train_script.html.
If torchelastic.distributed.launch does not meet your requirements, you may use our APIs directly for more powerful customization. Start by taking a look at the elastic agent API.
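As a rough illustration only: the module paths, WorkerSpec fields, and rendezvous construction below are assumptions based on torchelastic 0.2 and may differ in your version; consult the elastic agent API docs before relying on them.

    # Hypothetical sketch of running a local elastic agent in-process.
    from torchelastic.agent.server.api import WorkerSpec
    from torchelastic.agent.server.local_elastic_agent import LocalElasticAgent

    def trainer(args):
        # user-defined training function executed by each worker
        pass

    def main():
        # Construct an etcd-backed rendezvous handler here; see
        # torchelastic.rendezvous for the factory available in your version.
        rdzv_handler = ...

        spec = WorkerSpec(
            role="trainer",
            local_world_size=8,        # trainers per node
            fn=trainer,
            args=(),
            rdzv_handler=rdzv_handler,
            max_restarts=3,
            monitor_interval=1,
        )
        agent = LocalElasticAgent(spec, start_method="spawn")
        agent.run(spec.role)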