Quickstart
===========

.. code-block:: bash

    pip install torchelastic

    # start a single-node etcd server on ONE host
    etcd --enable-v2 \
         --listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001 \
         --advertise-client-urls PUBLIC_HOSTNAME:2379

To launch a **fault-tolerant** job, run the following on all nodes.

.. code-block:: bash

    python -m torchelastic.distributed.launch \
            --nnodes=NUM_NODES \
            --nproc_per_node=TRAINERS_PER_NODE \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

To launch an **elastic** job, run the following on at least ``MIN_SIZE`` nodes
and at most ``MAX_SIZE`` nodes.

.. code-block:: bash

    python -m torchelastic.distributed.launch \
            --nnodes=MIN_SIZE:MAX_SIZE \
            --nproc_per_node=TRAINERS_PER_NODE \
            --rdzv_id=JOB_ID \
            --rdzv_backend=etcd \
            --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
            YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

.. note:: The ``--standalone`` option can be passed to launch a single-node job with
          a sidecar rendezvous server. You do not have to pass ``--rdzv_id``,
          ``--rdzv_endpoint``, and ``--rdzv_backend`` when the ``--standalone``
          option is used.

.. note:: Learn more about writing your distributed training script
          `here <train_script.html>`_.

If ``torchelastic.distributed.launch`` does not meet your requirements, you may
use our APIs directly for more powerful customization. Start by taking a look at
the `elastic agent <agent.html>`_ API.
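
The launch commands above start ``TRAINERS_PER_NODE`` copies of
``YOUR_TRAINING_SCRIPT.py`` on each node. As a minimal, illustrative sketch (not
the canonical template), the script below assumes the launcher exports
``LOCAL_RANK``, ``RANK``, ``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``
to every worker process; see the training script guide linked above for the
exact contract.

.. code-block:: python

    # Minimal sketch of a training script entry point. Assumes the launcher
    # sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in the
    # environment of each worker it spawns.
    import os

    import torch.distributed as dist


    def main():
        # Per-worker GPU/device index, as exported by the launcher.
        local_rank = int(os.environ["LOCAL_RANK"])

        # The default "env://" init method reads MASTER_ADDR, MASTER_PORT,
        # RANK, and WORLD_SIZE from the environment.
        dist.init_process_group(backend="gloo")

        # ... build the model and optimizer, wrap the model in
        # torch.nn.parallel.DistributedDataParallel, run the training loop ...
        print(
            f"worker rank={dist.get_rank()} local_rank={local_rank} "
            f"world_size={dist.get_world_size()}"
        )

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

Because the rendezvous information arrives through environment variables, the
same script can be used unchanged with both the fault-tolerant and the elastic
launch commands.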