
Multinode Training

Author: Suraj Subramanian

What you will learn
  • Launching multinode training jobs with torchrun

  • Code changes (and things to keep in mind) when moving from single-node to multinode training.

View the code used in this tutorial on GitHub

Prerequisites

  • Familiarity with multi-GPU training and torchrun

  • 2 or more TCP-reachable GPU machines (this tutorial uses AWS p3.2xlarge instances)

  • PyTorch installed with CUDA on all machines

Follow along with the video below or on YouTube.

Multinode training involves deploying a training job across several machines. There are two ways to do this:

  • running a torchrun command on each machine with identical rendezvous arguments, or

  • deploying it on a compute cluster using a workload manager (like SLURM)

In this video we will go over the (minimal) code changes required to move from single-node multigpu to multinode training, and run our training script in both of the above ways.
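For the first approach, the launch command might look like the sketch below. The rendezvous endpoint address, job ID, and the script arguments (epochs and save frequency for this tutorial's multinode.py) are placeholders you should replace with your own values:

```shell
# Run this same command on every node. All nodes must agree on the
# rendezvous arguments (--rdzv_id, --rdzv_backend, --rdzv_endpoint)
# so they can discover each other and form one job.
torchrun \
    --nnodes=2 \
    --nproc_per_node=4 \
    --rdzv_id=456 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=172.31.43.139:29603 \
    multinode.py 50 10
```

The c10d rendezvous backend needs no extra infrastructure; any one node's address can serve as the endpoint, and torchrun handles assigning ranks to the processes it spawns.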

Note that multinode training is bottlenecked by inter-node communication latencies. Running a training job on 4 GPUs on a single node will be faster than running it on 4 nodes with 1 GPU each.

Local and Global ranks

In single-node settings, we were tracking the gpu_id of each device running our training process. torchrun tracks this value in an environment variable LOCAL_RANK which uniquely identifies each GPU-process on a node. For a unique identifier across all the nodes, torchrun provides another variable RANK which refers to the global rank of a process.
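Inside the training script, both values are read from the environment. The helper name below is illustrative, not part of the tutorial's code:

```python
import os

def get_ranks():
    """Read the rank identifiers that torchrun sets for each process."""
    local_rank = int(os.environ["LOCAL_RANK"])   # unique per process on this node
    global_rank = int(os.environ["RANK"])        # unique across the whole job
    return local_rank, global_rank
```

LOCAL_RANK is what you pass to torch.cuda.set_device so each process drives its own GPU; RANK is typically used for job-wide concerns like logging or saving checkpoints from a single process.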


Warning: Do not use RANK for critical logic in your training job. When torchrun restarts processes after a failure or a membership change, there is no guarantee that each process will keep the same LOCAL_RANK and RANK.

Heterogeneous Scaling

Torchrun supports heterogeneous scaling, i.e., each of your multinode machines can have a different number of GPUs participating in the training job. In the video, I deployed the code on 2 machines, where one machine has 4 GPUs and the other used only 2 GPUs.
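Concretely, the two launch commands differ only in --nproc_per_node; the endpoint address and job ID below are placeholders:

```shell
# On the machine with 4 GPUs:
torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=456 \
    --rdzv_backend=c10d --rdzv_endpoint=172.31.43.139:29603 multinode.py 50 10

# On the machine with 2 GPUs:
torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=456 \
    --rdzv_backend=c10d --rdzv_endpoint=172.31.43.139:29603 multinode.py 50 10
```

Torchrun assigns contiguous global RANK values across both nodes (0–5 in this example), so the training code itself needs no changes to handle the uneven GPU counts.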


Troubleshooting

  • Ensure that your nodes are able to communicate with each other over TCP.

  • Set env variable NCCL_DEBUG to INFO (using export NCCL_DEBUG=INFO) to print verbose logs that can help diagnose the issue.

  • Sometimes you might need to explicitly set the network interface for the distributed backend (export NCCL_SOCKET_IFNAME=eth0).
