Multinode Training

Created On: Sep 27, 2022 | Last Updated: Jul 10, 2024 | Last Verified: Nov 05, 2024

Author: Suraj Subramanian

What you will learn
  • Launching multinode training jobs with torchrun

  • Code changes (and things to keep in mind) when moving from single-node to multinode training.

View the code used in this tutorial on GitHub

Prerequisites
  • Familiarity with multi-GPU training and torchrun

  • 2 or more TCP-reachable GPU machines (this tutorial uses AWS p3.2xlarge instances)

  • PyTorch installed with CUDA on all machines

Follow along with the video below or on YouTube.

Multinode training involves deploying a training job across several machines. There are two ways to do this:

  • running a torchrun command on each machine with identical rendezvous arguments, or

  • deploying it on a compute cluster using a workload manager (like SLURM)

In this video we will go over the (minimal) code changes required to move from single-node multi-GPU to multinode training, and run our training script in both of the above ways.
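
For the first approach, a minimal sketch of the launch command is shown below. The script name (multinode.py), its arguments, and the rendezvous endpoint are illustrative placeholders; run the same command on every node, pointing --rdzv_endpoint at a host:port that all nodes can reach.

    # Run on every participating node (example values; adjust to your setup)
    torchrun \
        --nnodes=2 \
        --nproc_per_node=4 \
        --rdzv_id=456 \
        --rdzv_backend=c10d \
        --rdzv_endpoint=172.31.43.139:29603 \
        multinode.py 50 10

With the c10d rendezvous backend, any reachable host can serve as the rendezvous endpoint, and torchrun assigns RANK and WORLD_SIZE automatically once all nodes join. For the SLURM route, the same command is typically wrapped in a job script and launched by the workload manager.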

Note that multinode training is bottlenecked by inter-node communication latencies. Running a training job on 4 GPUs on a single node will be faster than running it on 4 nodes with 1 GPU each.

Local and Global ranks

In single-node settings, we were tracking the gpu_id of each device running our training process. torchrun tracks this value in the environment variable LOCAL_RANK, which uniquely identifies each GPU process on a node. For a unique identifier across all nodes, torchrun provides another variable, RANK, which refers to the global rank of a process.
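
As a minimal sketch (assuming the NCCL backend and one process per GPU, as in this tutorial series), the setup function only needs these two variables; torchrun also sets WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, which init_process_group reads from the environment:

    import os
    import torch
    from torch.distributed import init_process_group

    def ddp_setup():
        # torchrun sets these environment variables for every process it launches
        local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
        global_rank = int(os.environ["RANK"])       # unique ID across all nodes
        torch.cuda.set_device(local_rank)           # bind this process to its GPU
        init_process_group(backend="nccl")          # RANK/WORLD_SIZE read from env
        return local_rank, global_rank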

Warning

Do not use RANK for critical logic in your training job. When torchrun restarts processes after a failure or a membership change, there is no guarantee that the processes will hold the same LOCAL_RANK and RANK values.
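
A safe pattern is to keep device selection tied to LOCAL_RANK and make checkpointing rank-agnostic: one process (whichever currently holds global rank 0) writes a shared snapshot, and every process restores from that same snapshot after a restart. The sketch below illustrates the idea; the path and helper names are illustrative, not the tutorial's exact code.

    import os
    import torch

    SNAPSHOT_PATH = "snapshot.pt"  # illustrative path on storage visible to all nodes

    def save_snapshot(ddp_model, epochs_run):
        # Gating on RANK == 0 only picks a single writer; no per-rank state is kept,
        # so it does not matter which process holds rank 0 after a restart.
        if int(os.environ["RANK"]) == 0:
            torch.save({"MODEL_STATE": ddp_model.module.state_dict(),
                        "EPOCHS_RUN": epochs_run}, SNAPSHOT_PATH)

    def load_snapshot(model, local_rank):
        # Every process restores from the same shared snapshot, so training resumes
        # correctly even if processes come back with different RANK values.
        snapshot = torch.load(SNAPSHOT_PATH, map_location=f"cuda:{local_rank}")
        model.load_state_dict(snapshot["MODEL_STATE"])
        return snapshot["EPOCHS_RUN"]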

Heterogeneous Scaling

torchrun supports heterogeneous scaling, i.e., each of your multinode machines can have a different number of GPUs participating in the training job. In the video, I deployed the code on 2 machines, where one machine uses 4 GPUs and the other uses only 2 GPUs.
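
As a sketch, the only launch-time difference between such machines is --nproc_per_node; the rendezvous arguments stay identical (the endpoint and script arguments below are placeholders):

    # Machine with 4 GPUs
    torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=456 \
        --rdzv_backend=c10d --rdzv_endpoint=172.31.43.139:29603 multinode.py 50 10

    # Machine with 2 GPUs
    torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=456 \
        --rdzv_backend=c10d --rdzv_endpoint=172.31.43.139:29603 multinode.py 50 10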

Troubleshooting

  • Ensure that your nodes are able to communicate with each other over TCP.

  • Set the environment variable NCCL_DEBUG to INFO (using export NCCL_DEBUG=INFO) to print verbose logs that can help diagnose issues.

  • Sometimes you might need to explicitly set the network interface for the distributed backend (export NCCL_SOCKET_IFNAME=eth0). Read more about this here.
