Distributed and Parallel Training Tutorials¶
Created On: Oct 04, 2022 | Last Updated: Oct 31, 2024 | Last Verified: Nov 05, 2024
Distributed training is a model training paradigm in which the training workload is spread across multiple worker nodes, which can significantly improve training speed and, in some cases, model accuracy. While distributed training can be used for any type of ML model training, it is most beneficial for large models and compute-demanding tasks such as deep learning.
There are a few ways you can perform distributed training in PyTorch, each with its own advantages in certain use cases: DistributedDataParallel (DDP), Fully Sharded Data Parallel (FSDP), Tensor Parallel (TP), DeviceMesh, Remote Procedure Call (RPC) distributed training, and custom extensions.
Read more about these options in Distributed Overview.
Learn DDP¶
A step-by-step video series on how to get started with DistributedDataParallel and advance to more complex topics
This tutorial provides a short and gentle introduction to PyTorch DistributedDataParallel (DDP).
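For orientation, here is a minimal sketch of what DDP usage looks like, assuming the script is launched with torchrun (for example, torchrun --nproc_per_node=2 ddp_sketch.py) so that the rank and world size come from the environment; the model, data, and hyperparameters are placeholders.

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, WORLD_SIZE, and MASTER_ADDR/MASTER_PORT for us.
        dist.init_process_group(backend="gloo")  # use "nccl" for GPU training

        model = nn.Linear(10, 10)
        ddp_model = DDP(model)  # gradients are all-reduced across ranks in backward()

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()

        inputs = torch.randn(20, 10)   # placeholder batch
        labels = torch.randn(20, 10)

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), labels)
        loss.backward()                # gradient synchronization happens here
        optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()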
This tutorial describes the Join context manager and demonstrates its use with DistributedDataParallel.
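As a rough illustration of the idea, the sketch below (assuming a torchrun launch with 2 processes and the gloo backend) gives each rank a different number of inputs; Join lets ranks that run out of data shadow the collective operations of the ranks that are still training.

    import torch
    import torch.distributed as dist
    from torch.distributed.algorithms.join import Join
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("gloo")
    rank = dist.get_rank()

    model = DDP(torch.nn.Linear(1, 1))
    # Rank r gets r + 1 inputs, so the ranks have uneven amounts of work.
    inputs = [torch.tensor([1.0]) for _ in range(rank + 1)]

    # Join shadows DDP's collectives for ranks that exhaust their inputs early,
    # so the remaining ranks do not hang waiting for them.
    with Join([model]):
        for x in inputs:
            model(x).backward()

    dist.destroy_process_group()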
Learn FSDP¶
This tutorial demonstrates how you can perform distributed training with FSDP on the MNIST dataset.
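To give a flavor of the API before diving into the tutorial, here is a minimal sketch of wrapping a small classifier in FSDP, assuming a node with CUDA GPUs and a torchrun launch; the MNIST-shaped data here is a random placeholder rather than the real dataset.

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def main():
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        # A small classifier for 28x28 inputs (MNIST-shaped).
        model = nn.Sequential(
            nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)
        ).cuda()

        # FSDP shards parameters, gradients, and optimizer state across ranks.
        model = FSDP(model)

        # Create the optimizer after wrapping, so it sees the sharded parameters.
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        images = torch.randn(32, 1, 28, 28, device="cuda")    # placeholder batch
        targets = torch.randint(0, 10, (32,), device="cuda")

        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(images), targets)
        loss.backward()
        optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()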
In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5 model with FSDP for text summarization.
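The key FSDP feature that tutorial relies on is a transformer auto-wrap policy, which puts each transformer block into its own FSDP unit. The sketch below shows the pattern with a plain nn.TransformerEncoderLayer standing in for the HF T5Block used in the tutorial; it assumes CUDA GPUs and a torchrun launch.

    import functools
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Stand-in for a T5 model: the encoder layer class plays the role of T5Block.
    model = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6
    ).cuda()

    # Wrap each transformer block in its own FSDP unit so parameters are
    # gathered one block at a time during forward and backward.
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={nn.TransformerEncoderLayer},
    )
    model = FSDP(model, auto_wrap_policy=wrap_policy)

    out = model(torch.randn(32, 8, 512, device="cuda"))  # (seq_len, batch, d_model)
    dist.destroy_process_group()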
Learn Tensor Parallel (TP)¶
This tutorial demonstrates how to train a large Transformer-like model across hundreds to thousands of GPUs using Tensor Parallel and Fully Sharded Data Parallel.
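As a taste of the API, here is a minimal sketch of the tensor-parallel half, assuming a single node with 2 GPUs launched via torchrun; the small MLP and its column-/row-wise plan are placeholders for the transformer blocks used in the tutorial.

    import torch
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    class MLP(nn.Module):
        def __init__(self, dim=16):
            super().__init__()
            self.up = nn.Linear(dim, 4 * dim)
            self.down = nn.Linear(4 * dim, dim)

        def forward(self, x):
            return self.down(torch.relu(self.up(x)))

    # A 1-D mesh over the 2 GPUs on this node; torchrun provides the rank info.
    tp_mesh = init_device_mesh("cuda", (2,))

    torch.manual_seed(0)  # same initialization and input on every rank
    model = MLP().cuda()

    # Shard the first linear column-wise and the second row-wise, so the
    # intermediate activations stay sharded and only one all-reduce is needed.
    model = parallelize_module(
        model,
        tp_mesh,
        {"up": ColwiseParallel(), "down": RowwiseParallel()},
    )

    out = model(torch.randn(8, 16, device="cuda"))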
Learn DeviceMesh¶
In this tutorial you will learn about DeviceMesh and how it can help with distributed training.
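As a preview, the sketch below builds a 2-D mesh and pulls out the per-dimension process groups, assuming a single node with 8 GPUs launched via torchrun; the dimension names are illustrative.

    from torch.distributed.device_mesh import init_device_mesh

    # A 2 x 4 mesh over 8 GPUs; DeviceMesh creates and manages the underlying
    # process groups, so there is no manual new_group() bookkeeping.
    mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

    # Slice out a 1-D sub-mesh per dimension and grab its process group,
    # e.g. to hand to a parallelism API or a collective call.
    replicate_group = mesh_2d["replicate"].get_group()
    shard_group = mesh_2d["shard"].get_group()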
Learn RPC¶
This tutorial demonstrates how to get started with RPC-based distributed training.
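For a sense of the basic API, here is a minimal sketch with two workers, assuming a torchrun launch with --nproc_per_node=2 so that the rank and rendezvous information come from the environment; the worker names and the add function are placeholders.

    import os
    import torch
    import torch.distributed.rpc as rpc

    def add_tensors(a, b):
        return a + b

    def main():
        rank = int(os.environ["RANK"])
        rpc.init_rpc(f"worker{rank}", rank=rank, world_size=2)

        if rank == 0:
            # Run add_tensors on worker1 and block until the result comes back.
            result = rpc.rpc_sync("worker1", add_tensors,
                                  args=(torch.ones(2), torch.ones(2)))
            print(result)  # tensor([2., 2.])

        rpc.shutdown()  # waits until all workers are done before tearing down

    if __name__ == "__main__":
        main()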
This tutorial walks you through a simple example of implementing a parameter server using PyTorch’s Distributed RPC framework.
In this tutorial you will build batch-processing RPC applications with the @rpc.functions.async_execution decorator.
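The decorator lets an RPC target return a Future instead of a value, so the callee's RPC thread is not blocked while work is pending. A minimal sketch, assuming the same two-worker torchrun launch as above (the function and worker names are placeholders):

    import os
    import torch
    import torch.distributed.rpc as rpc

    @rpc.functions.async_execution
    def async_add(to, x, y):
        # Return a Future; the reply is sent to the caller only once the nested
        # rpc_async to `to` completes, without tying up an RPC thread meanwhile.
        return rpc.rpc_async(to, torch.add, args=(x, y))

    def main():
        rank = int(os.environ["RANK"])
        rpc.init_rpc(f"worker{rank}", rank=rank, world_size=2)

        if rank == 0:
            # worker0 asks worker1 to compute the sum, which worker1 delegates
            # back to worker0 asynchronously.
            out = rpc.rpc_sync("worker1", async_add,
                               args=("worker0", torch.ones(2), torch.ones(2)))
            print(out)  # tensor([2., 2.])

        rpc.shutdown()

    if __name__ == "__main__":
        main()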
In this tutorial you will learn how to combine distributed data parallelism with distributed model parallelism.
Custom Extensions¶
In this tutorial, you will learn how to implement a custom ProcessGroup backend and plug it into the PyTorch distributed package using C++ extensions.
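Building the C++ side is the bulk of that tutorial; the Python half is small. Below is a rough sketch of registering and using such a backend, where dummy_collectives and createProcessGroupDummy are illustrative names for a compiled extension and its constructor, not real packages.

    import os
    import torch
    import torch.distributed as dist

    # Hypothetical compiled C++ extension exposing a ProcessGroup constructor.
    import dummy_collectives

    # Make "dummy" available as a backend name for init_process_group.
    dist.Backend.register_backend("dummy", dummy_collectives.createProcessGroupDummy)

    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("dummy", rank=0, world_size=1)

    x = torch.ones(4)
    dist.all_reduce(x)  # dispatched to the custom backend's allreduce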