Multiprocessing best practices

torch.multiprocessing is a drop in replacement for Python’s multiprocessing module. It supports the exact same operations, but extends it, so that all tensors sent through a multiprocessing.Queue, will have their data moved into shared memory and will only send a handle to another process.

Note

When a Tensor is sent to another process, the Tensor data is shared. If torch.Tensor.grad is not None, it is also shared. After a Tensor without a torch.Tensor.grad field is sent to the other process, it creates a standard process-specific .grad Tensor that is not automatically shared across all processes, unlike how the Tensor’s data has been shared.

This allows to implement various training methods, like Hogwild, A3C, or any others that require asynchronous operation.

CUDA in multiprocessing

The CUDA runtime does not support the fork start method; either the spawn or forkserver start method are required to use CUDA in subprocesses.

Note

The start method can be set via either creating a context with multiprocessing.get_context(...) or directly using multiprocessing.set_start_method(...).

Unlike CPU tensors, the sending process is required to keep the original tensor as long as the receiving process retains a copy of the tensor. It is implemented under the hood but requires users to follow the best practices for the program to run correctly. For example, the sending process must stay alive as long as the consumer process has references to the tensor, and the refcounting can not save you if the consumer process exits abnormally via a fatal signal. See this section.

Best practices and tips

Avoiding and fighting deadlocks

There are a lot of things that can go wrong when a new process is spawned, with the most common cause of deadlocks being background threads. If there’s any thread that holds a lock or imports a module, and fork is called, it’s very likely that the subprocess will be in a corrupted state and will deadlock or fail in a different way. Note that even if you don’t, Python built in libraries do - no need to look further than multiprocessing. multiprocessing.Queue is actually a very complex class, that spawns multiple threads used to serialize, send and receive objects, and they can cause aforementioned problems too. If you find yourself in such situation try using a SimpleQueue, that doesn’t use any additional threads.

We’re trying our best to make it easy for you and ensure these deadlocks don’t happen but some things are out of our control. If you have any issues you can’t cope with for a while, try reaching out on forums, and we’ll see if it’s an issue we can fix.

Reuse buffers passed through a Queue

Remember that each time you put a Tensor into a multiprocessing.Queue, it has to be moved into shared memory. If it’s already shared, it is a no-op, otherwise it will incur an additional memory copy that can slow down the whole process. Even if you have a pool of processes sending data to a single one, make it send the buffers back - this is nearly free and will let you avoid a copy when sending next batch.

Asynchronous multiprocess training (e.g. Hogwild)

Using torch.multiprocessing, it is possible to train a model asynchronously, with parameters either shared all the time, or being periodically synchronized. In the first case, we recommend sending over the whole model object, while in the latter, we advise to only send the state_dict().

We recommend using multiprocessing.Queue for passing all kinds of PyTorch objects between processes. It is possible to e.g. inherit the tensors and storages already in shared memory, when using the fork start method, however it is very bug prone and should be used with care, and only by advanced users. Queues, even though they’re sometimes a less elegant solution, will work properly in all cases.

Warning

You should be careful about having global statements, that are not guarded with an if __name__ == '__main__'. If a different start method than fork is used, they will be executed in all subprocesses.

Hogwild

A concrete Hogwild implementation can be found in the examples repository, but to showcase the overall structure of the code, there’s also a minimal example below as well:

import torch.multiprocessing as mp
from model import MyModel

def train(model):
    # Construct data_loader, optimizer, etc.
    for data, labels in data_loader:
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()
        optimizer.step()  # This will update the shared parameters

if __name__ == '__main__':
    num_processes = 4
    model = MyModel()
    # NOTE: this is required for the ``fork`` method to work
    model.share_memory()
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

CPU in multiprocessing

Inappropriate multiprocessing can lead to CPU oversubscription, causing different processes to compete for CPU resources, resulting in low efficiency.

This tutorial will explain what CPU oversubscription is and how to avoid it.

CPU oversubscription

CPU oversubscription is a technical term that refers to a situation where the total number of vCPUs allocated to a system exceeds the total number of vCPUs available on the hardware.

This leads to severe contention for CPU resources. In such cases, there is frequent switching between processes, which increases processes switching overhead and decreases overall system efficiency.

See CPU oversubscription with the code examples in the Hogwild implementation found in the example repository.

When running the training example with the following command on CPU using 4 processes:

python main.py --num-processes 4

Assuming there are N vCPUs available on the machine, executing the above command will generate 4 subprocesses. Each subprocess will allocate N vCPUs for itself, resulting in a requirement of 4*N vCPUs. However, the machine only has N vCPUs available. Consequently, the different processes will compete for resources, leading to frequent process switching.

The following observations indicate the presence of CPU over subscription:

High CPU Utilization: By using the htop command, you can observe that the CPU utilization is consistently high, often reaching or exceeding its maximum capacity. This indicates that the demand for CPU resources exceeds the available physical cores, causing contention and competition among processes for CPU time.
Frequent Context Switching with Low System Efficiency: In an oversubscribed CPU scenario, processes compete for CPU time, and the operating system needs to rapidly switch between different processes to allocate resources fairly. This frequent context switching adds overhead and reduces the overall system efficiency.

Avoid CPU oversubscription

A good way to avoid CPU oversubscription is proper resource allocation. Ensure that the number of processes or threads running concurrently does not exceed the available CPU resources.

In this case, a solution would be to specify the appropriate number of threads in the subprocesses. This can be achieved by setting the number of threads for each process using the torch.set_num_threads(int) function in subprocess.

Assuming there are N vCPUs on the machine and M processes will be generated, the maximum num_threads value used by each process would be floor(N/M). To avoid CPU oversubscription in the mnist_hogwild example, the following changes are needed for the file train.py in example repository.

def train(rank, args, model, device, dataset, dataloader_kwargs):
    torch.manual_seed(args.seed + rank)

    #### define the num threads used in current sub-processes
    torch.set_num_threads(floor(N/M))

    train_loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs)

    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
    for epoch in range(1, args.epochs + 1):
        train_epoch(epoch, args, model, device, train_loader, optimizer)

Set num_thread for each process using torch.set_num_threads(floor(N/M)). where you replace N with the number of vCPUs available and M with the chosen number of processes. The appropriate num_thread value will vary depending on the specific task at hand. However, as a general guideline, the maximum value for the num_thread should be floor(N/M) to avoid CPU oversubscription. In the mnist_hogwild training example, after avoiding CPU over subscription, you can achieve a 30x performance boost.

Multiprocessing best practices

CUDA in multiprocessing

Best practices and tips

Avoiding and fighting deadlocks

Reuse buffers passed through a Queue

Asynchronous multiprocess training (e.g. Hogwild)

Hogwild

CPU in multiprocessing

CPU oversubscription

Avoid CPU oversubscription

Docs

Tutorials

Resources