Parallel#
- class ignite.distributed.launcher.Parallel(backend=None, nproc_per_node=None, nnodes=None, node_rank=None, master_addr=None, master_port=None, init_method=None, **spawn_kwargs)[source]#
Distributed launcher context manager to simplify distributed configuration setup for multiple backends:

- backends from native torch distributed configuration: “nccl”, “gloo” and “mpi” (if available)
- XLA on TPUs via pytorch/xla (if installed)
- using Horovod distributed framework (if installed)
Namely, it can:

1) Spawn nproc_per_node child processes and initialize a process group according to the provided backend (useful for standalone scripts).

2) Only initialize a process group given the backend (useful with tools like torchrun, horovodrun, etc.).

- Parameters
  - backend (Optional[str]) – backend to use: nccl, gloo, xla-tpu, horovod. If None, no distributed configuration.
  - nproc_per_node (Optional[int]) – optional argument, number of processes per node to specify. If not None, run() will spawn nproc_per_node processes that run the input function with its arguments.
  - nnodes (Optional[int]) – optional argument, number of nodes participating in distributed configuration. If not None, run() will spawn nproc_per_node processes that run the input function with its arguments. Total world size is nproc_per_node * nnodes. This option is only supported by the native torch distributed module. For other modules, please set up spawn_kwargs with backend-specific arguments.
  - node_rank (Optional[int]) – optional argument, current machine index. Mandatory argument if nnodes is specified and larger than one. This option is only supported by the native torch distributed module. For other modules, please set up spawn_kwargs with backend-specific arguments.
  - master_addr (Optional[str]) – optional argument, master node TCP/IP address for torch native backends (nccl, gloo). Mandatory argument if nnodes is specified and larger than one.
  - master_port (Optional[int]) – optional argument, master node port for torch native backends (nccl, gloo). Mandatory argument if master_addr is specified.
  - init_method (Optional[str]) – optional argument to specify the process group initialization method for torch native backends (nccl, gloo). Default: “env://”. See more info: dist.init_process_group.
  - spawn_kwargs (Any) – kwargs passed to the idist.spawn function.
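For orientation, here is a minimal sketch combining the spawn-related parameters above for one node of a hypothetical 2-node, 8-GPU-per-node setup; all values are illustrative placeholders, and complete scripts follow in the Examples below.

import ignite.distributed as idist

def training(local_rank, config):
    # local_rank is the local process index on this node
    print(idist.get_rank(), ": backend =", idist.backend(), "- config:", config)

# Placeholder values for illustration only (see the Examples below).
with idist.Parallel(
    backend="nccl",        # or "gloo", "xla-tpu", "horovod"
    nproc_per_node=8,      # spawn 8 processes on this node
    nnodes=2,              # 2 nodes in total (native torch backends only)
    node_rank=0,           # index of this node; mandatory when nnodes > 1
    master_addr="master",  # master node address (native torch backends)
    master_port=15000,     # master node port
) as parallel:             # init_method defaults to "env://"
    parallel.run(training, {"key": "value"})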
Examples
1) Single node or Multi-node, Multi-GPU training launched with torchrun or horovodrun tools
Single node option with 4 GPUs
torchrun --nproc_per_node=4 main.py
# or if horovod is installed
horovodrun -np=4 python main.py
Multi-node option: 2 nodes with 8 GPUs each
## node 0
torchrun --nnodes=2 --node_rank=0 --master_addr=master --master_port=3344 --nproc_per_node=8 main.py
# or if horovod is installed
horovodrun -np 16 -H hostname1:8,hostname2:8 python main.py

## node 1
torchrun --nnodes=2 --node_rank=1 --master_addr=master --master_port=3344 --nproc_per_node=8 main.py
User code is the same for both options:
# main.py

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

backend = "nccl"  # or "horovod" if package is installed

config = {"key": "value"}

with idist.Parallel(backend=backend) as parallel:
    parallel.run(training, config, a=1, b=2)
2) Single node, Multi-GPU training launched with python
python main.py
# main.py

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

backend = "nccl"  # or "horovod" if package is installed

config = {"key": "value"}

# no "init_method" was specified, so "env://" will be used
with idist.Parallel(backend=backend, nproc_per_node=4) as parallel:
    parallel.run(training, config, a=1, b=2)
Initializing the process group using file://

with idist.Parallel(backend=backend, init_method='file:///d:/tmp/some_file', nproc_per_node=4) as parallel:
    parallel.run(training, config, a=1, b=2)
Initializing the process group using tcp://

with idist.Parallel(backend=backend, init_method='tcp://10.1.1.20:23456', nproc_per_node=4) as parallel:
    parallel.run(training, config, a=1, b=2)
3) Single node, Multi-TPU training launched with python
python main.py
# main.py

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

config = {"key": "value"}

with idist.Parallel(backend="xla-tpu", nproc_per_node=8) as parallel:
    parallel.run(training, config, a=1, b=2)
4) Multi-node, Multi-GPU training launched with python. For example, 2 nodes with 8 GPUs each:
Using torch native distributed framework:
# node 0
python main.py --node_rank=0

# node 1
python main.py --node_rank=1
# main.py

import ignite.distributed as idist

def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

dist_config = {
    "nproc_per_node": 8,
    "nnodes": 2,
    "node_rank": args.node_rank,
    "master_addr": "master",
    "master_port": 15000
}

config = {"key": "value"}

with idist.Parallel(backend="nccl", **dist_config) as parallel:
    parallel.run(training, config, a=1, b=2)
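The script above reads args.node_rank without showing how it is parsed. A minimal sketch of that missing piece, assuming argparse and the --node_rank flag from the launch commands above, could be:

# Hypothetical argument parsing for the --node_rank flag used above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--node_rank", type=int, required=True,
                    help="index of the current node, e.g. 0 or 1")
args = parser.parse_args()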
Changed in version 0.4.2: backend now accepts horovod distributed framework.

Changed in version 0.4.5: init_method added.

Methods
run: Execute func with provided arguments in distributed context.

- run(func, *args, **kwargs)[source]#
Execute func with provided arguments in distributed context.

- Parameters
  - func (Callable) – function to execute. As the examples show, its first argument is the local process index (local_rank), followed by the provided args and kwargs.
  - args (Any) – positional arguments passed to func.
  - kwargs (Any) – keyword arguments passed to func.
- Return type
None
Examples
def training(local_rank, config, **kwargs):
    # ...
    print(idist.get_rank(), ": run with config:", config, "- backend=", idist.backend())
    # ...

config = {"key": "value"}

with idist.Parallel(backend=backend) as parallel:
    parallel.run(training, config, a=1, b=2)