PYTORCH ProcessGroupNCCL Environment Variables

For more information on the environment variables, see ProcessGroupNCCL Environment Variables.

Variable

Description

TORCH_NCCL_ASYNC_ERROR_HANDLING

Control how asynchronous NCCL errors are handled when the watchdog observes an exception. If set to 0, asynchronous NCCL errors are not handled. If set to 1, the NCCL communicator is aborted and the process is torn down upon error. If set to 2, only the NCCL communicator is aborted. If set to 3, the process is torn down without aborting the NCCL communicator. The default is 3.
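As a minimal sketch of how the four modes might be selected from Python, the snippet below validates a mode and exports it for the current process. The helper name `set_async_error_handling` and the dictionary are hypothetical and are not part of PyTorch. The snippet also assumes the variable is set before the NCCL process group is created.

```python
import os

# Meaning of each value, paraphrased from the description above.
ASYNC_ERROR_HANDLING_MODES = {
    0: "no handling of asynchronous NCCL errors",
    1: "abort the NCCL communicator and tear down the process",
    2: "abort the NCCL communicator only",
    3: "tear down the process without aborting the NCCL communicator",
}


def set_async_error_handling(mode: int) -> str:
    """Hypothetical helper: validate `mode` and export it for this process."""
    if mode not in ASYNC_ERROR_HANDLING_MODES:
        valid = sorted(ASYNC_ERROR_HANDLING_MODES)
        raise ValueError(f"mode must be one of {valid}, got {mode!r}")
    # Environment variable values are always strings.
    os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = str(mode)
    return ASYNC_ERROR_HANDLING_MODES[mode]
```

Calling `set_async_error_handling(2)` would export the value `2` and return the matching description, which can be handy for logging the chosen policy at job start.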

TORCH_NCCL_HIGH_PRIORITY

Control whether to use a high-priority stream for the NCCL communicator.

TORCH_NCCL_BLOCKING_WAIT

Control whether wait() is blocking or non-blocking.

TORCH_NCCL_DUMP_ON_TIMEOUT

Control whether debug info is dumped when a watchdog timeout or exception is detected. This variable must be set together with TORCH_NCCL_TRACE_BUFFER_SIZE set to a value larger than 0.

TORCH_NCCL_DESYNC_DEBUG

Control whether Desync Debug is enabled. This helps identify the rank responsible for a collective desync.

TORCH_NCCL_ENABLE_TIMING

If set to 1, record start events for all ProcessGroupNCCL collectives and compute accurate timing for each collective.

TORCH_NCCL_ENABLE_MONITORING

If set to 1, enable a monitoring thread that aborts the process when the ProcessGroupNCCL watchdog thread gets stuck and no heartbeat is detected within TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC. The watchdog can get stuck when it calls CUDA/NCCL APIs that hang. This is useful to prevent jobs from being stuck longer than necessary and tying up cluster resources.

TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC

Control the watchdog heartbeat timeout period after which the monitoring thread will abort the process.
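The monitoring switch and the heartbeat timeout are normally configured together. The sketch below shows one way to do that from Python. The helper name `enable_heartbeat_monitor` is hypothetical, and any concrete timeout passed to it is the caller's choice, not a recommended value. It is assumed that both variables need to be in the environment before the process group is created.

```python
import os


def enable_heartbeat_monitor(timeout_sec: int) -> dict:
    """Hypothetical helper: enable the monitoring thread with a heartbeat timeout."""
    if timeout_sec <= 0:
        raise ValueError("timeout_sec must be a positive number of seconds")
    settings = {
        "TORCH_NCCL_ENABLE_MONITORING": "1",
        "TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC": str(timeout_sec),
    }
    # Export both values for the current process and its children.
    os.environ.update(settings)
    return settings
```

Returning the settings as a dictionary makes it easy to pass the same values on to worker processes launched with an explicit environment.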

TORCH_NCCL_TRACE_BUFFER_SIZE

The maximum number of events stored in the flight recorder’s ring buffer. One event could be the start or end of a collective, for example. Set to 0 to disable the trace buffer and the debugging info dump.
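Because dumping on timeout only works when the trace buffer is non-empty, it can help to check the two settings together before launching a job. The following is a minimal sketch of such a check. The function name `flight_recorder_env` is hypothetical, and using `"1"`/`"0"` as the on/off values for TORCH_NCCL_DUMP_ON_TIMEOUT is an assumption, since the table does not state the accepted values.

```python
def flight_recorder_env(dump_on_timeout: bool, trace_buffer_size: int) -> dict:
    """Hypothetical helper: build a consistent pair of flight recorder settings."""
    if trace_buffer_size < 0:
        raise ValueError("trace_buffer_size must be >= 0")
    if dump_on_timeout and trace_buffer_size == 0:
        # The dump needs recorded events, so a size of 0 contradicts the request.
        raise ValueError(
            "TORCH_NCCL_DUMP_ON_TIMEOUT requires TORCH_NCCL_TRACE_BUFFER_SIZE > 0"
        )
    return {
        # "1"/"0" as boolean values is an assumption of this sketch.
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "1" if dump_on_timeout else "0",
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(trace_buffer_size),
    }
```

The returned dictionary can be merged into `os.environ` or handed to a launcher. The function itself does not touch the environment, so it is safe to call in tests.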

TORCH_NCCL_TRACE_CPP_STACK

Whether to collect C++ stack traces for the flight recorder. Default value is False.

TORCH_NCCL_COORD_CHECK_MILSEC

Control the interval inside the monitoring thread to check the coordinated signal from other ranks, e.g. to dump the debugging information. Default value is 1000 ms.

TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC

Control how much extra time to wait for the debugging info to be dumped before exiting and throwing a timeout exception.

TORCH_NCCL_DEBUG_INFO_TEMP_FILE

The file into which the debugging info is dumped.

TORCH_NCCL_DEBUG_INFO_PIPE_FILE

The pipe file used to trigger a debugging dump manually. Writing anything into the pipe triggers the dump.
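Triggering a dump manually amounts to writing a few bytes into a named pipe. The sketch below shows one way to do that from Python on a POSIX system. The function name `trigger_dump` is hypothetical, and `pipe_path` must be the actual pipe path on your system. How PyTorch derives that path from the variable is not covered by this table, so the sketch makes no assumption about it.

```python
import os


def trigger_dump(pipe_path: str, payload: bytes = b"1") -> int:
    """Hypothetical helper: write `payload` into an existing named pipe.

    Returns the number of bytes written.
    """
    # O_NONBLOCK makes open() fail immediately (ENXIO) if no process has the
    # pipe open for reading, instead of blocking forever.
    fd = os.open(pipe_path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        return os.write(fd, payload)
    finally:
        os.close(fd)
```

The payload content is arbitrary, since any write is described as sufficient to trigger the dump. Opening in non-blocking mode is a design choice of this sketch, so that a mistyped path or a job that is not listening fails fast with an `OSError`.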

TORCH_NCCL_NAN_CHECK

Control whether to enable the NaN check for the input. An error is thrown if a NaN is detected.
