PYTORCH ProcessGroupNCCL Environment Variables
For more information on the environment variables, see ProcessGroupNCCL Environment Variables.
| Variable | Description |
|---|---|
| | Controls whether a high-priority stream is used for the NCCL communicator. |
| | Controls whether wait() is blocking or non-blocking. |
| | Controls whether debug info is dumped when a watchdog timeout or exception is detected. This variable must be set together with TORCH_NCCL_TRACE_BUFFER_SIZE set to a value larger than 0. |
| | Controls whether Desync Debug is enabled. This helps identify the culprit rank when collectives desynchronize. |
| | If set to |
| | If set to |
| | Controls the watchdog heartbeat timeout period, after which the monitoring thread aborts the process. |
| | The maximum number of events stored in the flight recorder's ring buffer. One event could be the start or end of a collective, for example. Set to 0 to disable the trace buffer and the debugging info dump. |
| | Controls how much extra time to wait for the debugging info dump before exiting and throwing a timeout exception. |
| | The file into which the debugging info is dumped. |
| | The pipe file used to trigger a debugging dump manually; writing anything into the pipe triggers the dump. |
| | Controls whether the NaN check is enabled for the input. An error is thrown if a NaN is detected. |
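
As a rough illustration of how such a variable might be set from Python, the sketch below sizes the flight recorder's ring buffer through TORCH_NCCL_TRACE_BUFFER_SIZE, the one variable name spelled out in the descriptions above. The helper names `flight_recorder_env` and `apply_env` are hypothetical and not part of PyTorch, the value 2000 is an arbitrary example, and the sketch assumes the variable is read when the NCCL process group is created, so it should be set before that point.

```python
import os


def flight_recorder_env(trace_buffer_size: int) -> dict:
    """Build the environment setting for the flight recorder ring buffer.

    Hypothetical helper, not a PyTorch API. A size of 0 disables the
    trace buffer and the debugging info dump, per the table above.
    """
    if trace_buffer_size < 0:
        raise ValueError("trace buffer size must be non-negative")
    # Environment variable values are always strings.
    return {"TORCH_NCCL_TRACE_BUFFER_SIZE": str(trace_buffer_size)}


def apply_env(settings: dict, environ=os.environ) -> None:
    """Copy the settings into the given environment mapping."""
    environ.update(settings)


# Assumption: this runs before the NCCL process group is initialized.
settings = flight_recorder_env(2000)  # 2000 is an arbitrary example size
apply_env(settings)
print(os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"])
```

The same effect can be had by exporting the variable in the launching shell; setting it in code is mainly convenient when the value is computed per job.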