torch.cuda

This package adds support for CUDA tensor types, which implement the same functions as CPU tensors but utilize GPUs for computation.

It is lazily initialized, so you can always import it and use is_available() to determine whether your system supports CUDA.

CUDA semantics has more details about working with CUDA.
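
For example, a minimal sketch of the usual availability check:

```python
import torch

# torch.cuda can be imported even on machines without CUDA;
# guard any GPU work behind is_available().
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.ones(3, device=device)
```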

StreamContext

Context-manager that selects a given stream.

can_device_access_peer

Checks if peer access between two devices is possible.

current_blas_handle

Returns a cublasHandle_t pointer to the current cuBLAS handle.

current_device

Returns the index of the currently selected device.

current_stream

Returns the currently selected Stream for a given device.

default_stream

Returns the default Stream for a given device.

device

Context-manager that changes the selected device.
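
As a quick illustration, a sketch assuming at least two visible GPUs — tensors created inside the block land on the selected device:

```python
import torch

x = torch.empty(2, device="cuda")      # allocated on the current device, cuda:0
with torch.cuda.device(1):             # temporarily select GPU 1
    y = torch.empty(2, device="cuda")  # "cuda" now resolves to cuda:1
print(x.device, y.device)              # cuda:0 cuda:1
```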

device_count

Returns the number of GPUs available.

device_of

Context-manager that changes the current device to that of a given object.

get_arch_list

Returns the list of CUDA architectures this library was compiled for.

get_device_capability

Gets the CUDA capability of a device.

get_device_name

Gets the name of a device.

get_device_properties

Gets the properties of a device.
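
A small sketch combining the query helpers above (the example name and capability in the comments are illustrative):

```python
import torch

if torch.cuda.is_available():
    print(torch.cuda.device_count())             # number of visible GPUs
    print(torch.cuda.current_device())           # index of the selected device
    print(torch.cuda.get_device_name(0))         # e.g. "NVIDIA A100-SXM4-40GB"
    print(torch.cuda.get_device_capability(0))   # e.g. (8, 0)
    props = torch.cuda.get_device_properties(0)
    print(props.total_memory)                    # total device memory in bytes
```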

get_gencode_flags

Returns NVCC gencode flags this library was compiled with.

get_sync_debug_mode

Returns the current value of the debug mode for CUDA synchronizing operations.

init

Initialize PyTorch’s CUDA state.

ipc_collect

Force collects GPU memory after it has been released by CUDA IPC.

is_available

Returns a bool indicating if CUDA is currently available.

is_initialized

Returns whether PyTorch’s CUDA state has been initialized.

set_device

Sets the current device.

set_stream

Sets the current stream. This is a wrapper API for setting the stream.

set_sync_debug_mode

Sets the debug mode for CUDA synchronizing operations.

stream

Wrapper around the Context-manager StreamContext that selects a given stream.

synchronize

Waits for all kernels in all streams on a CUDA device to complete.
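
A hedged sketch of enqueueing work on a side stream and then waiting for the device:

```python
import torch

s = torch.cuda.Stream()          # a new stream on the current device
with torch.cuda.stream(s):       # kernels below are enqueued on s
    y = torch.randn(1000, device="cuda").sum()
torch.cuda.synchronize()         # block until every stream on the device is done
print(y.item())
```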

Random Number Generator

get_rng_state

Returns the random number generator state of the specified GPU as a ByteTensor.

get_rng_state_all

Returns a list of ByteTensors representing the random number generator states of all devices.

set_rng_state

Sets the random number generator state of the specified GPU.

set_rng_state_all

Sets the random number generator state of all devices.

manual_seed

Sets the seed for generating random numbers for the current GPU.

manual_seed_all

Sets the seed for generating random numbers on all GPUs.

seed

Sets the seed for generating random numbers to a random number for the current GPU.

seed_all

Sets the seed for generating random numbers to a random number on all GPUs.

initial_seed

Returns the current random seed of the current GPU.
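
A minimal sketch of seeding and of saving/restoring the generator state:

```python
import torch

torch.cuda.manual_seed_all(42)          # seed every visible GPU
state = torch.cuda.get_rng_state()      # ByteTensor for the current GPU
a = torch.rand(3, device="cuda")
torch.cuda.set_rng_state(state)         # rewind the generator
b = torch.rand(3, device="cuda")
assert torch.equal(a, b)                # same draws after restoring the state
```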

Communication collectives

comm.broadcast

Broadcasts a tensor to specified GPU devices.

comm.broadcast_coalesced

Broadcasts a sequence of tensors to the specified GPUs.

comm.reduce_add

Sums tensors from multiple GPUs.

comm.scatter

Scatters a tensor across multiple GPUs.

comm.gather

Gathers tensors from multiple GPU devices.
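
A sketch of the collectives above, assuming at least two visible GPUs:

```python
import torch
import torch.cuda.comm as comm

t = torch.arange(4.0, device="cuda:0")
copies = comm.broadcast(t, devices=[0, 1])   # one replica per listed GPU
total = comm.reduce_add(copies)              # element-wise sum of the replicas
chunks = comm.scatter(t, devices=[0, 1])     # split t along dim 0 across GPUs
back = comm.gather(chunks, destination=0)    # reassemble the chunks on GPU 0
```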

Streams and events

Stream

Wrapper around a CUDA stream.

Event

Wrapper around a CUDA event.
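
For example, a sketch of timing a kernel with a pair of events:

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()                                  # stamp the current stream
torch.mm(torch.randn(1024, 1024, device="cuda"),
         torch.randn(1024, 1024, device="cuda"))
end.record()

torch.cuda.synchronize()                        # wait until both events complete
print(start.elapsed_time(end))                  # elapsed milliseconds
```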

Graphs (beta)

graph_pool_handle

Returns an opaque token representing the id of a graph memory pool.

CUDAGraph

Wrapper around a CUDA graph.

graph

Context-manager that captures CUDA work into a torch.cuda.CUDAGraph object for later replay.

make_graphed_callables

Accepts callables (functions or nn.Modules) and returns graphed versions.
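
A hedged sketch of the capture-and-replay flow; the side-stream warmup follows the pattern the beta capture API expects:

```python
import torch

static_in = torch.randn(32, 64, device="cuda")

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = static_in * 2
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):                 # capture, don't execute
    static_out = static_in * 2

static_in.copy_(torch.randn(32, 64, device="cuda"))
g.replay()                                # rerun the captured kernels on new data
print(static_out[0, 0])
```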

Memory management

empty_cache

Releases all unoccupied cached memory currently held by the caching allocator so that it can be used by other GPU applications and becomes visible in nvidia-smi.

list_gpu_processes

Returns a human-readable printout of the running processes and their GPU memory use for a given device.

memory_stats

Returns a dictionary of CUDA memory allocator statistics for a given device.

memory_summary

Returns a human-readable printout of the current memory allocator statistics for a given device.

memory_snapshot

Returns a snapshot of the CUDA memory allocator state across all devices.

memory_allocated

Returns the current GPU memory occupied by tensors in bytes for a given device.

max_memory_allocated

Returns the maximum GPU memory occupied by tensors in bytes for a given device.

reset_max_memory_allocated

Resets the starting point in tracking maximum GPU memory occupied by tensors for a given device.

memory_reserved

Returns the current GPU memory managed by the caching allocator in bytes for a given device.

max_memory_reserved

Returns the maximum GPU memory managed by the caching allocator in bytes for a given device.

set_per_process_memory_fraction

Sets the memory fraction for a process.

memory_cached

Deprecated; see memory_reserved().

max_memory_cached

Deprecated; see max_memory_reserved().

reset_max_memory_cached

Resets the starting point in tracking maximum GPU memory managed by the caching allocator for a given device.

reset_peak_memory_stats

Resets the “peak” stats tracked by the CUDA memory allocator.
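
A small sketch of inspecting allocator state with the helpers above:

```python
import torch

x = torch.empty(1024, 1024, device="cuda")   # ~4 MiB of float32
print(torch.cuda.memory_allocated())         # bytes currently held by tensors
print(torch.cuda.memory_reserved())          # bytes held by the caching allocator
del x
torch.cuda.empty_cache()                     # return unoccupied cached blocks
torch.cuda.reset_peak_memory_stats()         # restart peak tracking
print(torch.cuda.max_memory_allocated())     # peak since the reset
```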

NVIDIA Tools Extension (NVTX)

nvtx.mark

Describe an instantaneous event that occurred at some point.

nvtx.range_push

Pushes a range onto a stack of nested range spans.

nvtx.range_pop

Pops a range off of a stack of nested range spans.
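
A minimal sketch of annotating a region for profilers such as Nsight Systems:

```python
import torch

torch.cuda.nvtx.mark("step start")       # instantaneous marker
torch.cuda.nvtx.range_push("forward")    # open a named, nestable range
y = torch.relu(torch.randn(8, device="cuda"))
torch.cuda.nvtx.range_pop()              # close the innermost open range
```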
