torch.cuda

This package adds support for CUDA tensor types, which implement the same functions as CPU tensors but use GPUs for computation.

It is lazily initialized, so you can always import it and use is_available() to determine whether your system supports CUDA.

CUDA semantics has more details about working with CUDA.
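
As a minimal sketch of the lazy-initialization pattern described above, the snippet below checks is_available() before touching the GPU and falls back to a CPU tensor otherwise. The variable names are illustrative only.

```python
import torch

# Importing torch never fails on a CPU-only machine, so check before using CUDA.
if torch.cuda.is_available():
    device_index = torch.cuda.current_device()
    x = torch.ones(3).cuda(device_index)  # copy the tensor to the current GPU
else:
    device_index = None
    x = torch.ones(3)                     # stay on the CPU

print(x.is_cuda, device_index)
```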

torch.cuda.current_blas_handle()[source]

Returns a cublasHandle_t pointer to the current cuBLAS handle.

torch.cuda.current_device()[source]

Returns the index of the currently selected device.

torch.cuda.current_stream()[source]

Returns the currently selected Stream.

class torch.cuda.device(idx)[source]

Context-manager that changes the selected device.

Parameters:idx (int) – device index to select. It’s a no-op if this argument is negative.
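
A possible use of the device context manager is sketched below. The helper ones_on is hypothetical and not part of the package; the example assumes that .cuda() with no argument allocates on the currently selected device.

```python
import torch

def ones_on(idx, n=4):
    """Create a tensor of ones on GPU `idx` (hypothetical helper)."""
    with torch.cuda.device(idx):
        # Inside the block, .cuda() with no argument targets GPU `idx`.
        return torch.ones(n).cuda()

placed_on = None
if torch.cuda.is_available():
    before = torch.cuda.current_device()
    last = torch.cuda.device_count() - 1
    placed_on = ones_on(last).get_device()
    # The previously selected device is restored when the block exits.
    assert torch.cuda.current_device() == before
```
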
torch.cuda.device_count()[source]

Returns the number of GPUs available.

torch.cuda.device_ctx_manager

alias of device

class torch.cuda.device_of(obj)[source]

Context-manager that changes the current device to that of the given object.

You can use both tensors and storages as arguments. If a given object is not allocated on a GPU, this is a no-op.

Parameters:obj (Tensor or Storage) – object allocated on the selected device.
torch.cuda.empty_cache()[source]

Releases all unoccupied cached memory currently held by the caching allocator so that it can be used by other GPU applications and is visible in nvidia-smi.

torch.cuda.get_device_capability(device)[source]

Gets the CUDA capability of a device.

Parameters:device (int) – device for which to return the capability. This function is a no-op if this argument is negative.
Returns:the major and minor CUDA capability of the device
Return type:tuple(int, int)
torch.cuda.get_device_name(device)[source]

Gets the name of a device.

Parameters:device (int) – device for which to return the name. This function is a no-op if this argument is negative.
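
The query functions above can be combined to list every visible GPU. This is only a sketch; describe_devices is a hypothetical helper name, and it returns an empty list on machines without CUDA.

```python
import torch

def describe_devices():
    """Return (index, name, (major, minor)) for every visible GPU."""
    if not torch.cuda.is_available():
        return []
    return [
        (i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]

for index, name, capability in describe_devices():
    print(index, name, capability)
```
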
torch.cuda.is_available()[source]

Returns a bool indicating if CUDA is currently available.

torch.cuda.set_device(device)[source]

Sets the current device.

Usage of this function is discouraged in favor of device. In most cases it’s better to use the CUDA_VISIBLE_DEVICES environment variable.

Parameters:device (int) – selected device. This function is a no-op if this argument is negative.
torch.cuda.stream(stream)[source]

Context-manager that selects a given stream.

All CUDA kernels queued within its context will be enqueued on the selected stream.

Parameters:stream (Stream) – selected stream. This manager is a no-op if it’s None.
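
The following sketch shows one way to run work on a side stream with this context manager. It assumes the input tensor was produced on the current stream, so the side stream is first told to wait for it; the variable names are illustrative.

```python
import torch

result = None
if torch.cuda.is_available():
    side_stream = torch.cuda.Stream()
    a = torch.ones(1000).cuda()  # produced on the current stream
    # Make the side stream wait for work already queued on the current stream.
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream):
        b = a * 2                # this kernel is enqueued on side_stream
    torch.cuda.synchronize()     # wait for all streams on the current device
    result = b.sum().item()
```
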
torch.cuda.synchronize()[source]

Waits for all kernels in all streams on the current device to complete.

Random Number Generator

torch.cuda.get_rng_state(device=-1)[source]

Returns the random number generator state of the current GPU as a ByteTensor.

Parameters:device (int, optional) – The device to return the RNG state of. Default: -1 (i.e., use the current device).

Warning

This function eagerly initializes CUDA.

torch.cuda.set_rng_state(new_state, device=-1)[source]

Sets the random number generator state of the current GPU.

Parameters:new_state (torch.ByteTensor) – The desired state.
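
As an illustration, get_rng_state() and set_rng_state() can be paired to replay a sequence of random numbers on the current GPU. This is a sketch under the assumption that torch.rand accepts a device argument in the installed version.

```python
import torch

samples = None
if torch.cuda.is_available():
    state = torch.cuda.get_rng_state()    # ByteTensor snapshot of the generator
    first = torch.rand(4, device="cuda")
    torch.cuda.set_rng_state(state)       # rewind the generator to the snapshot
    second = torch.rand(4, device="cuda")
    samples = (first, second)             # both draws hold the same values
```
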
torch.cuda.manual_seed(seed)[source]

Sets the seed for generating random numbers for the current GPU. It’s safe to call this function if CUDA is not available; in that case, it is silently ignored.

Parameters:seed (int or long) – The desired seed.

Warning

If you are working with a multi-GPU model, this function is insufficient to get determinism. To seed all GPUs, use manual_seed_all().

torch.cuda.manual_seed_all(seed)[source]

Sets the seed for generating random numbers on all GPUs. It’s safe to call this function if CUDA is not available; in that case, it is silently ignored.

Parameters:seed (int or long) – The desired seed.
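
A common pattern, sketched here, is to seed the CPU generator and all GPU generators together. seed_everything is a hypothetical helper, and the device argument of torch.rand is assumed to be available in the installed version.

```python
import torch

def seed_everything(seed):
    """Seed the CPU generator and every GPU generator (hypothetical helper)."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable

device = "cuda" if torch.cuda.is_available() else "cpu"

seed_everything(1234)
first = torch.rand(3, device=device)
seed_everything(1234)
second = torch.rand(3, device=device)

print(torch.equal(first, second))
```
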
torch.cuda.seed()[source]

Sets the seed for generating random numbers to a random number for the current GPU. It’s safe to call this function if CUDA is not available; in that case, it is silently ignored.

Warning

If you are working with a multi-GPU model, this function will only initialize the seed on one GPU. To initialize all GPUs, use seed_all().

torch.cuda.seed_all()[source]

Sets the seed for generating random numbers to a random number on all GPUs. It’s safe to call this function if CUDA is not available; in that case, it is silently ignored.

torch.cuda.initial_seed()[source]

Returns the current random seed of the current GPU.

Warning

This function eagerly initializes CUDA.

Communication collectives

torch.cuda.comm.broadcast(tensor, devices)[source]

Broadcasts a tensor to a number of GPUs.

Parameters:
  • tensor (Tensor) – tensor to broadcast.
  • devices (Iterable) – an iterable of devices among which to broadcast. Note that it should be like (src, dst1, dst2, …), the first element of which is the source device to broadcast from.
Returns:

A tuple containing copies of the tensor, placed on devices corresponding to indices from devices.

torch.cuda.comm.reduce_add(inputs, destination=None)[source]

Sums tensors from multiple GPUs.

All inputs should have matching shapes.

Parameters:
  • inputs (Iterable[Tensor]) – an iterable of tensors to add.
  • destination (int, optional) – a device on which the output will be placed (default: current device).
Returns:

A tensor containing an elementwise sum of all inputs, placed on the destination device.
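
The sketch below chains broadcast and reduce_add on two GPUs. It assumes at least two devices are visible and does nothing otherwise; the device indices 0 and 1 are illustrative.

```python
import torch
import torch.cuda.comm as comm

total = None
if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    devices = [0, 1]                       # source device first, then destinations
    src = torch.ones(4).cuda(devices[0])
    copies = comm.broadcast(src, devices)  # one copy per listed device
    # Sum the per-device copies elementwise and place the result on GPU 0.
    total = comm.reduce_add(copies, destination=devices[0])
```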

torch.cuda.comm.scatter(tensor, devices, chunk_sizes=None, dim=0, streams=None)[source]

Scatters tensor across multiple GPUs.

Parameters:
  • tensor (Tensor) – tensor to scatter.
  • devices (Iterable[int]) – iterable of ints, specifying among which devices the tensor should be scattered.
  • chunk_sizes (Iterable[int], optional) – sizes of chunks to be placed on each device. It should match devices in length and sum to tensor.size(dim). If not specified, the tensor will be divided into equal chunks.
  • dim (int, optional) – A dimension along which to chunk the tensor.
Returns:

A tuple containing chunks of the tensor, spread across given devices.

torch.cuda.comm.gather(tensors, dim=0, destination=None)[source]

Gathers tensors from multiple GPUs.

Tensor sizes in all dimensions other than dim have to match.

Parameters:
  • tensors (Iterable[Tensor]) – iterable of tensors to gather.
  • dim (int) – a dimension along which the tensors will be concatenated.
  • destination (int, optional) – output device (-1 means CPU, default: current device)
Returns:

A tensor located on the destination device that is the result of concatenating tensors along dim.
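
A scatter followed by a gather might look like the sketch below. It assumes two visible GPUs and a batch whose first dimension divides evenly between them; the per-chunk doubling stands in for real work.

```python
import torch
import torch.cuda.comm as comm

batch = torch.arange(8).float()           # lives on the CPU
merged = None
if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    devices = [0, 1]
    chunks = comm.scatter(batch, devices)  # equal chunks along dim 0
    processed = [chunk * 2 for chunk in chunks]  # each runs on its own GPU
    # Concatenate the chunks along dim 0 on the first device.
    merged = comm.gather(processed, dim=0, destination=devices[0])
```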

Streams and events

class torch.cuda.Stream[source]

Wrapper around a CUDA stream.

Parameters:
  • device (int, optional) – a device on which to allocate the Stream.
  • priority (int, optional) – priority of the stream. Lower numbers represent higher priorities.
query()[source]

Checks if all the work submitted has been completed.

Returns:A boolean indicating if all kernels in this stream are completed.
record_event(event=None)[source]

Records an event.

Parameters:event (Event, optional) – event to record. If not given, a new one will be allocated.
Returns:Recorded event.
synchronize()[source]

Waits for all the kernels in this stream to complete.

wait_event(event)[source]

Makes all future work submitted to the stream wait for an event.

Parameters:event (Event) – an event to wait for.
wait_stream(stream)[source]

Synchronizes with another stream.

All future work submitted to this stream will wait until all kernels submitted to the given stream at the time of the call have completed.

Parameters:stream (Stream) – a stream to synchronize.
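
To illustrate how the Stream methods fit together, the sketch below orders work between two streams with record_event() and wait_event(). The stream names producer and consumer are illustrative, and the tensor is kept alive for the whole example so that its memory is not reused.

```python
import torch

done = None
value = None
if torch.cuda.is_available():
    producer = torch.cuda.Stream()
    consumer = torch.cuda.Stream()
    with torch.cuda.stream(producer):
        data = torch.ones(1024, device="cuda") * 3
        ready = producer.record_event()  # marks the point after `data` is computed
    consumer.wait_event(ready)           # later consumer work waits for `ready`
    with torch.cuda.stream(consumer):
        total = data.sum()
    consumer.synchronize()               # block until the consumer stream is idle
    done = consumer.query()              # all submitted work has completed
    value = total.item()
```
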
class torch.cuda.Event(enable_timing=False, blocking=False, interprocess=False, _handle=None)[source]

Wrapper around a CUDA event.

Parameters:
  • enable_timing (bool) – indicates if the event should measure time (default: False)
  • blocking (bool) – if True, wait() will be blocking (default: False)
  • interprocess (bool) – if True, the event can be shared between processes (default: False)
elapsed_time(end_event)[source]

Returns the time elapsed, in milliseconds, between the recording of this event and the recording of end_event.

ipc_handle()[source]

Returns an IPC handle of this event.

query()[source]

Checks if the event has been recorded.

Returns:A boolean indicating if the event has been recorded.
record(stream=None)[source]

Records the event in a given stream.

synchronize()[source]

Synchronizes with the event.

wait(stream=None)[source]

Makes a given stream wait for the event.
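
Events created with enable_timing=True can be used to time GPU work, as in the sketch below. It assumes a CUDA device is present; the matrix size is arbitrary, and elapsed_time() is taken to report milliseconds, as the underlying CUDA event API does.

```python
import torch

elapsed_ms = None
if torch.cuda.is_available():
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    x = torch.rand(1024, 1024, device="cuda")
    start.record()                        # recorded on the current stream
    y = torch.mm(x, x)
    end.record()
    end.synchronize()                     # block until `end` has been reached
    elapsed_ms = start.elapsed_time(end)  # time between the two records
```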

Memory management

torch.cuda.empty_cache()[source]

Releases all unoccupied cached memory currently held by the caching allocator so that it can be used by other GPU applications and is visible in nvidia-smi.
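
A minimal sketch of when empty_cache() might be called: after dropping the last reference to a large tensor, its memory stays in the caching allocator until it is released explicitly. release_cached_memory is a hypothetical helper.

```python
import torch

def release_cached_memory():
    """Free a temporary GPU buffer and return cached blocks to the driver."""
    if not torch.cuda.is_available():
        return False
    big = torch.zeros(1024, 1024, device="cuda")
    del big                   # memory goes back to the cache, not to the driver
    torch.cuda.empty_cache()  # hand unoccupied cached blocks back to the driver
    return True

released = release_cached_memory()
```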

NVIDIA Tools Extension (NVTX)

torch.cuda.nvtx.mark(msg)[source]

Describes an instantaneous event that occurred at some point.

Parameters:msg (string) – ASCII message to associate with the event.
torch.cuda.nvtx.range_push(msg)[source]

Pushes a range onto a stack of nested range spans. Returns the zero-based depth of the range that is started.

Parameters:msg (string) – ASCII message to associate with the range.
torch.cuda.nvtx.range_pop()[source]

Pops a range off of a stack of nested range spans. Returns the zero-based depth of the range that is ended.
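
Nested ranges and marks could be used to annotate a step for a profiler, as sketched below. profiled_step is a hypothetical function, the messages are arbitrary ASCII labels, and the example assumes a CUDA build in which the NVTX functions are available.

```python
import torch

def profiled_step(x):
    """Multiply x by itself inside nested NVTX ranges (hypothetical helper)."""
    torch.cuda.nvtx.range_push("step")
    torch.cuda.nvtx.range_push("matmul")
    y = torch.mm(x, x)
    torch.cuda.nvtx.range_pop()          # closes "matmul"
    torch.cuda.nvtx.mark("matmul done")  # instantaneous marker
    torch.cuda.nvtx.range_pop()          # closes "step"
    return y

out = None
if torch.cuda.is_available():
    out = profiled_step(torch.eye(8, device="cuda"))
```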