.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated_examples/basic_cuda_example.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_examples_basic_cuda_example.py:


Accelerated video decoding on GPUs with CUDA and NVDEC
================================================================

TorchCodec can use supported Nvidia hardware (see the support matrix
`here <https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new>`_) to speed up
video decoding. This is called "CUDA Decoding". It uses Nvidia's
`NVDEC hardware decoder <https://developer.nvidia.com/video-codec-sdk>`_ to decompress the video
and CUDA kernels to convert the decoded frames to RGB.
CUDA Decoding can be faster than CPU Decoding for the actual decoding step and also for
subsequent transform steps like scaling, cropping or rotating. This is because the decode step leaves
the decoded tensor in GPU memory, so the GPU doesn't have to fetch it from main memory before
running the transform steps. Encoded packets are often much smaller than decoded frames, so
CUDA Decoding also uses less PCI-e bandwidth than decoding on the CPU and then copying the
frames to the GPU.

When to and when not to use CUDA Decoding
-----------------------------------------

CUDA Decoding can offer a speed-up over CPU Decoding in a few scenarios:

#. You are decoding a high-resolution video
#. You are decoding a large batch of videos that's saturating the CPU
#. You want to do whole-image transforms like scaling or convolutions on the decoded tensors
   after decoding
#. Your CPU is saturated and you want to free it up for other work


Here are situations where CUDA Decoding may not make sense:

#. You want bit-exact results compared to CPU Decoding
#. You have low-resolution videos and the PCI-e transfer latency is large
#. Your GPU is already busy and your CPU is not

It's best to experiment with CUDA Decoding to see if it improves your use-case.
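
One way to run that experiment is to time the same decoding workload on each
device. The sketch below is a minimal example under stated assumptions:
``time_call``, ``compare_devices`` and ``make_decode_workload`` are hypothetical
helper names introduced here for illustration and are not part of TorchCodec.
It assumes the ``VideoDecoder(..., device=...)`` and ``get_frames_played_at``
calls used later in this tutorial, and it calls ``torch.cuda.synchronize()`` as
a precaution so that any pending GPU work is included in the measurement.

```python
import time
from statistics import median
from typing import Callable, Dict, Sequence


def time_call(fn: Callable[[], object], n_runs: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock time, in seconds, of calling ``fn()``."""
    # Warm-up runs absorb one-time costs (library loading, CUDA context creation).
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return median(timings)


def compare_devices(
    workloads: Dict[str, Callable[[], object]], n_runs: int = 5
) -> Dict[str, float]:
    """Time each named workload and return {name: median seconds}."""
    return {name: time_call(fn, n_runs=n_runs) for name, fn in workloads.items()}


def make_decode_workload(
    path: str, device: str, timestamps: Sequence[float]
) -> Callable[[], object]:
    """Build a zero-argument callable that decodes ``timestamps`` on ``device``."""

    def run() -> object:
        # Imported lazily so the timing helpers above work without torchcodec.
        import torch
        from torchcodec.decoders import VideoDecoder

        # Decoder construction is deliberately part of the timed region here.
        decoder = VideoDecoder(path, device=device)
        frames = decoder.get_frames_played_at(list(timestamps)).data
        if frames.is_cuda:
            # Assumption: wait for outstanding GPU work before the timer stops.
            torch.cuda.synchronize()
        return frames

    return run


# Example usage (assumes "video.mp4" exists and a CUDA-enabled TorchCodec build):
# results = compare_devices({
#     "cpu": make_decode_workload("video.mp4", "cpu", [12, 19, 45]),
#     "cuda": make_decode_workload("video.mp4", "cuda", [12, 19, 45]),
# })
# print(results)
```

Results depend heavily on the video, the GPU, and how busy each device is, so
treat the numbers as a starting point for your own workload rather than a verdict.
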
With TorchCodec, you can pass a ``device`` parameter to the
:class:`~torchcodec.decoders.VideoDecoder` class to use CUDA Decoding.

Installing TorchCodec with CUDA Enabled
---------------------------------------

Refer to the installation guide in the `README <https://github.com/pytorch/torchcodec#installing-cuda-enabled-torchcodec>`_.

.. GENERATED FROM PYTHON SOURCE LINES 51-59

Checking if PyTorch has CUDA enabled
-------------------------------------

.. note::

   This tutorial requires FFmpeg libraries compiled with CUDA support.

.. GENERATED FROM PYTHON SOURCE LINES 59-66

.. code-block:: Python


    import torch

    print(f"{torch.__version__=}")
    print(f"{torch.cuda.is_available()=}")
    print(f"{torch.cuda.get_device_properties(0)=}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    torch.__version__='2.7.0.dev20250205+cu126'
    torch.cuda.is_available()=True
    torch.cuda.get_device_properties(0)=_CudaDeviceProperties(name='Tesla M60', major=5, minor=2, total_memory=7606MB, multi_processor_count=16, uuid=de4a564a-4a37-62ed-10ac-6e1818c37313, L2_cache_size=2MB)


.. GENERATED FROM PYTHON SOURCE LINES 67-82

Downloading the video
-------------------------------------

We will use the video below, which has the following properties:

- Codec: H.264
- Resolution: 960x540
- FPS: 29.97
- Pixel format: YUV420P

.. raw:: html

   <video style="max-width: 100%" controls>
     <source src="https://download.pytorch.org/torchaudio/tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4_small.mp4" type="video/mp4">
   </video>

.. GENERATED FROM PYTHON SOURCE LINES 82-91

.. code-block:: Python


    import urllib.request

    video_file = "video.mp4"
    urllib.request.urlretrieve(
        "https://download.pytorch.org/torchaudio/tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4_small.mp4",
        video_file,
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ('video.mp4', <http.client.HTTPMessage object at 0x7f47267d9700>)


.. GENERATED FROM PYTHON SOURCE LINES 92-97

CUDA Decoding using VideoDecoder
-------------------------------------

To use the CUDA decoder, pass a CUDA device to the decoder.

.. GENERATED FROM PYTHON SOURCE LINES 97-102

.. code-block:: Python


    from torchcodec.decoders import VideoDecoder

    decoder = VideoDecoder(video_file, device="cuda")
    frame = decoder[0]


.. GENERATED FROM PYTHON SOURCE LINES 103-104

Each decoded frame is returned as a ``uint8`` tensor in CHW format; batches of
frames are returned in NCHW format.

.. GENERATED FROM PYTHON SOURCE LINES 105-108

.. code-block:: Python


    print(frame.shape, frame.dtype)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    torch.Size([3, 540, 960]) torch.uint8


.. GENERATED FROM PYTHON SOURCE LINES 109-110

The decoded frames stay in GPU memory.

.. GENERATED FROM PYTHON SOURCE LINES 111-115

.. code-block:: Python


    print(frame.data.device)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    cuda:0


.. GENERATED FROM PYTHON SOURCE LINES 116-121

Visualizing Frames
-------------------------------------

Let's look at the frames decoded by the CUDA decoder and compare them
against the equivalent results from the CPU decoder.

.. GENERATED FROM PYTHON SOURCE LINES 121-149

.. code-block:: Python


    timestamps = [12, 19, 45, 131, 180]
    cpu_decoder = VideoDecoder(video_file, device="cpu")
    cuda_decoder = VideoDecoder(video_file, device="cuda")
    cpu_frames = cpu_decoder.get_frames_played_at(timestamps).data
    cuda_frames = cuda_decoder.get_frames_played_at(timestamps).data


    def plot_cpu_and_cuda_frames(cpu_frames: torch.Tensor, cuda_frames: torch.Tensor):
        try:
            import matplotlib.pyplot as plt
            from torchvision.transforms.v2.functional import to_pil_image
        except ImportError:
            print("Cannot plot, please run `pip install torchvision matplotlib`")
            return
        n_rows = len(timestamps)
        fig, axes = plt.subplots(n_rows, 2, figsize=[12.8, 16.0])
        for i in range(n_rows):
            axes[i][0].imshow(to_pil_image(cpu_frames[i].to("cpu")))
            axes[i][1].imshow(to_pil_image(cuda_frames[i].to("cpu")))

        axes[0][0].set_title("CPU decoder", fontsize=24)
        axes[0][1].set_title("CUDA decoder", fontsize=24)
        plt.setp(axes, xticks=[], yticks=[])
        plt.tight_layout()


    plot_cpu_and_cuda_frames(cpu_frames, cuda_frames)


.. image-sg:: /generated_examples/images/sphx_glr_basic_cuda_example_001.png
   :alt: CPU decoder, CUDA decoder
   :srcset: /generated_examples/images/sphx_glr_basic_cuda_example_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 150-153

The frames look similar to the human eye, but there may be subtle
differences because CUDA math is not bit-exact with respect to CPU math.

.. GENERATED FROM PYTHON SOURCE LINES 154-162

.. code-block:: Python


    frames_equal = torch.equal(cpu_frames.to("cuda"), cuda_frames)
    mean_abs_diff = torch.mean(
        torch.abs(cpu_frames.float().to("cuda") - cuda_frames.float())
    )
    max_abs_diff = torch.max(torch.abs(cpu_frames.to("cuda").float() - cuda_frames.float()))
    print(f"{frames_equal=}")
    print(f"{mean_abs_diff=}")
    print(f"{max_abs_diff=}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    frames_equal=False
    mean_abs_diff=tensor(0.5636, device='cuda:0')
    max_abs_diff=tensor(2., device='cuda:0')


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 6.910 seconds)


.. _sphx_glr_download_generated_examples_basic_cuda_example.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: basic_cuda_example.ipynb <basic_cuda_example.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: basic_cuda_example.py <basic_cuda_example.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: basic_cuda_example.zip <basic_cuda_example.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_