How to sample video clips

In this example, we’ll learn how to sample video clips from a video. A clip generally denotes a sequence or batch of frames, and is typically passed as input to video models.

First, a bit of boilerplate: we’ll download a video from the web, and define a plotting utility. You can ignore that part and jump right below to Creating a decoder.

from typing import Optional
import torch
import requests


# Video source: https://www.pexels.com/video/dog-eating-854132/
# License: CC0. Author: Coverr.
url = "https://videos.pexels.com/video-files/854132/854132-sd_640_360_25fps.mp4"
response = requests.get(url, headers={"User-Agent": ""})
if response.status_code != 200:
    raise RuntimeError(f"Failed to download video. {response.status_code = }.")

raw_video_bytes = response.content


def plot(frames: torch.Tensor, title: Optional[str] = None):
    try:
        from torchvision.utils import make_grid
        from torchvision.transforms.v2.functional import to_pil_image
        import matplotlib.pyplot as plt
    except ImportError:
        print("Cannot plot, please run `pip install torchvision matplotlib`")
        return

    plt.rcParams["savefig.bbox"] = 'tight'
    fig, ax = plt.subplots()
    ax.imshow(to_pil_image(make_grid(frames)))
    ax.set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])
    if title is not None:
        ax.set_title(title)
    plt.tight_layout()

Creating a decoder

Sampling clips from a video always starts by creating a VideoDecoder object. If you’re not already familiar with VideoDecoder, take a quick look at: Decoding a video with VideoDecoder.

from torchcodec.decoders import VideoDecoder

# You can also pass a path to a local file!
decoder = VideoDecoder(raw_video_bytes)
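
As the comment above notes, VideoDecoder also accepts a path to a file on disk. A minimal sketch, using a hypothetical local path:

from pathlib import Path

# Hypothetical path, not part of this tutorial's assets.
local_path = Path("path/to/video.mp4")
if local_path.exists():
    decoder_from_file = VideoDecoder(local_path)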

Sampling basics

We can now use our decoder to sample clips. Let’s first look at a simple example: all other samplers follow similar APIs and principles. We’ll use clips_at_random_indices() to sample clips that start at random indices.

from torchcodec.samplers import clips_at_random_indices

# The sampler's RNG is controlled by pytorch's RNG. We set a seed for this
# tutorial to be reproducible across runs, but note that hard-coding a seed for
# a training run is generally not recommended.
torch.manual_seed(0)

clips = clips_at_random_indices(
    decoder,
    num_clips=5,
    num_frames_per_clip=4,
    num_indices_between_frames=3,
)
clips
FrameBatch:
  data (shape): torch.Size([5, 4, 3, 360, 640])
  pts_seconds: tensor([[11.3600, 11.4800, 11.6000, 11.7200],
        [10.2000, 10.3200, 10.4400, 10.5600],
        [ 9.8000,  9.9200, 10.0400, 10.1600],
        [ 9.6000,  9.7200,  9.8400,  9.9600],
        [ 8.4400,  8.5600,  8.6800,  8.8000]], dtype=torch.float64)
  duration_seconds: tensor([[0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400]], dtype=torch.float64)

The output of the sampler is a sequence of clips, represented as a FrameBatch object. It has the following fields, which we inspect in the snippet below:

  • data: a 5D uint8 tensor representing the frame data. Its shape is (num_clips, num_frames_per_clip, …) where … is either (C, H, W) or (H, W, C), depending on the dimension_order parameter of the VideoDecoder. This is typically what would get passed to the model.

  • pts_seconds: a 2D float tensor of shape (num_clips, num_frames_per_clip) giving the starting timestamps of each frame within each clip, in seconds.

  • duration_seconds: a 2D float tensor of shape (num_clips, num_frames_per_clip) giving the duration of each frame within each clip, in seconds.
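
To make these shapes concrete, we can print them for the clips we just sampled; the values match the FrameBatch output above:

print(clips.data.shape)              # torch.Size([5, 4, 3, 360, 640])
print(clips.pts_seconds.shape)       # torch.Size([5, 4])
print(clips.duration_seconds.shape)  # torch.Size([5, 4])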

plot(clips[0].data)
[Figure: grid of the four frames of the first sampled clip]

Indexing and manipulating clips

Clips are FrameBatch objects, and they support native pytorch indexing semantics (including fancy indexing). This makes it easy to filter clips based on a given criterion. For example, from the clips above we can easily keep only those that start after a specific timestamp:

clip_starts = clips.pts_seconds[:, 0]
clip_starts
tensor([11.3600, 10.2000,  9.8000,  9.6000,  8.4400], dtype=torch.float64)
clips_starting_after_five_seconds = clips[clip_starts > 5]
clips_starting_after_five_seconds
FrameBatch:
  data (shape): torch.Size([5, 4, 3, 360, 640])
  pts_seconds: tensor([[11.3600, 11.4800, 11.6000, 11.7200],
        [10.2000, 10.3200, 10.4400, 10.5600],
        [ 9.8000,  9.9200, 10.0400, 10.1600],
        [ 9.6000,  9.7200,  9.8400,  9.9600],
        [ 8.4400,  8.5600,  8.6800,  8.8000]], dtype=torch.float64)
  duration_seconds: tensor([[0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400]], dtype=torch.float64)
every_other_clip = clips[::2]
every_other_clip
FrameBatch:
  data (shape): torch.Size([3, 4, 3, 360, 640])
  pts_seconds: tensor([[11.3600, 11.4800, 11.6000, 11.7200],
        [ 9.8000,  9.9200, 10.0400, 10.1600],
        [ 8.4400,  8.5600,  8.6800,  8.8000]], dtype=torch.float64)
  duration_seconds: tensor([[0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400]], dtype=torch.float64)
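
Since FrameBatch supports fancy indexing, indexing with an integer tensor should work as well. A quick sketch reordering the clips sampled above:

# Select clips 3, 0 and 2, in that order, with an index tensor.
reordered = clips[torch.tensor([3, 0, 2])]
print(reordered.pts_seconds[:, 0])  # expected: tensor([9.6000, 11.3600, 9.8000], dtype=torch.float64)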

Note

A more natural and efficient way to get clips after a given timestamp is to rely on the sampling range parameters, which we’ll cover later in Advanced parameters: sampling range.

Index-based and Time-based samplers

So far we’ve used clips_at_random_indices(). TorchCodec supports additional samplers, which fall under two main categories:

Index-based samplers:

  • clips_at_random_indices()

  • clips_at_regular_indices()

Time-based samplers:

  • clips_at_random_timestamps()

  • clips_at_regular_timestamps()

All these samplers follow similar APIs, and the time-based samplers have parameters analogous to the index-based ones. Both sampler types generally offer comparable performance in terms of speed.
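
To illustrate the analogous parameters, here is a sketch comparing the two APIs with samplers already used in this tutorial. The only real difference is whether the spacing between frames is expressed in indices or in seconds (0.12s corresponds to 3 frame intervals at this video’s 25 fps):

from torchcodec.samplers import clips_at_random_indices, clips_at_random_timestamps

# Index-based: frame spacing expressed in frame indices.
clips_by_index = clips_at_random_indices(
    decoder,
    num_clips=2,
    num_frames_per_clip=4,
    num_indices_between_frames=3,
)

# Time-based: frame spacing expressed in seconds.
clips_by_time = clips_at_random_timestamps(
    decoder,
    num_clips=2,
    num_frames_per_clip=4,
    seconds_between_frames=0.12,
)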

Note

Is it better to use a time-based sampler or an index-based sampler? The index-based samplers have arguably slightly simpler APIs, and their behavior is possibly simpler to understand and control because of the discrete nature of indices. For videos with constant fps, an index-based sampler behaves exactly the same as a time-based one. For videos with variable fps however (as is often the case), relying on indices may under- or over-sample some regions of the video, which may lead to undesirable side effects when training a model. Using a time-based sampler ensures uniform sampling characteristics along the time dimension.
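
To help decide, the decoder metadata can tell you about the video’s frame rate. A minimal sketch; note that the average_fps field is an assumption here, as only end_stream_seconds is used elsewhere in this tutorial:

# Print all available metadata; average_fps is assumed to be among the fields.
print(decoder.metadata)
print(decoder.metadata.average_fps)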

Advanced parameters: sampling range

Sometimes, we may not want to sample clips from the entire video. We may only be interested in clips that start within a smaller interval. In samplers, the sampling_range_start and sampling_range_end parameters control the sampling range: they define where clips are allowed to start. There are two important things to keep in mind:

  • sampling_range_end is an open upper bound: clips may only start within [sampling_range_start, sampling_range_end).

  • Because these parameters define where a clip can start, clips may contain frames after sampling_range_end!

from torchcodec.samplers import clips_at_regular_timestamps

clips = clips_at_regular_timestamps(
    decoder,
    seconds_between_clip_starts=1,
    num_frames_per_clip=4,
    seconds_between_frames=0.5,
    sampling_range_start=2,
    sampling_range_end=5
)
clips
FrameBatch:
  data (shape): torch.Size([3, 4, 3, 360, 640])
  pts_seconds: tensor([[2.0000, 2.4800, 3.0000, 3.4800],
        [3.0000, 3.4800, 4.0000, 4.4800],
        [4.0000, 4.4800, 5.0000, 5.4800]], dtype=torch.float64)
  duration_seconds: tensor([[0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400],
        [0.0400, 0.0400, 0.0400, 0.0400]], dtype=torch.float64)
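
We can check both points above programmatically: every clip starts within [2, 5), yet the last clip contains frames at 5.0 and 5.48 seconds, past sampling_range_end:

clip_starts = clips.pts_seconds[:, 0]
# Clips may only start within [sampling_range_start, sampling_range_end)...
assert ((clip_starts >= 2) & (clip_starts < 5)).all()
# ...but frames within a clip may extend past sampling_range_end.
print(clips.pts_seconds.max())  # tensor(5.4800, dtype=torch.float64)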

Advanced parameters: policy

Depending on the length or duration of the video and on the sampling parameters, the sampler may try to sample frames beyond the end of the video. The policy parameter defines how such invalid frames should be replaced with valid frames.

from torchcodec.samplers import clips_at_random_timestamps

end_of_video = decoder.metadata.end_stream_seconds
print(f"{end_of_video = }")
end_of_video = 13.8
torch.manual_seed(0)
clips = clips_at_random_timestamps(
    decoder,
    num_clips=1,
    num_frames_per_clip=5,
    seconds_between_frames=0.4,
    sampling_range_start=end_of_video - 1,
    sampling_range_end=end_of_video,
    policy="repeat_last",
)
clips.pts_seconds
tensor([[13.2800, 13.6800, 13.6800, 13.6800, 13.6800]], dtype=torch.float64)

We see above that the end of the video is at 13.8s. The sampler tries to sample frames at timestamps [13.28, 13.68, 14.08, …], but 14.08 is an invalid timestamp, beyond the end of the video. With the “repeat_last” policy, which is the default, the sampler simply repeats the last frame at 13.68 seconds to construct the clip.
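
Because the trailing entries are repetitions of the frame at 13.68 seconds, their data should compare equal. A quick check:

# All of the last four entries should be the same decoded frame.
print(torch.equal(clips.data[0, 1], clips.data[0, 2]))  # expected: True
print(torch.equal(clips.data[0, 2], clips.data[0, 3]))  # expected: True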

An alternative policy is “wrap”: the sampler then wraps around the clip and repeats the first few valid frames as necessary:

torch.manual_seed(0)
clips = clips_at_random_timestamps(
    decoder,
    num_clips=1,
    num_frames_per_clip=5,
    seconds_between_frames=0.4,
    sampling_range_start=end_of_video - 1,
    sampling_range_end=end_of_video,
    policy="wrap",
)
clips.pts_seconds
tensor([[13.2800, 13.6800, 13.2800, 13.6800, 13.2800]], dtype=torch.float64)

By default, the value of sampling_range_end is automatically set such that the sampler doesn’t try to sample frames beyond the end of the video: the default value ensures that clips start early enough before the end. This means that by default, the policy parameter rarely comes into action, and most users probably don’t need to worry too much about it.
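
A quick sketch of what this means in practice: the same call as in the “repeat_last” example above, but without an explicit sampling range, should yield clips with distinct timestamps, since clip starts are constrained to leave room for a full clip:

torch.manual_seed(0)
clips = clips_at_random_timestamps(
    decoder,
    num_clips=1,
    num_frames_per_clip=5,
    seconds_between_frames=0.4,
)
print(clips.pts_seconds)  # five distinct timestamps, no repeats expected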
