Exact vs Approximate seek mode: Performance and accuracy comparison

In this example, we will describe the seek_mode parameter of the VideoDecoder class. This parameter offers a trade-off between the speed of the VideoDecoder creation, against the seeking accuracy of the retreived frames (i.e. in approximate mode, requesting the i’th frame may not necessarily return frame i).

First, a bit of boilerplate: we’ll download a short video from the web, and use the ffmpeg CLI to repeat it 100 times. We’ll end up with two videos: a short video of approximately 13s and a long one of about 20 mins. You can ignore that part and jump right below to Performance: VideoDecoder creation.

import torch
import requests
import tempfile
from pathlib import Path
import shutil
import subprocess
from time import perf_counter_ns


# Video source: https://www.pexels.com/video/dog-eating-854132/
# License: CC0. Author: Coverr.
url = "https://videos.pexels.com/video-files/854132/854132-sd_640_360_25fps.mp4"
response = requests.get(url, headers={"User-Agent": ""})
if response.status_code != 200:
    raise RuntimeError(f"Failed to download video. {response.status_code = }.")

temp_dir = tempfile.mkdtemp()
short_video_path = Path(temp_dir) / "short_video.mp4"
with open(short_video_path, 'wb') as f:
    for chunk in response.iter_content():
        f.write(chunk)

long_video_path = Path(temp_dir) / "long_video.mp4"
ffmpeg_command = [
    "ffmpeg",
    "-stream_loop", "99",  # repeat video 100 times
    "-i", f"{short_video_path}",
    "-c", "copy",
    f"{long_video_path}"
]
subprocess.run(ffmpeg_command, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

from torchcodec.decoders import VideoDecoder
print(f"Short video duration: {VideoDecoder(short_video_path).metadata.duration_seconds} seconds")
print(f"Long video duration: {VideoDecoder(long_video_path).metadata.duration_seconds / 60} minutes")

Short video duration: 13.8 seconds
Long video duration: 23.0 minutes

Performance: `VideoDecoder` creation

In terms of performance, the seek_mode parameter ultimately affects the creation of a VideoDecoder object. The longer the video, the higher the performance gain.

def bench(f, average_over=50, warmup=2, **f_kwargs):

    for _ in range(warmup):
        f(**f_kwargs)

    times = []
    for _ in range(average_over):
        start = perf_counter_ns()
        f(**f_kwargs)
        end = perf_counter_ns()
        times.append(end - start)

    times = torch.tensor(times) * 1e-6  # ns to ms
    std = times.std().item()
    med = times.median().item()
    print(f"{med = :.2f}ms +- {std:.2f}")


print("Creating a VideoDecoder object with seek_mode='exact' on a short video:")
bench(VideoDecoder, source=short_video_path, seek_mode="exact")
print("Creating a VideoDecoder object with seek_mode='approximate' on a short video:")
bench(VideoDecoder, source=short_video_path, seek_mode="approximate")
print()
print("Creating a VideoDecoder object with seek_mode='exact' on a long video:")
bench(VideoDecoder, source=long_video_path, seek_mode="exact")
print("Creating a VideoDecoder object with seek_mode='approximate' on a long video:")
bench(VideoDecoder, source=long_video_path, seek_mode="approximate")

Creating a VideoDecoder object with seek_mode='exact' on a short video:
med = 8.04ms +- 0.03
Creating a VideoDecoder object with seek_mode='approximate' on a short video:
med = 7.09ms +- 0.10

Creating a VideoDecoder object with seek_mode='exact' on a long video:
med = 114.68ms +- 0.73
Creating a VideoDecoder object with seek_mode='approximate' on a long video:
med = 10.52ms +- 0.03

Performance: frame decoding and clip sampling

Strictly speaking the seek_mode parameter only affects the performance of the VideoDecoder creation. It does not have a direct effect on the performance of frame decoding or sampling. However, because frame decoding and sampling patterns typically involve the creation of the VideoDecoder (one per video), seek_mode may very well end up affecting the performance of decoding and samplers. For example:

from torchcodec import samplers


def sample_clips(seek_mode):
    return samplers.clips_at_random_indices(
        decoder=VideoDecoder(
            source=long_video_path,
            seek_mode=seek_mode
        ),
        num_clips=5,
        num_frames_per_clip=2,
    )


print("Sampling clips with seek_mode='exact':")
bench(sample_clips, seek_mode="exact")
print("Sampling clips with seek_mode='approximate':")
bench(sample_clips, seek_mode="approximate")

Sampling clips with seek_mode='exact':
med = 299.06ms +- 32.15
Sampling clips with seek_mode='approximate':
med = 183.01ms +- 44.95

Accuracy: Metadata and frame retrieval

We’ve seen that using seek_mode="approximate" can significantly speed up the VideoDecoder creation. The price to pay for that is that seeking won’t always be as accurate as with seek_mode="exact". It can also affect the exactness of the metadata.

However, in a lot of cases, you’ll find that there will be no accuracy difference between the two modes, which means that seek_mode="approximate" is a net win:

print("Metadata of short video with seek_mode='exact':")
print(VideoDecoder(short_video_path, seek_mode="exact").metadata)
print("Metadata of short video with seek_mode='approximate':")
print(VideoDecoder(short_video_path, seek_mode="approximate").metadata)

exact_decoder = VideoDecoder(short_video_path, seek_mode="exact")
approx_decoder = VideoDecoder(short_video_path, seek_mode="approximate")
for i in range(len(exact_decoder)):
    torch.testing.assert_close(
        exact_decoder.get_frame_at(i).data,
        approx_decoder.get_frame_at(i).data,
        atol=0, rtol=0,
    )
print("Frame seeking is the same for this video!")

Metadata of short video with seek_mode='exact':
VideoStreamMetadata:
  num_frames: 345
  duration_seconds: 13.8
  average_fps: 25.0
  duration_seconds_from_header: 13.8
  bit_rate: 505790.0
  num_frames_from_header: 345
  num_frames_from_content: 345
  begin_stream_seconds_from_content: 0.0
  end_stream_seconds_from_content: 13.8
  codec: h264
  width: 640
  height: 360
  average_fps_from_header: 25.0
  stream_index: 0

Metadata of short video with seek_mode='approximate':
VideoStreamMetadata:
  num_frames: 345
  duration_seconds: 13.8
  average_fps: 25.0
  duration_seconds_from_header: 13.8
  bit_rate: 505790.0
  num_frames_from_header: 345
  num_frames_from_content: None
  begin_stream_seconds_from_content: None
  end_stream_seconds_from_content: None
  codec: h264
  width: 640
  height: 360
  average_fps_from_header: 25.0
  stream_index: 0

Frame seeking is the same for this video!

What is this doing under the hood?

With seek_mode="exact", the VideoDecoder performs a scan when it is instantiated. The scan doesn’t involve decoding, but processes an entire file to infer more accurate metadata (like duration), and also builds an internal index of frames and key-frames. This internal index is potentially more accurate than the one in the file’s headers, which leads to more accurate seeking behavior. Without the scan, TorchCodec relies only on the metadata contained in the file, which may not always be as accurate.

Which mode should I use?

The general rule of thumb is as follows:

If you really care about exactness of frame seeking, use “exact”.
If you can sacrifice exactness of seeking for speed, which is usually the case when doing clip sampling, use “approximate”.
If your videos don’t have variable framerate and their metadata is correct, then “approximate” mode is a net win: it will be just as accurate as the “exact” mode while still being significantly faster.

shutil.rmtree(temp_dir)

Total running time of the script: (0 minutes 35.689 seconds)

Gallery generated by Sphinx-Gallery

Exact vs Approximate seek mode: Performance and accuracy comparison

Performance: `VideoDecoder` creation

Performance: frame decoding and clip sampling

Accuracy: Metadata and frame retrieval

What is this doing under the hood?

Which mode should I use?

Docs

Tutorials

Resources