torchaudio

I/O

The torchaudio top-level module provides the following functions for handling audio data.

Under the hood, these functions are implemented using various decoding/encoding libraries. There are currently three backends:

  • FFmpeg

  • libsox

  • SoundFile

The libsox backend was the first backend implemented in TorchAudio, and it works on Linux and macOS. The SoundFile backend was added to extend audio I/O support to Windows. It also works on Linux and macOS. The FFmpeg backend is the latest addition and supports a wide range of audio and video formats and protocols. It works on Linux, macOS and Windows.

Introduction of Dispatcher

Conventionally, torchaudio has had its IO backend set globally at runtime based on availability. However, this approach does not allow applications to use different backends, and it is not well-suited for large codebases.

For these reasons, we are introducing a dispatcher, a new mechanism that lets users choose a backend for each function call, and migrating the I/O functions to it. The migration involves several changes, some of which break backward compatibility and require users to update their function calls.

The (planned) changes are as follows. For up-to-date information, please refer to https://github.com/pytorch/audio/issues/2950

  • In 2.0, the audio I/O backend dispatcher was introduced. Users can opt in to the dispatcher by setting the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=1.

  • In 2.1, the dispatcher becomes the default mechanism for I/O. Those who need to keep using the previous mechanism (global backend) can do so by setting TORCHAUDIO_USE_BACKEND_DISPATCHER=0.
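As a minimal sketch of the opt-in, the variable can also be set from Python rather than from the shell. This assumes torchaudio reads the variable when it is first imported, which the text above does not state explicitly. The helper name use_backend_dispatcher is hypothetical.

```python
import os


def use_backend_dispatcher(enabled: bool = True) -> None:
    """Set TORCHAUDIO_USE_BACKEND_DISPATCHER to "1" (on) or "0" (off)."""
    os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "1" if enabled else "0"


# Assumption: this must run before the first `import torchaudio` in the process,
# because the variable is expected to be read at import time.
use_backend_dispatcher(True)
```

Setting the variable in the shell before launching Python avoids any dependence on import order.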

Furthermore, we are removing file-like object support from the libsox backend, as this is better supported by the FFmpeg backend and makes the build process simpler. Therefore, beginning with 2.1, FFmpeg and SoundFile are the only backends that support file-like objects.

With the changes in 2.1, the backend utilities are marked as deprecated.

Current API

I/O functionalities

Audio I/O functions are implemented in the torchaudio.backend module, but for ease of use, the following functions are also available on the torchaudio module. Several backends are available, and you can switch between them with set_audio_backend().

Please refer to torchaudio.backend for details, and to the Audio I/O tutorial for usage.

torchaudio.info

torchaudio.info(filepath: str, ...)

Fetch metadata of an audio file. Refer to torchaudio.backend for details.

torchaudio.load

torchaudio.load(filepath: str, ...)

Load an audio file into a torch.Tensor object. Refer to torchaudio.backend for details.

torchaudio.save

torchaudio.save(filepath: str, src: torch.Tensor, sample_rate: int, ...)

Save a torch.Tensor object to a file in an audio format. Refer to torchaudio.backend for details.

Backend Utilities

The following functions have an effect only when the backend dispatcher is disabled. They are effectively deprecated.

torchaudio.list_audio_backends() -> List[str]

List available backends

Returns:

The list of available backends.

Return type:

List[str]

torchaudio.get_audio_backend() -> Optional[str]

Get the name of the current backend

Returns:

The name of the current backend or None if no backend is assigned.

Return type:

Optional[str]

torchaudio.set_audio_backend(backend: Optional[str])

Set the backend for I/O operation

Parameters:

backend (str or None) – Name of the backend. One of "sox_io" or "soundfile", depending on what is available on the system. If None is provided, the current backend is unassigned.
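The three utilities above can be combined as in the following sketch, which only applies when the dispatcher is disabled. The helper names choose_backend and configure_global_backend and the preference order are illustrative assumptions, not part of torchaudio. torchaudio is imported lazily so the selection logic can be read and tried on its own.

```python
from typing import Optional, Sequence


def choose_backend(
    available: Sequence[str],
    preferred: Sequence[str] = ("sox_io", "soundfile"),
) -> Optional[str]:
    """Return the first preferred backend name that is available, else None."""
    for name in preferred:
        if name in available:
            return name
    return None


def configure_global_backend() -> Optional[str]:
    """Set the global backend (legacy mechanism) and return the active one."""
    import torchaudio  # imported lazily; requires torchaudio to be installed

    name = choose_backend(torchaudio.list_audio_backends())
    torchaudio.set_audio_backend(name)  # None leaves the backend unassigned
    return torchaudio.get_audio_backend()
```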

Future API

Dispatcher

The dispatcher tries the I/O backends in the following order of precedence:

  1. FFmpeg

  2. libsox

  3. soundfile

Pass the backend argument to an I/O function to override this order.

See Future API for details on the new API.

In the next release, each of torchaudio.info, torchaudio.load, and torchaudio.save will accept a backend parameter for selecting the backend to use. The functions will support FFmpeg, SoX, and SoundFile, provided that the corresponding library is installed. If a backend is not explicitly chosen, the functions will select one according to the order of precedence (FFmpeg, SoX, SoundFile) and library availability.

Note that only FFmpeg and SoundFile will support file-like objects.
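The selection rule described above can be sketched in plain Python. This is an illustration of the documented behavior, not torchaudio's actual implementation. The function name select_backend, the file_like flag, and the choice of exceptions are assumptions.

```python
from typing import Iterable, Optional

# Order of precedence stated in the text above.
PRECEDENCE = ("ffmpeg", "sox", "soundfile")
# Backends that support file-like objects, per the note above.
FILE_LIKE_CAPABLE = {"ffmpeg", "soundfile"}


def select_backend(
    available: Iterable[str],
    backend: Optional[str] = None,
    file_like: bool = False,
) -> str:
    """Pick a backend name following the documented precedence."""
    available = set(available)
    if backend is not None:
        # An explicit choice overrides the precedence but must be usable.
        if backend not in PRECEDENCE:
            raise ValueError(f"Unknown backend: {backend!r}")
        if backend not in available:
            raise RuntimeError(f"Backend {backend!r} is not available")
        return backend
    for name in PRECEDENCE:
        if name not in available:
            continue
        if file_like and name not in FILE_LIKE_CAPABLE:
            continue  # skip backends that cannot read file-like objects
        return name
    raise RuntimeError("No suitable backend is available")
```

For example, with only SoX and SoundFile installed, a path input would resolve to "sox" while a file-like input would resolve to "soundfile" under these assumptions.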

These functions can be enabled in the current release by setting the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=1.

torchaudio.info

torchaudio.info(uri: Union[BinaryIO, str, PathLike], format: Optional[str] = None, buffer_size: int = 4096, backend: Optional[str] = None) -> AudioMetaData

Get signal information of an audio file.

Parameters:
  • uri (path-like object or file-like object) –

    Source of audio data. The following types are accepted:

    • path-like: file path

    • file-like: an object with a read(size: int) -> bytes method that returns a byte string of at most size bytes.

    Note

    When the input is a file-like object, this function cannot get the correct length (num_samples) for certain formats, such as vorbis. In this case, the value of num_samples is 0.

  • format (str or None, optional) – If not None, interpreted as a hint that may allow the backend to override the detected format. (Default: None)

  • buffer_size (int, optional) – Size of buffer to use when processing file-like objects, in bytes. (Default: 4096)

  • backend (str or None, optional) – I/O backend to use. If None, the function selects a backend based on the input and the available backends. Otherwise, must be one of ["ffmpeg", "sox", "soundfile"], and the corresponding backend must be available. (Default: None)

Returns:

Metadata of the given audio.

Return type:

AudioMetaData
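The sketch below shows both accepted input types. It builds a small in-memory WAV with the standard-library wave module so that a file-like object is at hand, then queries it through torchaudio.info. The helper names make_silent_wav and describe and the file name speech.wav are hypothetical. Passing backend assumes the dispatcher is enabled.

```python
import io
import wave


def make_silent_wav(sample_rate: int = 16000, num_frames: int = 1600) -> bytes:
    """Return the bytes of a mono, 16-bit PCM WAV file containing silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * num_frames)
    return buf.getvalue()


def describe(uri, backend=None):
    """Return (sample_rate, num_channels, num_frames) for a path or file-like object."""
    import torchaudio  # imported lazily; requires torchaudio to be installed

    meta = torchaudio.info(uri, backend=backend)
    return meta.sample_rate, meta.num_channels, meta.num_frames


# Usage (requires torchaudio with the dispatcher enabled):
#   describe("speech.wav")                                   # path-like input
#   describe(io.BytesIO(make_silent_wav()), backend="soundfile")  # file-like input
```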

torchaudio.load

torchaudio.load(uri: Union[BinaryIO, str, PathLike], frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True, format: Optional[str] = None, buffer_size: int = 4096, backend: Optional[str] = None) -> Tuple[Tensor, int]

Load audio data from file.

Note

The formats this function can handle depend on backend availability. This function is tested on the following formats:

  • WAV

    • 32-bit floating-point

    • 32-bit signed integer

    • 24-bit signed integer

    • 16-bit signed integer

    • 8-bit unsigned integer

  • FLAC

  • OGG/VORBIS

  • SPHERE

By default (normalize=True, channels_first=True), this function returns a Tensor with float32 dtype and shape [channel, time].

Warning

The normalize argument does not perform volume normalization. It only converts the sample type from the native sample type to torch.float32.

When the input format is WAV with an integer type (32-bit signed, 24-bit signed, 16-bit signed, or 8-bit unsigned integer), passing normalize=False makes this function return an integer Tensor whose samples span the whole range of the corresponding dtype: int32 for 32-bit signed PCM, int16 for 16-bit signed PCM, and uint8 for 8-bit unsigned PCM. Since torch does not support an int24 dtype, 24-bit signed PCM is converted to int32 tensors.

The normalize argument has no effect on 32-bit floating-point WAV and other formats, such as flac and mp3.

For these formats, this function always returns a float32 Tensor.
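To make the distinction between type conversion and volume normalization concrete, the following sketch converts signed integer samples to floats. It assumes the common convention of dividing by 2^(bits-1); the exact scaling used by each backend is not stated above. The helper name to_float is hypothetical, and unsigned 8-bit data, which also needs an offset, is left out.

```python
from typing import List, Sequence


def to_float(samples: Sequence[int], bits: int) -> List[float]:
    """Map signed integer PCM samples onto the range [-1.0, 1.0)."""
    scale = float(2 ** (bits - 1))  # 32768.0 for 16-bit, 2147483648.0 for 32-bit
    return [s / scale for s in samples]


# A quiet 16-bit signal stays quiet: only the representation changes,
# and the peak is not rescaled to full range.
quiet = to_float([-328, 0, 328], bits=16)
```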

Parameters:
  • uri (path-like object or file-like object) – Source of audio data.

  • frame_offset (int, optional) – Number of frames to skip before starting to read data.

  • num_frames (int, optional) – Maximum number of frames to read. -1 reads all the remaining samples, starting from frame_offset. This function may return fewer frames if there are not enough frames in the given file.

  • normalize (bool, optional) –

    When True, this function converts the native sample type to float32. Default: True.

    If the input file is an integer WAV, passing False changes the resulting Tensor type to an integer type. This argument has no effect for formats other than integer WAV.

  • channels_first (bool, optional) – When True, the returned Tensor has dimension [channel, time]. Otherwise, the returned Tensor’s dimension is [time, channel].

  • format (str or None, optional) – If not None, interpreted as a hint that may allow the backend to override the detected format. (Default: None)

  • buffer_size (int, optional) – Size of buffer to use when processing file-like objects, in bytes. (Default: 4096)

  • backend (str or None, optional) – I/O backend to use. If None, the function selects a backend based on the input and the available backends. Otherwise, must be one of ["ffmpeg", "sox", "soundfile"], and the corresponding backend must be available. (Default: None)

Returns:

Resulting Tensor and sample rate.

If the input file is an integer WAV and normalization is off, the Tensor has an integer type; otherwise it has float32 type. If channels_first=True, the Tensor has shape [channel, time]; otherwise [time, channel].

Return type:

(torch.Tensor, int)
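A common use of frame_offset and num_frames is loading a segment given in seconds. The sketch below converts seconds to frame counts using the sample rate reported by torchaudio.info. The helper names seconds_to_frames and load_segment are hypothetical.

```python
from typing import Tuple


def seconds_to_frames(start_s: float, duration_s: float, sample_rate: int) -> Tuple[int, int]:
    """Convert a (start, duration) pair in seconds to (frame_offset, num_frames)."""
    frame_offset = int(round(start_s * sample_rate))
    num_frames = int(round(duration_s * sample_rate))
    return frame_offset, num_frames


def load_segment(path: str, start_s: float, duration_s: float):
    """Load only the requested time span; returns (waveform, sample_rate)."""
    import torchaudio  # imported lazily; requires torchaudio to be installed

    sample_rate = torchaudio.info(path).sample_rate
    frame_offset, num_frames = seconds_to_frames(start_s, duration_s, sample_rate)
    # With the defaults, waveform is float32 with shape [channel, time].
    # It may hold fewer than num_frames frames if the file ends early.
    return torchaudio.load(path, frame_offset=frame_offset, num_frames=num_frames)
```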

torchaudio.save

torchaudio.save(uri: Union[BinaryIO, str, PathLike], src: Tensor, sample_rate: int, channels_first: bool = True, format: Optional[str] = None, encoding: Optional[str] = None, bits_per_sample: Optional[int] = None, buffer_size: int = 4096, backend: Optional[str] = None)

Save audio data to file.

Note

The formats this function can handle depend on the availability of backends. This function is tested on the following formats:

  • WAV

    • 32-bit floating-point

    • 32-bit signed integer

    • 16-bit signed integer

    • 8-bit unsigned integer

  • FLAC

  • OGG/VORBIS

Parameters:
  • uri (path-like object or file-like object) – Destination for the audio data: a file path or a file-like object.

  • src (torch.Tensor) – Audio data to save. Must be a 2D tensor.

  • sample_rate (int) – Sampling rate.

  • channels_first (bool, optional) – If True, the given tensor is interpreted as [channel, time], otherwise [time, channel].

  • format (str or None, optional) –

    Override the audio format. When the uri argument is a path-like object, the audio format is inferred from the file extension. If the file extension is missing or different, you can specify the correct format with this argument.

    When the uri argument is a file-like object, this argument is required.

    Valid values are "wav", "ogg", and "flac".

  • encoding (str or None, optional) –

    Changes the encoding for supported formats. This argument is effective only for supported formats, i.e. "wav" and "flac". Valid values are:

    • "PCM_S" (signed integer Linear PCM)

    • "PCM_U" (unsigned integer Linear PCM)

    • "PCM_F" (floating point PCM)

    • "ULAW" (mu-law)

    • "ALAW" (a-law)

  • bits_per_sample (int or None, optional) – Changes the bit depth for the supported formats. When format is one of "wav" and "flac", you can change the bit depth. Valid values are 8, 16, 24, 32 and 64.

  • buffer_size (int, optional) – Size of buffer to use when processing file-like objects, in bytes. (Default: 4096)

  • backend (str or None, optional) – I/O backend to use. If None, the function selects a backend based on the input and the available backends. Otherwise, must be one of ["ffmpeg", "sox", "soundfile"], and the corresponding backend must be available. (Default: None)

Supported formats/encodings/bit depth/compression are:

"wav"
  • 32-bit floating-point PCM

  • 32-bit signed integer PCM

  • 24-bit signed integer PCM

  • 16-bit signed integer PCM

  • 8-bit unsigned integer PCM

  • 8-bit mu-law

  • 8-bit a-law

Note:

Default encoding/bit depth is determined by the dtype of the input Tensor.

"flac"
  • 16-bit (default)

  • 24-bit

"ogg"
  • Doesn’t accept changing configuration.
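The table of supported configurations above can be encoded as a small lookup, which is handy for validating arguments before calling torchaudio.save. This is a simplified sketch: the names WAV_CONFIGS, FLAC_BIT_DEPTHS, is_listed_config and save_examples and the file name out.wav are hypothetical, and for "wav" the check expects encoding and bits_per_sample to be given together.

```python
import io
from typing import Optional

# (encoding, bits_per_sample) pairs listed above for "wav".
WAV_CONFIGS = {
    ("PCM_F", 32), ("PCM_S", 32), ("PCM_S", 24), ("PCM_S", 16),
    ("PCM_U", 8), ("ULAW", 8), ("ALAW", 8),
}
# Bit depths listed above for "flac".
FLAC_BIT_DEPTHS = {16, 24}


def is_listed_config(
    format: str,
    encoding: Optional[str] = None,
    bits_per_sample: Optional[int] = None,
) -> bool:
    """Return True if the combination appears in the table above."""
    if encoding is None and bits_per_sample is None:
        return format in ("wav", "flac", "ogg")  # backend defaults
    if format == "wav":
        return (encoding, bits_per_sample) in WAV_CONFIGS
    if format == "flac":
        return encoding is None and bits_per_sample in FLAC_BIT_DEPTHS
    return False  # "ogg" does not accept configuration changes


def save_examples(waveform, sample_rate: int) -> bytes:
    """Save to a path as 16-bit PCM, then to an in-memory buffer."""
    import torchaudio  # imported lazily; requires torchaudio to be installed

    assert is_listed_config("wav", "PCM_S", 16)
    torchaudio.save("out.wav", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
    buffer = io.BytesIO()
    # format is required because a file-like object has no extension.
    torchaudio.save(buffer, waveform, sample_rate, format="wav")
    return buffer.getvalue()
```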
