torchaudio.backend

Overview

The torchaudio.backend module provides implementations of the audio file I/O functions: torchaudio.info, torchaudio.load, torchaudio.load_wav and torchaudio.save.

There are currently four implementations available.

“sox_io” (default on Linux/macOS)
“sox” (deprecated, will be removed in 0.9.0 release)
“soundfile” (default on Windows)
“soundfile” (legacy interface) (deprecated, will be removed in 0.9.0 release)

The use of the "sox" backend is strongly discouraged because it cannot correctly handle formats other than 16-bit integer WAV. See #726 for details.

Note

Instead of calling functions in torchaudio.backend directly, please use torchaudio.info, torchaudio.load, torchaudio.load_wav and torchaudio.save with the proper backend set via torchaudio.set_audio_backend().

Availability

The "sox" and "sox_io" backends require the C++ extension module, which is included in the Linux/macOS binary distributions. These backends are not available on Windows.

The "soundfile" backend requires SoundFile. Please refer to the SoundFile documentation for installation instructions.

Changes in default backend and deprecation

The backend module is going through a major overhaul. The following table summarizes the timeline for the deprecations and removals.

Backend                                       0.8.0                     0.9.0
"sox_io"                                      Default on Linux/macOS    Default on Linux/macOS
"sox" (deprecated)                            Available                 Removed
"soundfile"                                   Default on Windows        Default on Windows
"soundfile" (legacy interface, deprecated)    Available                 Removed

The "sox" and "soundfile" (legacy interface) backends are deprecated and will be removed in 0.9.0 release.
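The per-platform defaults in the table can be expressed as a small lookup. This is only an illustrative sketch: the helper name default_backend_name and the use of sys.platform prefixes are assumptions made here, not part of torchaudio.

```python
import sys

# Defaults as listed in the table above (the same in 0.8.0 and 0.9.0).
# Keys are sys.platform prefixes; this mapping is an assumption of the sketch.
_DEFAULT_BACKENDS = {
    "linux": "sox_io",
    "darwin": "sox_io",    # macOS
    "win32": "soundfile",  # Windows
}


def default_backend_name(platform: str = sys.platform) -> str:
    """Return the documented default backend name for a sys.platform value."""
    for prefix, backend in _DEFAULT_BACKENDS.items():
        if platform.startswith(prefix):
            return backend
    raise ValueError(f"no documented default backend for platform {platform!r}")


print(default_backend_name("darwin"))  # sox_io
```

The returned name is the kind of string that torchaudio.set_audio_backend() expects, for example "sox_io" or "soundfile".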

Common Data Structure

Structures used to report the metadata of audio files.

AudioMetaData

class torchaudio.backend.common.AudioMetaData(sample_rate: int, num_frames: int, num_channels: int, bits_per_sample: int, encoding: str)[source]

Return type of torchaudio.info function.

This class is used by “sox_io” backend and “soundfile” backend with the new interface.

Variables

sample_rate (int) – Sample rate
num_frames (int) – The number of frames
num_channels (int) – The number of channels
bits_per_sample (int) – The number of bits per sample. This is 0 for lossy formats, or when it cannot be accurately inferred.
encoding (str) – Audio encoding.
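As a rough illustration of how these fields relate, the sketch below uses a stand-in dataclass with the same field names. AudioMetaDataSketch, duration_seconds and total_samples are hypothetical names introduced here, and the field values are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class AudioMetaDataSketch:
    """Hypothetical stand-in with the same fields as AudioMetaData."""
    sample_rate: int
    num_frames: int
    num_channels: int
    bits_per_sample: int
    encoding: str


def duration_seconds(meta: AudioMetaDataSketch) -> float:
    # One frame holds one sample per channel, so duration depends only on frames.
    return meta.num_frames / meta.sample_rate


def total_samples(meta: AudioMetaDataSketch) -> int:
    # Samples across all channels.
    return meta.num_frames * meta.num_channels


meta = AudioMetaDataSketch(
    sample_rate=16000, num_frames=48000, num_channels=2,
    bits_per_sample=16, encoding="PCM_S",
)
print(duration_seconds(meta))  # 3.0
print(total_samples(meta))     # 96000
```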

SignalInfo (Deprecated)

class torchaudio.backend.common.SignalInfo(channels: Optional[int] = None, rate: Optional[float] = None, precision: Optional[int] = None, length: Optional[int] = None)[source]

One of the return types of the torchaudio.info function.

This class is used by “sox” backend (deprecated) and “soundfile” backend with the legacy interface (deprecated).

See https://fossies.org/dox/sox-14.4.2/structsox__signalinfo__t.html

Variables

channels (Optional[int]) – The number of channels
rate (Optional[float]) – Sampling rate
precision (Optional[int]) – Bit depth
length (Optional[int]) – For the sox backend, the number of samples (frames * channels). For the soundfile backend, the number of frames.
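Because length means different things for the two legacy backends, code that needs a frame count has to account for the backend. A minimal sketch, assuming a hypothetical helper named frames_from_length:

```python
def frames_from_length(length: int, channels: int, backend: str) -> int:
    """Convert SignalInfo.length to a number of frames (hypothetical helper)."""
    if backend == "sox":
        # The sox backend reports samples across all channels (frames * channels).
        return length // channels
    if backend == "soundfile":
        # The soundfile legacy backend already reports frames.
        return length
    raise ValueError(f"unexpected backend: {backend!r}")


print(frames_from_length(557512, channels=2, backend="sox"))        # 278756
print(frames_from_length(278756, channels=2, backend="soundfile"))  # 278756
```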

EncodingInfo (Deprecated)

class torchaudio.backend.common.EncodingInfo(encoding: Any = None, bits_per_sample: Optional[int] = None, compression: Optional[float] = None, reverse_bytes: Any = None, reverse_nibbles: Any = None, reverse_bits: Any = None, opposite_endian: Optional[bool] = None)[source]

One of the return types of the torchaudio.info function.

This class is used by “sox” backend (deprecated) and “soundfile” backend with the legacy interface (deprecated).

See https://fossies.org/dox/sox-14.4.2/structsox__encodinginfo__t.html

Variables

encoding (Optional[int]) – sox_encoding_t
bits_per_sample (Optional[int]) – bit depth
compression (Optional[float]) – Compression option
reverse_bytes (Any) –
reverse_nibbles (Any) –
reverse_bits (Any) –
opposite_endian (Optional[bool]) –

Sox IO Backend

The "sox_io" backend is available and default on Linux/macOS and not available on Windows.

I/O functions of this backend support TorchScript.

You can switch from another backend to the sox_io backend with the following:

torchaudio.set_audio_backend("sox_io")

info

torchaudio.backend.sox_io_backend.info(filepath: str, format: Optional[str] = None) → torchaudio.backend.common.AudioMetaData[source]

Get signal information of an audio file.

Parameters

filepath (path-like object or file-like object) –
Source of audio data. When the function is not compiled by TorchScript (e.g. with torch.jit.script), the following types are accepted:
- path-like: file path
- file-like: Object with read(size: int) -> bytes method, which returns a byte string of at most size length.
When the function is compiled by TorchScript, only str type is allowed.
Note
- When the input is a file-like object, this function cannot get the correct length (num_frames) for certain formats, such as mp3 and vorbis. In this case, the value of num_frames is 0.
- This argument is intentionally annotated as str only due to TorchScript compiler compatibility.
format (str, optional) – Override the format detection with the given format. Providing the argument might help when libsox cannot infer the format from the header or extension.
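The file-like requirement above is only a read(size) method that returns at most size bytes. The sketch below builds a tiny WAV file in memory with the standard library and wraps it in a minimal reader; ChunkedReader and make_wav_bytes are hypothetical names used for illustration and are not part of torchaudio.

```python
import io
import wave


def make_wav_bytes(num_frames: int = 8, sample_rate: int = 8000) -> bytes:
    """Build a tiny mono 16-bit WAV file in memory using the standard library."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * num_frames)  # silence
    return buffer.getvalue()


class ChunkedReader:
    """Hypothetical file-like object exposing only read(size) -> bytes."""

    def __init__(self, data: bytes) -> None:
        self._data = data
        self._pos = 0

    def read(self, size: int) -> bytes:
        # Return at most `size` bytes; an empty result signals end of data.
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


reader = ChunkedReader(make_wav_bytes())
header = reader.read(4)
print(header)  # b'RIFF'
```

According to the parameter description, an object of this shape could be passed as filepath when the function is not compiled by TorchScript. Since no file extension is available in that case, the format argument may help.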

Returns

Metadata of the given audio.

Return type

AudioMetaData

load

torchaudio.backend.sox_io_backend.load(filepath: str, frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True, format: Optional[str] = None) → Tuple[torch.Tensor, int][source]

Load audio data from file.

Note

This function can handle all the codecs that the underlying libsox can handle; however, it is tested on the following formats:

WAV, AMB
- 32-bit floating-point
- 32-bit signed integer
- 16-bit signed integer
- 8-bit unsigned integer (WAV only)
MP3
FLAC
OGG/VORBIS
OPUS
SPHERE
AMR-NB

To load MP3, FLAC, OGG/VORBIS, OPUS and other codecs that libsox does not handle natively, your installation of torchaudio has to be linked to libsox and the corresponding codec libraries, such as libmad or libmp3lame.

By default (normalize=True, channels_first=True), this function returns a Tensor with float32 dtype and the shape [channel, time]. The samples are normalized to fit in the range [-1.0, 1.0].

When the input is WAV with an integer type (32-bit signed integer, 16-bit signed integer or 8-bit unsigned integer; 24-bit signed integer is not supported), passing normalize=False makes this function return an integer Tensor whose samples span the whole range of the corresponding dtype: int32 for 32-bit signed PCM, int16 for 16-bit signed PCM and uint8 for 8-bit unsigned PCM.

The normalize parameter has no effect on 32-bit floating-point WAV and other formats, such as flac and mp3. For these formats, this function always returns a float32 Tensor with values normalized to [-1.0, 1.0].
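The dtype rules above can be summarized as a small decision function. This is a sketch of the documented rules only, not the library's implementation: loaded_dtype is a hypothetical helper, dtypes are given as plain strings, the encoding names are borrowed from the save() documentation, and raising for unsupported integer layouts is a choice made here.

```python
# Integer WAV layouts that can be returned as integer tensors with normalize=False.
_INTEGER_WAV_DTYPES = {
    ("PCM_S", 32): "int32",
    ("PCM_S", 16): "int16",
    ("PCM_U", 8): "uint8",
}


def loaded_dtype(file_format: str, encoding: str, bits_per_sample: int,
                 normalize: bool = True) -> str:
    """Return the dtype name that load() is documented to produce."""
    is_integer_wav = file_format == "wav" and encoding in ("PCM_S", "PCM_U")
    if normalize or not is_integer_wav:
        # Float WAV and all other formats always come back as float32.
        return "float32"
    key = (encoding, bits_per_sample)
    if key not in _INTEGER_WAV_DTYPES:
        # For example 24-bit signed integer, which is documented as unsupported.
        raise ValueError(f"no integer dtype documented for {key}")
    return _INTEGER_WAV_DTYPES[key]


print(loaded_dtype("wav", "PCM_S", 16))                    # float32
print(loaded_dtype("wav", "PCM_S", 16, normalize=False))   # int16
print(loaded_dtype("flac", "PCM_S", 16, normalize=False))  # float32
```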

Parameters

filepath (path-like object or file-like object) –
Source of audio data. When the function is not compiled by TorchScript (e.g. with torch.jit.script), the following types are accepted:
- path-like: file path
- file-like: Object with read(size: int) -> bytes method, which returns a byte string of at most size length.
When the function is compiled by TorchScript, only str type is allowed.

Note: This argument is intentionally annotated as str only due to TorchScript compiler compatibility.
frame_offset (int) – Number of frames to skip before starting to read data.
num_frames (int) – Maximum number of frames to read. -1 reads all the remaining samples, starting from frame_offset. This function may return fewer frames if the given file does not contain enough frames.
normalize (bool) – When True, this function always returns float32, and sample values are normalized to [-1.0, 1.0]. If the input file is integer WAV, giving False changes the resulting Tensor type to an integer type. This argument has no effect for formats other than integer WAV.
channels_first (bool) – When True, the returned Tensor has dimension [channel, time]. Otherwise, the returned Tensor’s dimension is [time, channel].
format (str, optional) – Override the format detection with the given format. Providing the argument might help when libsox cannot infer the format from the header or extension.
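The interaction of frame_offset, num_frames and channels_first can be sketched on plain nested lists. The function select_frames below is hypothetical and only mimics the documented semantics; it does not read any audio.

```python
from typing import List


def select_frames(frames: List[List[float]], frame_offset: int = 0,
                  num_frames: int = -1,
                  channels_first: bool = True) -> List[List[float]]:
    """Mimic the documented windowing; `frames` is laid out as [time][channel]."""
    if num_frames == -1:
        # -1 means everything from frame_offset to the end.
        window = frames[frame_offset:]
    else:
        # Slicing silently returns fewer frames when the data runs out.
        window = frames[frame_offset:frame_offset + num_frames]
    if channels_first:
        # Transpose [time, channel] into [channel, time].
        return [list(channel) for channel in zip(*window)]
    return [list(frame) for frame in window]


stereo = [[0, 10], [1, 11], [2, 12], [3, 13]]  # 4 frames, 2 channels
print(select_frames(stereo, frame_offset=1, num_frames=2))
# [[1, 2], [11, 12]]
print(select_frames(stereo, frame_offset=3, num_frames=5, channels_first=False))
# [[3, 13]]
```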

Returns

Resulting Tensor and sample rate. If the input file has integer WAV format and normalization is off, the Tensor has integer type, otherwise float32 type. If channels_first=True, it has shape [channel, time], otherwise [time, channel].

Return type

Tuple[torch.Tensor, int]

torchaudio.backend.sox_io_backend.load_wav(filepath: str, frame_offset: int = 0, num_frames: int = -1, channels_first: bool = True) → Tuple[torch.Tensor, int][source]

Load wave file.

This function is defined only for compatibility with other backends in simple use cases, such as torchaudio.load_wav(filepath). The implementation is the same as load().

save

torchaudio.backend.sox_io_backend.save(filepath: str, src: torch.Tensor, sample_rate: int, channels_first: bool = True, compression: Optional[float] = None, format: Optional[str] = None, encoding: Optional[str] = None, bits_per_sample: Optional[int] = None)[source]

Save audio data to file.

Parameters

filepath (str or pathlib.Path) – Path to save file. This function also handles pathlib.Path objects, but is annotated as str for TorchScript compiler compatibility.
src (torch.Tensor) – Audio data to save. Must be a 2D tensor.
sample_rate (int) – Sampling rate
channels_first (bool) – If True, the given tensor is interpreted as [channel, time], otherwise [time, channel].
compression (Optional[float]) –
Used for formats other than WAV. This corresponds to -C option of sox command.

"mp3"
Either bitrate (in kbps) with quality factor, such as 128.2, or VBR encoding with quality factor such as -4.2. Default: -4.5.

"flac"
Whole number from 0 to 8. 8 is default and highest compression.

"ogg", "vorbis"
Number from -1 to 10; -1 is the highest compression and lowest quality. Default: 3.

See http://sox.sourceforge.net/soxformat.html for details.
format (str, optional) –
Override the audio format. When the filepath argument is a path-like object, the audio format is inferred from the file extension. If the file extension is missing or different, you can specify the correct format with this argument.

When filepath argument is file-like object, this argument is required.

Valid values are "wav", "mp3", "ogg", "vorbis", "amr-nb", "amb", "flac", "sph", "gsm", and "htk".
encoding (str, optional) –
Changes the encoding for the supported formats. This argument is effective only for supported formats, such as "wav", "amb" and "sph". Valid values are:
- "PCM_S" (signed integer Linear PCM)
- "PCM_U" (unsigned integer Linear PCM)
- "PCM_F" (floating point PCM)
- "ULAW" (mu-law)
- "ALAW" (a-law)
Default values
If not provided, the default value is picked based on format and bits_per_sample.
"wav", "amb"
- If both encoding and bits_per_sample are not provided, the dtype of the Tensor is used to determine the default value:
  - "PCM_U" if dtype is uint8
  - "PCM_S" if dtype is int16 or int32
  - "PCM_F" if dtype is float32
- "PCM_U" if bits_per_sample=8
- "PCM_S" otherwise
"sph"
- The default value is "PCM_S"
bits_per_sample (int, optional) –
Changes the bit depth for the supported formats. When format is one of "wav", "flac", "sph", or "amb", you can change the bit depth. Valid values are 8, 16, 32 and 64.
Default values
If not provided, the default values are picked based on format and encoding.
"wav", "amb"
- If both encoding and bits_per_sample are not provided, the dtype of the Tensor is used:
  - 8 if dtype is uint8
  - 16 if dtype is int16
  - 32 if dtype is int32 or float32
- 8 if encoding is "PCM_U", "ULAW" or "ALAW"
- 16 if encoding is "PCM_S"
- 32 if encoding is "PCM_F"
"flac"
- The default value is 24
"sph"
- 16 if encoding is "PCM_U", "PCM_S", "PCM_F" or not provided
- 8 if encoding is "ULAW" or "ALAW"
"amb"
- 8 if encoding is "PCM_U", "ULAW" or "ALAW"
- 16 if encoding is "PCM_S" or not provided
- 32 if encoding is "PCM_F"
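The default rules for encoding and bits_per_sample can be followed step by step in a small resolver. This is a sketch of the rules as listed, not the library's code: resolve_defaults is a hypothetical helper, the tensor dtype is passed as a plain string, and "amb" is left out because the list above gives two different rule sets for it.

```python
from typing import Optional, Tuple

# (encoding, bits_per_sample) chosen from the tensor dtype when neither is given.
_WAV_DEFAULTS_BY_DTYPE = {
    "uint8": ("PCM_U", 8),
    "int16": ("PCM_S", 16),
    "int32": ("PCM_S", 32),
    "float32": ("PCM_F", 32),
}
_WAV_BITS_BY_ENCODING = {"PCM_U": 8, "ULAW": 8, "ALAW": 8, "PCM_S": 16, "PCM_F": 32}


def resolve_defaults(fmt: str, dtype: str, encoding: Optional[str] = None,
                     bits_per_sample: Optional[int] = None
                     ) -> Tuple[Optional[str], Optional[int]]:
    """Fill in missing encoding / bits_per_sample following the listed rules."""
    if fmt == "wav":
        if encoding is None and bits_per_sample is None:
            return _WAV_DEFAULTS_BY_DTYPE[dtype]
        if encoding is None:
            encoding = "PCM_U" if bits_per_sample == 8 else "PCM_S"
        if bits_per_sample is None:
            bits_per_sample = _WAV_BITS_BY_ENCODING[encoding]
    elif fmt == "sph":
        if bits_per_sample is None:
            bits_per_sample = 8 if encoding in ("ULAW", "ALAW") else 16
        if encoding is None:
            encoding = "PCM_S"
    elif fmt == "flac":
        if bits_per_sample is None:
            bits_per_sample = 24
    return encoding, bits_per_sample


print(resolve_defaults("wav", "float32"))                   # ('PCM_F', 32)
print(resolve_defaults("wav", "float32", encoding="ULAW"))  # ('ULAW', 8)
print(resolve_defaults("sph", "int16"))                     # ('PCM_S', 16)
```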

Supported formats/encodings/bit depth/compression are;

"wav", "amb"

32-bit floating-point PCM
32-bit signed integer PCM
24-bit signed integer PCM
16-bit signed integer PCM
8-bit unsigned integer PCM
8-bit mu-law
8-bit a-law

Note: Default encoding/bit depth is determined by the dtype of the input Tensor.

"mp3"

Fixed bit rate (such as 128 kbps) and variable bit rate compression. Default: VBR with high quality.

"flac"

8-bit
16-bit
24-bit (default)

"ogg", "vorbis"

Different quality levels. Default: approx. 112 kbps

"sph"

8-bit signed integer PCM
16-bit signed integer PCM
24-bit signed integer PCM
32-bit signed integer PCM (default)
8-bit mu-law
8-bit a-law
16-bit a-law
24-bit a-law
32-bit a-law

"amr-nb"

Bitrate ranging from 4.75 kbit/s to 12.2 kbit/s. Default: 4.75 kbit/s

"gsm"

Lossy Speech Compression, CPU intensive.

"htk"

Uses its default single-channel 16-bit PCM format.

Note

To save into formats that libsox does not handle natively (such as "mp3", "flac", "ogg" and "vorbis"), your installation of torchaudio has to be linked to libsox and the corresponding codec libraries, such as libmad or libmp3lame.
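For "mp3", the compression value packs two numbers into one float. The sketch below splits it, assuming (as the examples 128.2 and -4.2 in the parameter description suggest) that the integer part is the bitrate or the VBR level and the first decimal digit is the quality factor. describe_mp3_compression is a hypothetical helper, and the exact interpretation should be checked against the sox format documentation.

```python
from decimal import Decimal
from typing import Tuple


def describe_mp3_compression(compression: float = -4.5) -> Tuple[str, int, int]:
    """Split an mp3 compression value into (mode, level, quality factor)."""
    # Decimal(str(...)) avoids binary float noise such as 128.2 - 128 != 0.2.
    value = Decimal(str(compression))
    magnitude = abs(value)
    level = int(magnitude)                     # bitrate in kbps, or VBR level
    quality = int((magnitude - level) * 10)    # first decimal digit
    mode = "vbr" if value < 0 else "bitrate_kbps"
    return mode, level, quality


print(describe_mp3_compression(128.2))  # ('bitrate_kbps', 128, 2)
print(describe_mp3_compression(-4.2))   # ('vbr', 4, 2)
print(describe_mp3_compression())       # ('vbr', 4, 5)
```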

Sox Backend (Deprecated)

The "sox" backend is available on Linux/macOS and not available on Windows. This backend is deprecated and will be removed in the 0.9.0 release.

You can switch from another backend to the sox backend with the following:

torchaudio.set_audio_backend("sox")

info

torchaudio.backend.sox_backend.info(filepath: str) → Tuple[torchaudio.backend.common.SignalInfo, torchaudio.backend.common.EncodingInfo][source]

Gets metadata from an audio file without loading the signal.

Parameters

filepath – Path to audio file

Returns

A si (sox_signalinfo_t) signal info as a Python object, and an ei (sox_encodinginfo_t) encoding info.

Return type

(sox_signalinfo_t, sox_encodinginfo_t)

Example

>>> si, ei = torchaudio.info('foo.wav')
>>> rate, channels, encoding = si.rate, si.channels, ei.encoding

load

torchaudio.backend.sox_backend.load(filepath: str, out: Optional[torch.Tensor] = None, normalization: bool = True, channels_first: bool = True, num_frames: int = 0, offset: int = 0, signalinfo: torchaudio.backend.common.SignalInfo = None, encodinginfo: torchaudio.backend.common.EncodingInfo = None, filetype: Optional[str] = None) → Tuple[torch.Tensor, int][source]

Loads an audio file from disk into a tensor

Parameters

filepath – Path to audio file
out – An optional output tensor to use instead of creating one. (Default: None)
normalization – Optional normalization. If boolean True, then output is divided by 1 << 31. Assuming the input is signed 32-bit audio, this normalizes to [-1, 1]. If float, then output is divided by that number. If Callable, then the output is passed as a parameter to the given function, then the output is divided by the result. (Default: True)
channels_first – Set channels first or length first in result. (Default: True)
num_frames – Number of frames to load. 0 to load everything after the offset. (Default: 0)
offset – Number of frames from the start of the file to begin data loading. (Default: 0)
signalinfo – A sox_signalinfo_t type, which could be helpful if the audio type cannot be automatically determined. (Default: None)
encodinginfo – A sox_encodinginfo_t type, which could be set if the audio type cannot be automatically determined. (Default: None)
filetype – A filetype or extension to be set if sox cannot determine it automatically. (Default: None)

Returns

An output tensor of size [C x L] or [L x C], where L is the number of audio frames and C is the number of channels, and an integer which is the sample rate of the audio (as listed in the metadata of the file).

Return type

(Tensor, int)

Example

>>> data, sample_rate = torchaudio.load('foo.mp3')
>>> print(data.size())
torch.Size([2, 278756])
>>> print(sample_rate)
44100
>>> data_vol_normalized, _ = torchaudio.load('foo.mp3', normalization=lambda x: torch.abs(x).max())
>>> print(data_vol_normalized.abs().max())
1.
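The three accepted forms of the normalization argument (boolean, float, callable) can be sketched in plain Python on a list of sample values. apply_normalization is a hypothetical helper that follows the parameter description; treating False as "leave the samples unchanged" is an assumption of this sketch.

```python
from typing import Callable, List, Union

Normalization = Union[bool, float, Callable[[List[float]], float]]


def apply_normalization(samples: List[float],
                        normalization: Normalization = True) -> List[float]:
    """Divide samples according to the documented normalization modes."""
    if callable(normalization):
        # The data is passed to the function; its result is the divisor.
        divisor = normalization(samples)
    elif isinstance(normalization, bool):
        # True divides by 1 << 31, which maps signed 32-bit audio to [-1, 1].
        divisor = float(1 << 31) if normalization else 1.0
    else:
        divisor = float(normalization)
    return [sample / divisor for sample in samples]


print(apply_normalization([1 << 30, -(1 << 31)]))  # [0.5, -1.0]
print(apply_normalization([2.0, 4.0], 2.0))        # [1.0, 2.0]
print(apply_normalization([1.0, -4.0], lambda xs: max(abs(x) for x in xs)))
# [0.25, -1.0]
```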

torchaudio.backend.sox_backend.load_wav(filepath, **kwargs)[source]

Loads a wave file.

It assumes that the wav file uses 16 bits per sample and needs normalization by shifting the input right by 16 bits.

Parameters

filepath – Path to audio file

Returns

An output tensor of size [C x L] or [L x C], where L is the number of audio frames and C is the number of channels, and an integer which is the sample rate of the audio (as listed in the metadata of the file).

Return type

(Tensor, int)
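The right shift mentioned for load_wav can be illustrated with plain integers. The sketch assumes that each 16-bit sample arrives left-aligned in a 32-bit value, which is an interpretation of the description above rather than a confirmed detail; shift_to_int16_range is a hypothetical helper.

```python
from typing import List


def shift_to_int16_range(samples_32bit: List[int]) -> List[int]:
    """Shift 32-bit sample values right by 16 bits, back into the int16 range."""
    # Python's >> is an arithmetic shift, so negative values keep their sign.
    return [sample >> 16 for sample in samples_32bit]


widened = [32767 << 16, -32768 << 16, 0]  # int16 extremes stored in the upper 16 bits
print(shift_to_int16_range(widened))      # [32767, -32768, 0]
```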

save

torchaudio.backend.sox_backend.save(filepath: str, src: torch.Tensor, sample_rate: int, precision: int = 16, channels_first: bool = True) → None[source]

Saves a Tensor on file as an audio file

Parameters

filepath – Path to audio file
src – An input 2D tensor of shape [C x L] or [L x C] where L is the number of audio frames, C is the number of channels
sample_rate – An integer which is the sample rate of the audio (as listed in the metadata of the file)
precision – Bit precision (Default: 16)
channels_first (bool, optional) – Set channels first or length first in result. (Default: True)

others

torchaudio.backend.sox_backend.get_sox_bool(i: int = 0) → Any[source]

Get enum of sox_bool for sox encodinginfo options.

Parameters: i (int, optional) – Choose the type, or get a dict with all possible options. Use __members__ to see all options when not specified. (Default: sox_false or 0)
Returns: A sox_bool type
Return type: sox_bool

torchaudio.backend.sox_backend.get_sox_encoding_t(i: int = None) → torchaudio.backend.common.EncodingInfo[source]

Get enum of sox_encoding_t for sox encodings.

Parameters: i (int, optional) – Choose the type, or get a dict with all possible options. Use __members__ to see all options when not specified. (Default: None)
Returns: A sox_encoding_t type for output encoding
Return type: sox_encoding_t

torchaudio.backend.sox_backend.get_sox_option_t(i: int = 2) → Any[source]

Get enum of sox_option_t for sox encodinginfo options.

Parameters: i (int, optional) – Choose the type, or get a dict with all possible options. Use __members__ to see all options when not specified. (Default: sox_option_default or 2)
Returns: A sox_option_t type
Return type: sox_option_t

torchaudio.backend.sox_backend.save_encinfo(filepath: str, src: torch.Tensor, channels_first: bool = True, signalinfo: Optional[torchaudio.backend.common.SignalInfo] = None, encodinginfo: Optional[torchaudio.backend.common.EncodingInfo] = None, filetype: Optional[str] = None) → None[source]

Saves a tensor of an audio signal to disk as a standard format like mp3, wav, etc.

Parameters

filepath (str) – Path to audio file
src (Tensor) – An input 2D tensor of shape [C x L] or [L x C] where L is the number of audio frames, C is the number of channels
channels_first (bool, optional) – Set channels first or length first in result. (Default: True)
signalinfo (sox_signalinfo_t, optional) – A sox_signalinfo_t type, which could be helpful if the audio type cannot be automatically determined (Default: None).
encodinginfo (sox_encodinginfo_t, optional) – A sox_encodinginfo_t type, which could be set if the audio type cannot be automatically determined (Default: None).
filetype (str, optional) – A filetype or extension to be set if sox cannot determine it automatically. (Default: None)

Example

>>> data, sample_rate = torchaudio.load('foo.mp3')
>>> torchaudio.save('foo.wav', data, sample_rate)

torchaudio.backend.sox_backend.sox_encodinginfo_t() → torchaudio.backend.common.EncodingInfo[source]

Create a sox_encodinginfo_t object. This object can be used to set the encoding type, bit precision, compression factor, reverse bytes, reverse nibbles, reverse bits and endianness. This can be used in an effects chain to encode the final output or to save a file with a specific encoding. For example, one could use the sox ulaw encoding to do 8-bit ulaw encoding. Note in a tensor output the result will be a 32-bit number, but number of unique values will be determined by the bit precision.

Returns: sox_encodinginfo_t(object)

encoding (sox_encoding_t), output encoding
bits_per_sample (int), bit precision, same as precision in sox_signalinfo_t
compression (float), compression for lossy formats, 0.0 for default compression
reverse_bytes (sox_option_t), reverse bytes, use sox_option_default
reverse_nibbles (sox_option_t), reverse nibbles, use sox_option_default
reverse_bits (sox_option_t), reverse bits, use sox_option_default
opposite_endian (sox_bool), change endianness, use sox_false

Example

>>> ei = torchaudio.sox_encodinginfo_t()
>>> ei.encoding = torchaudio.get_sox_encoding_t(1)
>>> ei.bits_per_sample = 16
>>> ei.compression = 0
>>> ei.reverse_bytes = torchaudio.get_sox_option_t(2)
>>> ei.reverse_nibbles = torchaudio.get_sox_option_t(2)
>>> ei.reverse_bits = torchaudio.get_sox_option_t(2)
>>> ei.opposite_endian = torchaudio.get_sox_bool(0)

torchaudio.backend.sox_backend.sox_signalinfo_t() → torchaudio.backend.common.SignalInfo[source]

Create a sox_signalinfo_t object. This object can be used to set the sample rate, number of channels, length, bit precision and headroom multiplier, primarily for effects.

Returns: sox_signalinfo_t(object)

rate (float), sample rate as a float, in practice usually an integer-valued float
channels (int), number of audio channels
precision (int), bit precision
length (int), length of audio in samples * channels, 0 for unspecified and -1 for unknown
mult (float, optional), headroom multiplier for effects and None for no multiplier

Example

>>> si = torchaudio.sox_signalinfo_t()
>>> si.channels = 1
>>> si.rate = 16000.
>>> si.precision = 16
>>> si.length = 0

Soundfile Backend

The "soundfile" backend is available when SoundFile is installed. This backend is the default on Windows.

You can switch from another backend to the "soundfile" backend with the following:

torchaudio.set_audio_backend("soundfile")

Note

If you are switching from the “soundfile” (legacy interface) backend, set the torchaudio.USE_SOUNDFILE_LEGACY_INTERFACE flag before switching the backend.

info

torchaudio.backend._soundfile_backend.info(filepath: str, format: Optional[str] = None) → torchaudio.backend.common.AudioMetaData[source]

Get signal information of an audio file.

Parameters

filepath (path-like object or file-like object) –

Source of audio data.

Note: This argument is intentionally annotated as str only, for consistency with the "sox_io" backend, which has a restriction on type annotation due to TorchScript compiler compatibility.

format (str, optional) – Not used. PySoundFile does not accept a format hint.

Returns

Metadata of the given audio.

Return type

AudioMetaData

load

torchaudio.backend._soundfile_backend.load(filepath: str, frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True, format: Optional[str] = None) → Tuple[torch.Tensor, int][source]

Load audio data from file.

Note

The formats this function can handle depend on the soundfile installation. This function is tested on the following formats:

WAV
- 32-bit floating-point
- 32-bit signed integer
- 16-bit signed integer
- 8-bit unsigned integer
FLAC
OGG/VORBIS
SPHERE

By default (normalize=True, channels_first=True), this function returns a Tensor with float32 dtype and the shape [channel, time]. The samples are normalized to fit in the range [-1.0, 1.0].

When the input is WAV with an integer type (32-bit signed integer, 16-bit signed integer or 8-bit unsigned integer; 24-bit signed integer is not supported), passing normalize=False makes this function return an integer Tensor whose samples span the whole range of the corresponding dtype: int32 for 32-bit signed PCM, int16 for 16-bit signed PCM and uint8 for 8-bit unsigned PCM.

The normalize parameter has no effect on 32-bit floating-point WAV and other formats, such as flac and mp3. For these formats, this function always returns a float32 Tensor with values normalized to [-1.0, 1.0].

Parameters

filepath (path-like object or file-like object) –

Source of audio data.

Note: This argument is intentionally annotated as str only, for consistency with the "sox_io" backend, which has a restriction on type annotation due to TorchScript compiler compatibility.

frame_offset (int) – Number of frames to skip before starting to read data.
num_frames (int) – Maximum number of frames to read. -1 reads all the remaining samples, starting from frame_offset. This function may return fewer frames if the given file does not contain enough frames.
normalize (bool) – When True, this function always returns float32, and sample values are normalized to [-1.0, 1.0]. If the input file is integer WAV, giving False changes the resulting Tensor type to an integer type. This argument has no effect for formats other than integer WAV.
channels_first (bool) – When True, the returned Tensor has dimension [channel, time]. Otherwise, the returned Tensor’s dimension is [time, channel].
format (str, optional) – Not used. PySoundFile does not accept a format hint.

Returns

Resulting Tensor and sample rate. If the input file has integer WAV format and normalization is off, the Tensor has integer type, otherwise float32 type. If channels_first=True, it has shape [channel, time], otherwise [time, channel].

Return type

Tuple[torch.Tensor, int]

torchaudio.backend._soundfile_backend.load_wav(filepath: str, frame_offset: int = 0, num_frames: int = -1, channels_first: bool = True) → Tuple[torch.Tensor, int][source]

Load wave file.

This function is defined only for compatibility with other backends in simple use cases, such as torchaudio.load_wav(filepath). The implementation is the same as load().

save

torchaudio.backend._soundfile_backend.save(filepath: str, src: torch.Tensor, sample_rate: int, channels_first: bool = True, compression: Optional[float] = None, format: Optional[str] = None, encoding: Optional[str] = None, bits_per_sample: Optional[int] = None)[source]

Save audio data to file.

Note

The formats this function can handle depend on the soundfile installation. This function is tested on the following formats:

WAV
- 32-bit floating-point
- 32-bit signed integer
- 16-bit signed integer
- 8-bit unsigned integer
FLAC
OGG/VORBIS
SPHERE

Parameters

filepath (str or pathlib.Path) – Path to audio file. This function also handles pathlib.Path objects, but is annotated as str for consistency with the “sox_io” backend, which has a restriction on type annotation for TorchScript compiler compatibility.
src (torch.Tensor) – Audio data to save. Must be a 2D tensor.
sample_rate (int) – Sampling rate
channels_first (bool) – If True, the given tensor is interpreted as [channel, time], otherwise [time, channel].
compression (Optional[float]) – Not used. It is here only for interface compatibility with the “sox_io” backend.
format (str, optional) –
Override the audio format. When filepath argument is path-like object, audio format is inferred from file extension. If the file extension is missing or different, you can specify the correct format with this argument.

When filepath argument is file-like object, this argument is required.

Valid values are "wav", "ogg", "vorbis", "flac" and "sph".
encoding (str, optional) –
Changes the encoding for supported formats. This argument is effective only for supported formats, such as "wav", "flac" and "sph". Valid values are:
- "PCM_S" (signed integer Linear PCM)
- "PCM_U" (unsigned integer Linear PCM)
- "PCM_F" (floating point PCM)
- "ULAW" (mu-law)
- "ALAW" (a-law)
bits_per_sample (int, optional) – Changes the bit depth for the supported formats. When format is one of "wav", "flac" or "sph", you can change the bit depth. Valid values are 8, 16, 24, 32 and 64.

Supported formats/encodings/bit depth/compression are:

"wav"

32-bit floating-point PCM
32-bit signed integer PCM
24-bit signed integer PCM
16-bit signed integer PCM
8-bit unsigned integer PCM
8-bit mu-law
8-bit a-law

Note: Default encoding/bit depth is determined by the dtype of the input Tensor.

"flac"

8-bit
16-bit
24-bit (default)

"ogg", "vorbis"

Doesn’t accept changing configuration.

"sph"

8-bit signed integer PCM
16-bit signed integer PCM
24-bit signed integer PCM
32-bit signed integer PCM (default)
8-bit mu-law
8-bit a-law
16-bit a-law
24-bit a-law
32-bit a-law

Legacy Interface (Deprecated)

The "soundfile" backend with the legacy interface is available for backward compatibility; however, this interface is deprecated and will be removed in the 0.9.0 release.

To switch to this backend/interface, set the torchaudio.USE_SOUNDFILE_LEGACY_INTERFACE flag before switching the backend.

torchaudio.USE_SOUNDFILE_LEGACY_INTERFACE = True
torchaudio.set_audio_backend("soundfile")  # The legacy interface

info

torchaudio.backend.soundfile_backend.info(filepath: str) → Tuple[torchaudio.backend.common.SignalInfo, torchaudio.backend.common.EncodingInfo][source]

Gets metadata from an audio file without loading the signal.

Parameters

filepath – Path to audio file

Returns

A si (sox_signalinfo_t) signal info as a Python object, and an ei (sox_encodinginfo_t) encoding info.

Return type

(sox_signalinfo_t, sox_encodinginfo_t)

Example

>>> si, ei = torchaudio.info('foo.wav')
>>> rate, channels, encoding = si.rate, si.channels, ei.encoding

load

torchaudio.backend.soundfile_backend.load(filepath: str, out: Optional[torch.Tensor] = None, normalization: Optional[bool] = True, channels_first: Optional[bool] = True, num_frames: int = 0, offset: int = 0, signalinfo: torchaudio.backend.common.SignalInfo = None, encodinginfo: torchaudio.backend.common.EncodingInfo = None, filetype: Optional[str] = None) → Tuple[torch.Tensor, int][source]

Loads an audio file from disk into a tensor

Parameters

filepath – Path to audio file
out – An optional output tensor to use instead of creating one. (Default: None)
normalization – Optional normalization. If boolean True, then output is divided by 1 << 31. Assuming the input is signed 32-bit audio, this normalizes to [-1, 1]. If float, then output is divided by that number. If Callable, then the output is passed as a parameter to the given function, then the output is divided by the result. (Default: True)
channels_first – Set channels first or length first in result. (Default: True)
num_frames – Number of frames to load. 0 to load everything after the offset. (Default: 0)
offset – Number of frames from the start of the file to begin data loading. (Default: 0)
signalinfo – A sox_signalinfo_t type, which could be helpful if the audio type cannot be automatically determined. (Default: None)
encodinginfo – A sox_encodinginfo_t type, which could be set if the audio type cannot be automatically determined. (Default: None)
filetype – A filetype or extension to be set if sox cannot determine it automatically. (Default: None)

Returns

An output tensor of size [C x L] or [L x C], where L is the number of audio frames and C is the number of channels, and an integer which is the sample rate of the audio (as listed in the metadata of the file).

Return type

(Tensor, int)

Example

>>> data, sample_rate = torchaudio.load('foo.mp3')
>>> print(data.size())
torch.Size([2, 278756])
>>> print(sample_rate)
44100
>>> data_vol_normalized, _ = torchaudio.load('foo.mp3', normalization=lambda x: torch.abs(x).max())
>>> print(data_vol_normalized.abs().max())
1.

torchaudio.backend.soundfile_backend.load_wav(filepath, **kwargs)[source]

Loads a wave file.

It assumes that the wav file uses 16 bits per sample and needs normalization by shifting the input right by 16 bits.

Parameters

filepath – Path to audio file

Returns

An output tensor of size [C x L] or [L x C], where L is the number of audio frames and C is the number of channels, and an integer which is the sample rate of the audio (as listed in the metadata of the file).

Return type

(Tensor, int)

save

torchaudio.backend.soundfile_backend.save(filepath: str, src: torch.Tensor, sample_rate: int, precision: int = 16, channels_first: bool = True) → None[source]

Saves a Tensor on file as an audio file

Parameters

filepath – Path to audio file
src – An input 2D tensor of shape [C x L] or [L x C] where L is the number of audio frames, C is the number of channels
sample_rate – An integer which is the sample rate of the audio (as listed in the metadata of the file)
precision – Bit precision (Default: 16)
channels_first (bool, optional) – Set channels first or length first in result. (Default: True)
