torchaudio.backend

The torchaudio.backend module provides implementations for audio file I/O using different backend libraries. To switch backends, use torchaudio.set_audio_backend(). To check the current backend, use torchaudio.get_audio_backend().

Warning

Although the sox backend is the default for backward-compatibility reasons, it has a number of known issues, so using the sox_io backend instead is highly recommended. Note, however, that because the interface was refined, functions defined in the sox backend and those defined in the sox_io backend do not have the same signatures.

Note

Instead of calling functions in torchaudio.backend directly, please use torchaudio.info, torchaudio.load, torchaudio.load_wav and torchaudio.save after setting the proper backend with torchaudio.set_audio_backend().

There are currently three implementations available.

The sox backend is the original backend, built on libsox. It is currently the default but is known to have a number of issues, such as incorrect handling of WAV files other than 16-bit signed integer. Users are encouraged to use the sox_io backend. This backend requires the C++ extension module and is not available on Windows.

The sox_io backend is the new backend, built on libsox and bound to Python with TorchScript. It addresses all the known issues of the sox backend. Function calls to this backend can be TorchScript-able. This backend requires the C++ extension module and is not available on Windows.

soundfile backend is built on PySoundFile. You need to install PySoundFile separately.
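As a rough illustration of choosing among the three backends, the sketch below picks a backend name from a list of available ones. The helper `pick_backend` and its preference order are hypothetical: the text above only recommends sox_io over sox, and placing soundfile second is an assumption made for this example.

```python
from typing import Iterable, Optional

# Assumed preference order: the text recommends sox_io over sox; placing
# soundfile second is a choice made for this sketch only.
PREFERRED_ORDER = ("sox_io", "soundfile", "sox")


def pick_backend(available: Iterable[str]) -> Optional[str]:
    """Return the most preferred backend name present in `available`."""
    names = set(available)
    for name in PREFERRED_ORDER:
        if name in names:
            return name
    return None


# A build with the C++ extension: sox_io wins over sox.
print(pick_backend(["sox", "sox_io"]))   # sox_io
# An installation without the C++ extension but with PySoundFile.
print(pick_backend(["soundfile"]))       # soundfile
```

The returned name could then be passed to torchaudio.set_audio_backend().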

Common Data Structure

Structures used to exchange data between Python interface and libsox. They are used by sox and soundfile but not by sox_io.

class torchaudio.backend.common.SignalInfo(channels: Optional[int] = None, rate: Optional[float] = None, precision: Optional[int] = None, length: Optional[int] = None)

Data class returned by info functions.

Used by the sox backend and the soundfile backend.

See https://fossies.org/dox/sox-14.4.2/structsox__signalinfo__t.html

Variables
  • channels (Optional[int]) – The number of channels

  • rate (Optional[float]) – Sampling rate

  • precision (Optional[int]) – Bit depth

  • length (Optional[int]) – For sox backend, the number of samples. (frames * channels). For soundfile backend, the number of frames.
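Because the meaning of length differs between the two backends, code that needs a frame count has to know which backend produced the value. The following is a minimal sketch of that conversion; `SignalInfoSketch` and `num_frames` are hypothetical stand-ins written for this example, not part of torchaudio.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SignalInfoSketch:
    """Hypothetical stand-in with the same four fields as SignalInfo."""
    channels: Optional[int] = None
    rate: Optional[float] = None
    precision: Optional[int] = None
    length: Optional[int] = None


def num_frames(si: SignalInfoSketch, backend: str) -> int:
    """Convert `length` to a frame count using the convention stated above."""
    if si.length is None or si.channels is None:
        raise ValueError("length and channels must be set")
    if backend == "sox":
        # sox reports samples, i.e. frames * channels
        return si.length // si.channels
    if backend == "soundfile":
        # soundfile already reports frames
        return si.length
    raise ValueError(f"unexpected backend: {backend!r}")


# One second of 44.1 kHz stereo audio, as each backend would describe it.
sox_si = SignalInfoSketch(channels=2, rate=44100.0, precision=16, length=88200)
sf_si = SignalInfoSketch(channels=2, rate=44100.0, precision=16, length=44100)
print(num_frames(sox_si, "sox"), num_frames(sf_si, "soundfile"))  # 44100 44100
```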

class torchaudio.backend.common.EncodingInfo(encoding: Any = None, bits_per_sample: Optional[int] = None, compression: Optional[float] = None, reverse_bytes: Any = None, reverse_nibbles: Any = None, reverse_bits: Any = None, opposite_endian: Optional[bool] = None)

Data class returned by info functions.

Used by the sox backend and the soundfile backend.

See https://fossies.org/dox/sox-14.4.2/structsox__encodinginfo__t.html

Variables
  • encoding (Optional[int]) – sox_encoding_t

  • bits_per_sample (Optional[int]) – bit depth

  • compression (Optional[float]) – Compression option

  • reverse_bytes (Any) –

  • reverse_nibbles (Any) –

  • reverse_bits (Any) –

  • opposite_endian (Optional[bool]) –

Sox Backend

The sox backend is available on torchaudio installations with the C++ extension. It is currently not available on Windows.

It is currently the default backend when it is available. You can switch from another backend to the sox backend with the following:

torchaudio.set_audio_backend("sox")

info

torchaudio.backend.sox_backend.info(filepath: str) → Tuple[torchaudio.backend.common.SignalInfo, torchaudio.backend.common.EncodingInfo]

Gets metadata from an audio file without loading the signal.

Parameters

filepath – Path to audio file

Returns

A si (sox_signalinfo_t) signal info as a Python object, and an ei (sox_encodinginfo_t) encoding info.

Return type

(sox_signalinfo_t, sox_encodinginfo_t)

Example
>>> si, ei = torchaudio.info('foo.wav')
>>> rate, channels, encoding = si.rate, si.channels, ei.encoding

load

torchaudio.backend.sox_backend.load(filepath: str, out: Optional[torch.Tensor] = None, normalization: bool = True, channels_first: bool = True, num_frames: int = 0, offset: int = 0, signalinfo: torchaudio.backend.common.SignalInfo = None, encodinginfo: torchaudio.backend.common.EncodingInfo = None, filetype: Optional[str] = None) → Tuple[torch.Tensor, int]

Loads an audio file from disk into a tensor

Parameters
  • filepath – Path to audio file

  • out – An optional output tensor to use instead of creating one. (Default: None)

  • normalization – Optional normalization. If boolean True, then the output is divided by 1 << 31. Assuming the input is signed 32-bit audio, this normalizes to [-1, 1]. If float, then the output is divided by that number. If Callable, then the output is passed as a parameter to the given function, then the output is divided by the result. (Default: True)

  • channels_first – Set channels first or length first in result. (Default: True)

  • num_frames – Number of frames to load. 0 to load everything after the offset. (Default: 0)

  • offset – Number of frames from the start of the file to begin data loading. (Default: 0)

  • signalinfo – A sox_signalinfo_t type, which could be helpful if the audio type cannot be automatically determined. (Default: None)

  • encodinginfo – A sox_encodinginfo_t type, which could be set if the audio type cannot be automatically determined. (Default: None)

  • filetype – A filetype or extension to be set if sox cannot determine it automatically. (Default: None)

Returns

An output tensor of size [C x L] or [L x C], where L is the number of audio frames and C is the number of channels, and an integer which is the sample rate of the audio (as listed in the metadata of the file).

Return type

(Tensor, int)
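The three forms of the normalization argument described above can be sketched on a plain list of sample values. `apply_normalization` is a hypothetical helper written for this illustration, not the backend's implementation; treating False as "leave the samples unchanged" is an assumption, since the text does not describe that case.

```python
from typing import Callable, List, Union

Normalization = Union[bool, float, Callable[[List[float]], float]]


def apply_normalization(samples: List[float], normalization: Normalization) -> List[float]:
    """Mimic the documented rules on a plain list of sample values."""
    if callable(normalization):
        # The samples are passed to the function; its result is the divisor.
        divisor = normalization(samples)
    elif isinstance(normalization, bool):
        # bool is checked before the numeric branch because True is also an int.
        if not normalization:
            # Assumption: False means no scaling at all.
            return list(samples)
        divisor = float(1 << 31)
    else:
        divisor = float(normalization)
    return [s / divisor for s in samples]


raw = [float(1 << 30), -float(1 << 31)]  # values in signed 32-bit range
print(apply_normalization(raw, True))                                # [0.5, -1.0]
print(apply_normalization(raw, 2.0))                                 # [536870912.0, -1073741824.0]
print(apply_normalization(raw, lambda x: max(abs(v) for v in x)))    # [0.5, -1.0]
```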

Example
>>> data, sample_rate = torchaudio.load('foo.mp3')
>>> print(data.size())
torch.Size([2, 278756])
>>> print(sample_rate)
44100
>>> data_vol_normalized, _ = torchaudio.load('foo.mp3', normalization=lambda x: torch.abs(x).max())
>>> print(data_vol_normalized.abs().max())
1.

torchaudio.backend.sox_backend.load_wav(filepath, **kwargs)

Loads a wave file.

It assumes that the wav file uses 16 bit per sample that needs normalization by shifting the input right by 16 bits.

Parameters

filepath – Path to audio file

Returns

An output tensor of size [C x L] or [L x C], where L is the number of audio frames and C is the number of channels, and an integer which is the sample rate of the audio (as listed in the metadata of the file).

Return type

(Tensor, int)
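The right shift mentioned in the load_wav description can be illustrated with plain integers. The sketch assumes that a 16-bit sample arrives in a 32-bit container with its data in the upper 16 bits; `shift_to_16_bit` is a hypothetical helper, not a torchaudio function.

```python
def shift_to_16_bit(sample_32: int) -> int:
    """Drop the low 16 bits of a sample stored in a 32-bit container."""
    # Python's >> is an arithmetic shift, so the sign is preserved.
    return sample_32 >> 16


# Largest and smallest 16-bit values after being widened to 32 bits.
print(shift_to_16_bit(32767 << 16))    # 32767
print(shift_to_16_bit(-32768 << 16))   # -32768
```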

save

torchaudio.backend.sox_backend.save(filepath: str, src: torch.Tensor, sample_rate: int, precision: int = 16, channels_first: bool = True) → None

Saves a Tensor on file as an audio file

Parameters
  • filepath – Path to audio file

  • src – An input 2D tensor of shape [C x L] or [L x C] where L is the number of audio frames, C is the number of channels

  • sample_rate – An integer which is the sample rate of the audio (as listed in the metadata of the file)

  • precision – Bit precision. (Default: 16)

  • channels_first (bool, optional) – Set channels first or length first in result. (Default: True)
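The difference between the [C x L] and [L x C] layouts selected by channels_first can be shown with nested lists instead of tensors. `to_channels_first` is a hypothetical helper for this illustration only.

```python
from typing import List


def to_channels_first(frames_first: List[List[float]]) -> List[List[float]]:
    """Transpose [L x C] data (one row per frame) into [C x L]."""
    return [list(channel) for channel in zip(*frames_first)]


# Three frames of stereo audio, one [left, right] pair per frame.
frames_first = [[0.1, -0.1], [0.2, -0.2], [0.3, -0.3]]
channels_first = to_channels_first(frames_first)
print(channels_first)  # [[0.1, 0.2, 0.3], [-0.1, -0.2, -0.3]]
```

With a 2D tensor, the same change of layout is a transpose of the two dimensions.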

others

torchaudio.backend.sox_backend.get_sox_bool(i: int = 0) → Any

Get enum of sox_bool for sox encodinginfo options.

Parameters

i (int, optional) – Selects the enum member. Use __members__ to get a dict of all possible options. (Default: sox_false or 0)

Returns

A sox_bool type

Return type

sox_bool

torchaudio.backend.sox_backend.get_sox_encoding_t(i: int = None) → torchaudio.backend.common.EncodingInfo

Get enum of sox_encoding_t for sox encodings.

Parameters

i (int, optional) – Selects the enum member. Use __members__ to get a dict of all possible options. (Default: None)

Returns

A sox_encoding_t type for output encoding

Return type

sox_encoding_t

torchaudio.backend.sox_backend.get_sox_option_t(i: int = 2) → Any

Get enum of sox_option_t for sox encodinginfo options.

Parameters

i (int, optional) – Selects the enum member. Use __members__ to get a dict of all possible options. (Default: sox_option_default or 2)

Returns

A sox_option_t type

Return type

sox_option_t
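To show what selecting an enum member by integer and listing options through __members__ looks like, here is a sketch using Python's standard enum module. `SoxOptionSketch` is a hypothetical stand-in, not the object the backend returns; the value 2 for sox_option_default comes from the text above, while 0 and 1 for the other two members are assumptions.

```python
from enum import IntEnum


class SoxOptionSketch(IntEnum):
    """Hypothetical Python stand-in for sox_option_t (0 and 1 are assumed)."""
    sox_option_no = 0
    sox_option_yes = 1
    sox_option_default = 2


# __members__ maps every option name to its enum member.
for name, member in SoxOptionSketch.__members__.items():
    print(name, int(member))

# Selecting by integer, in the spirit of the `i` argument described above.
default_option = SoxOptionSketch(2)
print(default_option.name)  # sox_option_default
```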

torchaudio.backend.sox_backend.save_encinfo(filepath: str, src: torch.Tensor, channels_first: bool = True, signalinfo: Optional[torchaudio.backend.common.SignalInfo] = None, encodinginfo: Optional[torchaudio.backend.common.EncodingInfo] = None, filetype: Optional[str] = None) → None

Saves a tensor of an audio signal to disk as a standard format like mp3, wav, etc.

Parameters
  • filepath (str) – Path to audio file

  • src (Tensor) – An input 2D tensor of shape [C x L] or [L x C] where L is the number of audio frames, C is the number of channels

  • channels_first (bool, optional) – Set channels first or length first in result. (Default: True)

  • signalinfo (sox_signalinfo_t, optional) – A sox_signalinfo_t type, which could be helpful if the audio type cannot be automatically determined (Default: None).

  • encodinginfo (sox_encodinginfo_t, optional) – A sox_encodinginfo_t type, which could be set if the audio type cannot be automatically determined (Default: None).

  • filetype (str, optional) – A filetype or extension to be set if sox cannot determine it automatically. (Default: None)

Example
>>> data, sample_rate = torchaudio.load('foo.mp3')
>>> torchaudio.save('foo.wav', data, sample_rate)

torchaudio.backend.sox_backend.sox_encodinginfo_t() → torchaudio.backend.common.EncodingInfo

Create a sox_encodinginfo_t object. This object can be used to set the encoding type, bit precision, compression factor, reverse bytes, reverse nibbles, reverse bits and endianness. It can be used in an effects chain to encode the final output or to save a file with a specific encoding. For example, one could use the sox ulaw encoding to do 8-bit ulaw encoding. Note that in a tensor output the result will be a 32-bit number, but the number of unique values will be determined by the bit precision.

Returns: sox_encodinginfo_t(object)
  • encoding (sox_encoding_t), output encoding

  • bits_per_sample (int), bit precision, same as precision in sox_signalinfo_t

  • compression (float), compression for lossy formats, 0.0 for default compression

  • reverse_bytes (sox_option_t), reverse bytes, use sox_option_default

  • reverse_nibbles (sox_option_t), reverse nibbles, use sox_option_default

  • reverse_bits (sox_option_t), reverse bits, use sox_option_default

  • opposite_endian (sox_bool), change endianness, use sox_false

Example
>>> ei = torchaudio.sox_encodinginfo_t()
>>> ei.encoding = torchaudio.get_sox_encoding_t(1)
>>> ei.bits_per_sample = 16
>>> ei.compression = 0
>>> ei.reverse_bytes = torchaudio.get_sox_option_t(2)
>>> ei.reverse_nibbles = torchaudio.get_sox_option_t(2)
>>> ei.reverse_bits = torchaudio.get_sox_option_t(2)
>>> ei.opposite_endian = torchaudio.get_sox_bool(0)

torchaudio.backend.sox_backend.sox_signalinfo_t() → torchaudio.backend.common.SignalInfo

Create a sox_signalinfo_t object. This object can be used to set the sample rate, number of channels, length, bit precision and headroom multiplier, primarily for effects.

Returns: sox_signalinfo_t(object)
  • rate (float), sample rate as a float; in practice it is likely an integer-valued float

  • channels (int), number of audio channels

  • precision (int), bit precision

  • length (int), length of audio in samples * channels, 0 for unspecified and -1 for unknown

  • mult (float, optional), headroom multiplier for effects and None for no multiplier

Example
>>> si = torchaudio.sox_signalinfo_t()
>>> si.channels = 1
>>> si.rate = 16000.
>>> si.precision = 16
>>> si.length = 0

Sox IO Backend

The sox_io backend is available on torchaudio installations with the C++ extension. It is currently not available on Windows.

This new backend is recommended over the sox backend. You can switch from another backend to the sox_io backend with the following:

torchaudio.set_audio_backend("sox_io")

Function calls to this backend can be TorchScript-able. You can apply torch.jit.script() and dump the object to a file, then call it from a C++ application.

info

class torchaudio.backend.sox_io_backend.AudioMetaData(sample_rate: int, num_frames: int, num_channels: int)

Data class to be returned by info().

Variables
  • sample_rate (int) – Sample rate

  • num_frames (int) – The number of frames

  • num_channels (int) – The number of channels

torchaudio.backend.sox_io_backend.info(filepath: str) → torchaudio.backend.sox_io_backend.AudioMetaData

Get signal information of an audio file.

Parameters

filepath (str) – Path to audio file

Returns

Metadata of the given audio.

Return type

AudioMetaData
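For readers who want to see the three metadata fields on a concrete file, the sketch below reads them from a WAV header using Python's standard wave module rather than torchaudio. `MetaDataSketch` and `wav_info` are hypothetical names that only mirror the fields of AudioMetaData; this is not how the sox_io backend is implemented.

```python
import os
import tempfile
import wave
from typing import NamedTuple


class MetaDataSketch(NamedTuple):
    """Hypothetical stand-in mirroring the three AudioMetaData fields."""
    sample_rate: int
    num_frames: int
    num_channels: int


def wav_info(path: str) -> MetaDataSketch:
    """Read WAV header fields with the standard library; no samples are decoded."""
    with wave.open(path, "rb") as f:
        return MetaDataSketch(f.getframerate(), f.getnframes(), f.getnchannels())


# Write 100 frames of 16-bit stereo silence to a temporary file, then inspect it.
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as f:
    f.setnchannels(2)
    f.setsampwidth(2)          # bytes per sample, i.e. 16-bit
    f.setframerate(16000)
    f.writeframes(b"\x00" * (100 * 2 * 2))  # frames * channels * bytes

meta = wav_info(path)
print(meta)
```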

load

torchaudio.backend.sox_io_backend.load(filepath: str, frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True) → Tuple[torch.Tensor, int]

Load audio data from file.

Note

This function can handle all the codecs that the underlying libsox can handle; however, it is tested on the following formats:

  • WAV

    • 32-bit floating-point

    • 32-bit signed integer

    • 16-bit signed integer

    • 8-bit unsigned integer

  • MP3

  • FLAC

  • OGG/VORBIS

  • OPUS

To load MP3, FLAC, OGG/VORBIS, OPUS and other codecs libsox does not handle natively, your installation of torchaudio has to be linked to libsox and corresponding codec libraries such as libmad or libmp3lame etc.

By default (normalize=True, channels_first=True), this function returns a Tensor with float32 dtype and the shape [channel, time]. The samples are normalized to fit in the range [-1.0, 1.0].

When the input format is WAV with an integer type, such as 32-bit signed integer, 16-bit signed integer or 8-bit unsigned integer (24-bit signed integer is not supported), providing normalize=False makes this function return an integer Tensor whose samples are expressed within the whole range of the corresponding dtype: int32 for 32-bit signed PCM, int16 for 16-bit signed PCM and uint8 for 8-bit unsigned PCM.

The normalize parameter has no effect on 32-bit floating-point WAV and other formats, such as FLAC and MP3. For these formats, this function always returns a float32 Tensor with values normalized to [-1.0, 1.0].
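The dtype rules in the preceding paragraphs amount to a small decision table, sketched below. `result_dtype` is a hypothetical helper that returns dtype names as strings; raising an error for 24-bit integer WAV with normalize=False is a choice made for this sketch, since the text only says that case is not supported.

```python
def result_dtype(fmt: str, encoding: str, bits: int, normalize: bool) -> str:
    """Return the tensor dtype name the rules above imply for a given input."""
    integer_wav = {
        ("signed", 32): "int32",
        ("signed", 16): "int16",
        ("unsigned", 8): "uint8",
    }
    if fmt == "wav" and not normalize:
        if (encoding, bits) == ("signed", 24):
            # Sketch-only choice: the text says this case is not supported.
            raise NotImplementedError("24-bit signed integer WAV is not supported")
        if (encoding, bits) in integer_wav:
            return integer_wav[(encoding, bits)]
    # normalize=True, float WAV, and every non-WAV format yield float32.
    return "float32"


print(result_dtype("wav", "signed", 16, normalize=False))   # int16
print(result_dtype("wav", "signed", 16, normalize=True))    # float32
print(result_dtype("wav", "float", 32, normalize=False))    # float32
print(result_dtype("flac", "signed", 16, normalize=False))  # float32
```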

Parameters
  • filepath (str) – Path to audio file

  • frame_offset (int) – Number of frames to skip before starting to read data.

  • num_frames (int) – Maximum number of frames to read. -1 reads all the remaining samples, starting from frame_offset. This function may return fewer frames if there are not enough frames in the given file.

  • normalize (bool) – When True, this function always returns float32, and sample values are normalized to [-1.0, 1.0]. If the input file is an integer WAV, giving False changes the resulting Tensor type to an integer type. This argument has no effect for formats other than integer WAV.

  • channels_first (bool) – When True, the returned Tensor has dimension [channel, time]. Otherwise, the returned Tensor’s dimension is [time, channel].
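The frame_offset and num_frames semantics described above behave much like list slicing. The sketch below applies them to a list of frame indices; `select_frames` is a hypothetical helper for illustration, not part of torchaudio.

```python
from typing import List


def select_frames(frames: List[int], frame_offset: int = 0, num_frames: int = -1) -> List[int]:
    """Apply the documented frame_offset / num_frames semantics to a list."""
    if num_frames == -1:
        # -1 means "everything from frame_offset to the end".
        return frames[frame_offset:]
    # Slicing past the end silently returns fewer frames.
    return frames[frame_offset:frame_offset + num_frames]


frames = list(range(10))
print(select_frames(frames, frame_offset=4))                # [4, 5, 6, 7, 8, 9]
print(select_frames(frames, frame_offset=2, num_frames=3))  # [2, 3, 4]
print(select_frames(frames, frame_offset=8, num_frames=5))  # [8, 9]
```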

Returns

The loaded audio tensor and the sample rate. If the input file is an integer WAV and normalization is off, the tensor has an integer type; otherwise it has float32 type. If channels_first=True, its shape is [channel, time]; otherwise [time, channel].

Return type

(torch.Tensor, int)

torchaudio.backend.sox_io_backend.load_wav(filepath: str, frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True) → Tuple[torch.Tensor, int]

Load audio data from file.

Note

This function can handle all the codecs that the underlying libsox can handle; however, it is tested on the following formats:

  • WAV

    • 32-bit floating-point

    • 32-bit signed integer

    • 16-bit signed integer

    • 8-bit unsigned integer

  • MP3

  • FLAC

  • OGG/VORBIS

  • OPUS

To load MP3, FLAC, OGG/VORBIS, OPUS and other codecs libsox does not handle natively, your installation of torchaudio has to be linked to libsox and corresponding codec libraries such as libmad or libmp3lame etc.

By default (normalize=True, channels_first=True), this function returns a Tensor with float32 dtype and the shape [channel, time]. The samples are normalized to fit in the range [-1.0, 1.0].

When the input format is WAV with an integer type, such as 32-bit signed integer, 16-bit signed integer or 8-bit unsigned integer (24-bit signed integer is not supported), providing normalize=False makes this function return an integer Tensor whose samples are expressed within the whole range of the corresponding dtype: int32 for 32-bit signed PCM, int16 for 16-bit signed PCM and uint8 for 8-bit unsigned PCM.

The normalize parameter has no effect on 32-bit floating-point WAV and other formats, such as FLAC and MP3. For these formats, this function always returns a float32 Tensor with values normalized to [-1.0, 1.0].

Parameters
  • filepath (str) – Path to audio file

  • frame_offset (int) – Number of frames to skip before starting to read data.

  • num_frames (int) – Maximum number of frames to read. -1 reads all the remaining samples, starting from frame_offset. This function may return fewer frames if there are not enough frames in the given file.

  • normalize (bool) – When True, this function always returns float32, and sample values are normalized to [-1.0, 1.0]. If the input file is an integer WAV, giving False changes the resulting Tensor type to an integer type. This argument has no effect for formats other than integer WAV.

  • channels_first (bool) – When True, the returned Tensor has dimension [channel, time]. Otherwise, the returned Tensor’s dimension is [time, channel].

Returns

The loaded audio tensor and the sample rate. If the input file is an integer WAV and normalization is off, the tensor has an integer type; otherwise it has float32 type. If channels_first=True, its shape is [channel, time]; otherwise [time, channel].

Return type

(torch.Tensor, int)

save

torchaudio.backend.sox_io_backend.save(filepath: str, tensor: torch.Tensor, sample_rate: int, channels_first: bool = True, compression: Optional[float] = None)

Save audio data to file.

Note

Supported formats are:

  • WAV

    • 32-bit floating-point

    • 32-bit signed integer

    • 16-bit signed integer

    • 8-bit unsigned integer

  • MP3

  • FLAC

  • OGG/VORBIS

To save MP3, FLAC, OGG/VORBIS, and other codecs libsox does not handle natively, your installation of torchaudio has to be linked to libsox and corresponding codec libraries such as libmad or libmp3lame etc.

Parameters
  • filepath (str) – Path to save file.

  • tensor (torch.Tensor) – Audio data to save. Must be a 2D tensor.

  • sample_rate (int) – sampling rate

  • channels_first (bool) – If True, the given tensor is interpreted as [channel, time], otherwise [time, channel].

  • compression (Optional[float]) –

    Used for formats other than WAV. This corresponds to -C option of sox command.

    • MP3: Either bitrate (in kbps) with quality factor, such as 128.2, or
      VBR encoding with quality factor such as -4.2. Default: -4.5.
    • FLAC: compression level. Whole number from 0 to 8.
      8 is default and highest compression.
    • OGG/VORBIS: number from -1 to 10; -1 is the highest compression
      and lowest quality. Default: 3.

    See the detail at http://sox.sourceforge.net/soxformat.html.
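The per-format meaning of compression can be sketched as a small interpreter for the value. `describe_compression` is a hypothetical helper; the ranges come from the list above, while splitting an MP3 value into a whole part and a one-digit quality factor is this sketch's reading of the -C convention and should be checked against the sox documentation.

```python
from typing import Optional


def describe_compression(fmt: str, compression: Optional[float]) -> str:
    """Explain how a `compression` value would be read for each format."""
    if compression is None:
        return "backend default"
    if fmt == "flac":
        if compression != int(compression) or not 0 <= compression <= 8:
            raise ValueError("FLAC expects a whole number from 0 to 8")
        return f"compression level {int(compression)}"
    if fmt == "vorbis":
        if not -1 <= compression <= 10:
            raise ValueError("OGG/VORBIS expects a number from -1 to 10")
        return f"quality {compression}"
    if fmt == "mp3":
        whole = int(abs(compression))
        # The fractional digit is read as the quality factor; round() guards
        # against floating-point noise such as 128.2 - 128 == 0.1999...
        quality = round((abs(compression) - whole) * 10)
        if compression < 0:
            return f"VBR level {whole}, quality factor {quality}"
        return f"{whole} kbps, quality factor {quality}"
    raise ValueError(f"unhandled format: {fmt!r}")


print(describe_compression("mp3", 128.2))   # 128 kbps, quality factor 2
print(describe_compression("mp3", -4.2))    # VBR level 4, quality factor 2
print(describe_compression("flac", 8))      # compression level 8
print(describe_compression("vorbis", 3))    # quality 3
```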

Soundfile Backend

The soundfile backend is available when PySoundFile is installed. This backend works on torchaudio installations without the C++ extension (e.g., on Windows).

You can switch from another backend to the soundfile backend with the following:

torchaudio.set_audio_backend("soundfile")

info

torchaudio.backend.soundfile_backend.info(filepath: str) → Tuple[torchaudio.backend.common.SignalInfo, torchaudio.backend.common.EncodingInfo]

Gets metadata from an audio file without loading the signal.

Parameters

filepath – Path to audio file

Returns

A si (sox_signalinfo_t) signal info as a Python object, and an ei (sox_encodinginfo_t) encoding info.

Return type

(sox_signalinfo_t, sox_encodinginfo_t)

Example
>>> si, ei = torchaudio.info('foo.wav')
>>> rate, channels, encoding = si.rate, si.channels, ei.encoding

load

torchaudio.backend.soundfile_backend.load(filepath: str, out: Optional[torch.Tensor] = None, normalization: Optional[bool] = True, channels_first: Optional[bool] = True, num_frames: int = 0, offset: int = 0, signalinfo: torchaudio.backend.common.SignalInfo = None, encodinginfo: torchaudio.backend.common.EncodingInfo = None, filetype: Optional[str] = None) → Tuple[torch.Tensor, int]

Loads an audio file from disk into a tensor

Parameters
  • filepath – Path to audio file

  • out – An optional output tensor to use instead of creating one. (Default: None)

  • normalization – Optional normalization. If boolean True, then the output is divided by 1 << 31. Assuming the input is signed 32-bit audio, this normalizes to [-1, 1]. If float, then the output is divided by that number. If Callable, then the output is passed as a parameter to the given function, then the output is divided by the result. (Default: True)

  • channels_first – Set channels first or length first in result. (Default: True)

  • num_frames – Number of frames to load. 0 to load everything after the offset. (Default: 0)

  • offset – Number of frames from the start of the file to begin data loading. (Default: 0)

  • signalinfo – A sox_signalinfo_t type, which could be helpful if the audio type cannot be automatically determined. (Default: None)

  • encodinginfo – A sox_encodinginfo_t type, which could be set if the audio type cannot be automatically determined. (Default: None)

  • filetype – A filetype or extension to be set if sox cannot determine it automatically. (Default: None)

Returns

An output tensor of size [C x L] or [L x C], where L is the number of audio frames and C is the number of channels, and an integer which is the sample rate of the audio (as listed in the metadata of the file).

Return type

(Tensor, int)

Example
>>> data, sample_rate = torchaudio.load('foo.mp3')
>>> print(data.size())
torch.Size([2, 278756])
>>> print(sample_rate)
44100
>>> data_vol_normalized, _ = torchaudio.load('foo.mp3', normalization=lambda x: torch.abs(x).max())
>>> print(data_vol_normalized.abs().max())
1.

torchaudio.backend.soundfile_backend.load_wav(filepath, **kwargs)

Loads a wave file.

It assumes that the wav file uses 16 bit per sample that needs normalization by shifting the input right by 16 bits.

Parameters

filepath – Path to audio file

Returns

An output tensor of size [C x L] or [L x C], where L is the number of audio frames and C is the number of channels, and an integer which is the sample rate of the audio (as listed in the metadata of the file).

Return type

(Tensor, int)

save

torchaudio.backend.soundfile_backend.save(filepath: str, src: torch.Tensor, sample_rate: int, precision: int = 16, channels_first: bool = True) → None

Saves a Tensor on file as an audio file

Parameters
  • filepath – Path to audio file

  • src – An input 2D tensor of shape [C x L] or [L x C] where L is the number of audio frames, C is the number of channels

  • sample_rate – An integer which is the sample rate of the audio (as listed in the metadata of the file)

  • precision – Bit precision. (Default: 16)

  • channels_first (bool, optional) – Set channels first or length first in result. (Default: True)
