torchaudio.backend¶
Overview¶
The torchaudio.backend module provides implementations of the audio file I/O functions torchaudio.info, torchaudio.load, and torchaudio.save.
There are currently two implementations available.
“sox_io” (default on Linux/macOS)
“soundfile” (default on Windows)
Note
Instead of calling functions in torchaudio.backend directly, use torchaudio.info, torchaudio.load, and torchaudio.save with the appropriate backend selected via torchaudio.set_audio_backend().
Availability¶
The "sox_io" backend requires a C++ extension module, which is included in the Linux/macOS binary distributions. This backend is not available on Windows.
The "soundfile" backend requires SoundFile. Refer to the SoundFile documentation for installation instructions.
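The platform defaults and availability rules above can be sketched as a small helper. This is only a sketch, assuming a torchaudio release that matches this page (one that still provides torchaudio.set_audio_backend()). The helper names default_backend_name, soundfile_installed, and select_backend are hypothetical and not part of torchaudio.

```python
import importlib.util
import sys


def default_backend_name(platform: str = sys.platform) -> str:
    """Return the documented default backend name for a platform string."""
    # "soundfile" is the default on Windows; "sox_io" on Linux/macOS.
    return "soundfile" if platform.startswith("win") else "sox_io"


def soundfile_installed() -> bool:
    """Check whether the SoundFile package can be found by the import system."""
    return importlib.util.find_spec("soundfile") is not None


def select_backend() -> str:
    """Set the documented default backend for this platform and return its name."""
    import torchaudio  # imported lazily; the helpers above do not need it

    name = default_backend_name()
    if name == "soundfile" and not soundfile_installed():
        raise RuntimeError("the soundfile backend requires the SoundFile package")
    torchaudio.set_audio_backend(name)
    return name
```

After select_backend() returns, calls to torchaudio.info, torchaudio.load, and torchaudio.save are dispatched to the chosen backend.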
Common Data Structure¶
Structures used to report the metadata of audio files.
AudioMetaData¶
class torchaudio.backend.common.AudioMetaData(sample_rate: int, num_frames: int, num_channels: int, bits_per_sample: int, encoding: str)
Return type of the torchaudio.info function.
This class is used by the “sox_io” backend and the “soundfile” backend with the new interface.
- Variables
sample_rate (int) – Sample rate
num_frames (int) – The number of frames
num_channels (int) – The number of channels
bits_per_sample (int) – The number of bits per sample. This is 0 for lossy formats, or when it cannot be accurately inferred.
encoding (str) – Audio encoding. The value is one of the following:
PCM_S: Signed integer linear PCM
PCM_U: Unsigned integer linear PCM
PCM_F: Floating point linear PCM
FLAC: Flac, Free Lossless Audio Codec
ULAW: Mu-law
ALAW: A-law
MP3: MP3, MPEG-1 Audio Layer III
VORBIS: OGG Vorbis
AMR_WB: Adaptive Multi-Rate Wideband
AMR_NB: Adaptive Multi-Rate
OPUS: Opus
HTK: Single channel 16-bit PCM
UNKNOWN: None of the above
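As an illustration of how these fields combine, the following sketch derives a duration and a raw payload size from the metadata. MetaDataStandIn is a hypothetical stand-in with the same field names as AudioMetaData, so the example runs without torchaudio; the helper functions are likewise illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaDataStandIn:
    """Hypothetical stand-in with the same fields as AudioMetaData."""
    sample_rate: int
    num_frames: int
    num_channels: int
    bits_per_sample: int
    encoding: str


def duration_seconds(meta) -> float:
    """Duration derived from the frame count and the sample rate."""
    return meta.num_frames / meta.sample_rate


def pcm_payload_bytes(meta) -> Optional[int]:
    """Size of the raw sample data, or None when bits_per_sample is 0."""
    if meta.bits_per_sample == 0:
        # Lossy formats (or unknown bit depth) report 0 bits per sample.
        return None
    return meta.num_frames * meta.num_channels * meta.bits_per_sample // 8


meta = MetaDataStandIn(16000, 32000, 2, 16, "PCM_S")
print(duration_seconds(meta))   # 2.0
print(pcm_payload_bytes(meta))  # 128000
```

The same helpers should work on the object returned by torchaudio.info, since they only read the attributes listed above.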
Sox IO Backend¶
The sox_io backend is available and the default on Linux/macOS; it is not available on Windows.
I/O functions of this backend support TorchScript.
You can switch from another backend to the sox_io backend with the following:
torchaudio.set_audio_backend("sox_io")
info¶
torchaudio.backend.sox_io_backend.info(filepath: str, format: Optional[str] = None) → torchaudio.backend.common.AudioMetaData
Get signal information of an audio file.
- Parameters
filepath (path-like object or file-like object) – Source of audio data. When the function is not compiled by TorchScript (e.g. with torch.jit.script), the following types are accepted:
path-like: file path
file-like: object with a read(size: int) -> bytes method, which returns a byte string of at most size length.
When the function is compiled by TorchScript, only the str type is allowed.
Note
When the input is a file-like object, this function cannot get the correct length (num_frames) for certain formats, such as mp3 and vorbis. In this case, the value of num_frames is 0.
This argument is intentionally annotated as str only, due to TorchScript compiler compatibility.
format (str or None, optional) – Override the format detection with the given format. Providing this argument might help when libsox cannot infer the format from the header or extension.
- Returns
Metadata of the given audio.
- Return type
torchaudio.backend.common.AudioMetaData
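The file-like requirement described under filepath can be illustrated with a small wrapper. This is a sketch only: ChunkedReader and info_from_bytes are hypothetical names, and the call assumes a torchaudio release that matches this page. As noted above, file-like input is not available when the function is compiled by TorchScript.

```python
import io


class ChunkedReader:
    """Hypothetical file-like object exposing only read(size) -> bytes."""

    def __init__(self, data: bytes) -> None:
        self._buffer = io.BytesIO(data)

    def read(self, size: int) -> bytes:
        # Returns at most `size` bytes; an empty result signals end of data.
        return self._buffer.read(size)


def info_from_bytes(data: bytes, format: str):
    """Query metadata for in-memory audio using the active backend."""
    import torchaudio  # imported lazily; ChunkedReader itself needs only io

    # Passing `format` helps because there is no file extension to inspect.
    return torchaudio.info(ChunkedReader(data), format=format)
```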
load¶
torchaudio.backend.sox_io_backend.load(filepath: str, frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True, format: Optional[str] = None) → Tuple[torch.Tensor, int]
Load audio data from file.
Note
This function can handle all the codecs that the underlying libsox can handle; however, it is tested on the following formats:
WAV, AMB: 32-bit floating-point, 32-bit signed integer, 24-bit signed integer, 16-bit signed integer, 8-bit unsigned integer (WAV only)
MP3
FLAC
OGG/VORBIS
OPUS
SPHERE
AMR-NB
To load MP3, FLAC, OGG/VORBIS, OPUS, and other codecs that libsox does not handle natively, your installation of torchaudio has to be linked to libsox and the corresponding codec libraries, such as libmad or libmp3lame.
By default (normalize=True, channels_first=True), this function returns a Tensor with float32 dtype and the shape [channel, time].
Warning
The normalize argument does not perform volume normalization. It only converts the sample type from the native sample type to torch.float32.
When the input format is WAV with an integer type, such as 32-bit signed integer, 16-bit signed integer, 24-bit signed integer, or 8-bit unsigned integer, providing normalize=False makes this function return an integer Tensor, where the samples are expressed within the whole range of the corresponding dtype: an int32 tensor for 32-bit signed PCM, int16 for 16-bit signed PCM, and uint8 for 8-bit unsigned PCM. Since torch does not support an int24 dtype, 24-bit signed PCM is converted to int32 tensors.
The normalize argument has no effect on 32-bit floating-point WAV and other formats, such as flac and mp3. For these formats, this function always returns a float32 Tensor with values normalized to [-1.0, 1.0].
- Parameters
filepath (path-like object or file-like object) – Source of audio data. When the function is not compiled by TorchScript (e.g. with torch.jit.script), the following types are accepted:
path-like: file path
file-like: object with a read(size: int) -> bytes method, which returns a byte string of at most size length.
When the function is compiled by TorchScript, only the str type is allowed.
Note: This argument is intentionally annotated as str only, due to TorchScript compiler compatibility.
frame_offset (int) – Number of frames to skip before starting to read data.
num_frames (int, optional) – Maximum number of frames to read. -1 reads all the remaining samples, starting from frame_offset. This function may return fewer frames if the given file does not contain enough frames.
normalize (bool, optional) – When True, this function converts the native sample type to float32. Default: True. If the input file is integer WAV, giving False changes the resulting Tensor type to an integer type. This argument has no effect for formats other than integer WAV.
channels_first (bool, optional) – When True, the returned Tensor has dimension [channel, time]. Otherwise, the returned Tensor’s dimension is [time, channel].
format (str or None, optional) – Override the format detection with the given format. Providing this argument might help when libsox cannot infer the format from the header or extension.
- Returns
Resulting Tensor and sample rate. If the input file has integer WAV format and normalize=False, the Tensor has an integer type; otherwise it has float32 type. If channels_first=True, it has shape [channel, time]; otherwise [time, channel].
- Return type
(torch.Tensor, int)
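The frame_offset and num_frames semantics described above lend themselves to loading a time window of a file. The following is a minimal sketch, assuming a torchaudio release that matches this page. The helpers seconds_to_frames, frames_returned, and load_segment are hypothetical, and frames_returned only restates the documented behaviour rather than the library's implementation.

```python
def seconds_to_frames(seconds: float, sample_rate: int) -> int:
    """Convert a time in seconds to a frame count at the given sample rate."""
    return int(round(seconds * sample_rate))


def frames_returned(total_frames: int, frame_offset: int = 0, num_frames: int = -1) -> int:
    """Number of frames a load call yields under the documented semantics."""
    remaining = max(total_frames - frame_offset, 0)
    # -1 means "everything after frame_offset"; otherwise the request is
    # capped by what the file still contains.
    return remaining if num_frames == -1 else min(num_frames, remaining)


def load_segment(path: str, start_sec: float, duration_sec: float):
    """Load only [start_sec, start_sec + duration_sec) from an audio file."""
    import torchaudio  # imported lazily; the helpers above do not need it

    sample_rate = torchaudio.info(path).sample_rate
    return torchaudio.load(
        path,
        frame_offset=seconds_to_frames(start_sec, sample_rate),
        num_frames=seconds_to_frames(duration_sec, sample_rate),
    )
```

load_segment returns the same (Tensor, sample rate) pair as torchaudio.load; with the default channels_first=True the Tensor has shape [channel, time].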
save¶
torchaudio.backend.sox_io_backend.save(filepath: str, src: torch.Tensor, sample_rate: int, channels_first: bool = True, compression: Optional[float] = None, format: Optional[str] = None, encoding: Optional[str] = None, bits_per_sample: Optional[int] = None)
Save audio data to file.
- Parameters
filepath (str or pathlib.Path) – Path to save the file. This function also handles pathlib.Path objects, but is annotated as str for TorchScript compiler compatibility.
src (torch.Tensor) – Audio data to save. Must be a 2D tensor.
sample_rate (int) – Sampling rate.
channels_first (bool, optional) – If True, the given tensor is interpreted as [channel, time], otherwise [time, channel].
compression (float or None, optional) – Used for formats other than WAV. This corresponds to the -C option of the sox command.
"mp3": Either bitrate (in kbps) with quality factor, such as 128.2, or VBR encoding with quality factor, such as -4.2. Default: -4.5.
"flac": Whole number from 0 to 8. 8 is the default and highest compression.
"ogg", "vorbis": Number from -1 to 10; -1 is the highest compression and lowest quality. Default: 3.
See the details at http://sox.sourceforge.net/soxformat.html.
format (str or None, optional) – Override the audio format. When the filepath argument is a path-like object, the audio format is inferred from the file extension. If the file extension is missing or different, you can specify the correct format with this argument. When the filepath argument is a file-like object, this argument is required. Valid values are "wav", "mp3", "ogg", "vorbis", "amr-nb", "amb", "flac", "sph", "gsm", and "htk".
encoding (str or None, optional) – Changes the encoding for the supported formats. This argument is effective only for supported formats, such as "wav", "amb" and "sph". Valid values are:
"PCM_S" (signed integer Linear PCM)
"PCM_U" (unsigned integer Linear PCM)
"PCM_F" (floating point PCM)
"ULAW" (mu-law)
"ALAW" (a-law)
- Default values
If not provided, the default value is picked based on format and bits_per_sample.
"wav", "amb":
- If both encoding and bits_per_sample are not provided, the dtype of the Tensor is used to determine the default value: "PCM_U" if dtype is uint8, "PCM_S" if dtype is int16 or int32, "PCM_F" if dtype is float32.
- "PCM_U" if bits_per_sample=8
- "PCM_S" otherwise
"sph": the default value is "PCM_S".
bits_per_sample (int or None, optional) – Changes the bit depth for the supported formats. When format is one of "wav", "flac", "sph", or "amb", you can change the bit depth. Valid values are 8, 16, 32 and 64.
- Default values
If not provided, the default value is picked based on format and encoding.
"wav", "amb":
- If both encoding and bits_per_sample are not provided, the dtype of the Tensor is used: 8 if dtype is uint8, 16 if dtype is int16, 32 if dtype is int32 or float32.
- 8 if encoding is "PCM_U", "ULAW" or "ALAW"
- 16 if encoding is "PCM_S"
- 32 if encoding is "PCM_F"
"flac": the default value is 24.
"sph":
- 16 if encoding is "PCM_U", "PCM_S", "PCM_F" or not provided.
- 8 if encoding is "ULAW" or "ALAW"
"amb":
- 8 if encoding is "PCM_U", "ULAW" or "ALAW"
- 16 if encoding is "PCM_S" or not provided.
- 32 if encoding is "PCM_F"
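The default rules for "wav" listed above can be restated as a small lookup. This is only a sketch of the rules as written on this page, not torchaudio's implementation: resolve_wav_defaults is a hypothetical helper, and the dtype is passed as a plain string so the example runs without torch.

```python
from typing import Optional, Tuple

_ENCODING_BY_DTYPE = {"uint8": "PCM_U", "int16": "PCM_S", "int32": "PCM_S", "float32": "PCM_F"}
_BITS_BY_DTYPE = {"uint8": 8, "int16": 16, "int32": 32, "float32": 32}
_BITS_BY_ENCODING = {"PCM_U": 8, "ULAW": 8, "ALAW": 8, "PCM_S": 16, "PCM_F": 32}


def resolve_wav_defaults(
    dtype: str,
    encoding: Optional[str] = None,
    bits_per_sample: Optional[int] = None,
) -> Tuple[str, int]:
    """Fill in missing encoding / bits_per_sample for the "wav" format."""
    if encoding is None and bits_per_sample is None:
        # Neither given: both defaults come from the Tensor dtype.
        return _ENCODING_BY_DTYPE[dtype], _BITS_BY_DTYPE[dtype]
    if encoding is None:
        # Only the bit depth given: 8-bit means unsigned, anything else signed.
        encoding = "PCM_U" if bits_per_sample == 8 else "PCM_S"
    if bits_per_sample is None:
        # Only the encoding given: the bit depth follows from it.
        bits_per_sample = _BITS_BY_ENCODING[encoding]
    return encoding, bits_per_sample


print(resolve_wav_defaults("float32"))                   # ('PCM_F', 32)
print(resolve_wav_defaults("float32", encoding="ULAW"))  # ('ULAW', 8)
```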
Supported formats/encodings/bit depths/compression are:
"wav", "amb"
32-bit floating-point PCM
32-bit signed integer PCM
24-bit signed integer PCM
16-bit signed integer PCM
8-bit unsigned integer PCM
8-bit mu-law
8-bit a-law
Note: Default encoding/bit depth is determined by the dtype of the input Tensor.
"mp3"
Fixed bit rate (such as 128 kbps) and variable bit rate compression. Default: VBR with high quality.
"flac"
8-bit
16-bit
24-bit (default)
"ogg", "vorbis"
Different quality levels. Default: approx. 112 kbps
"sph"
8-bit signed integer PCM
16-bit signed integer PCM
24-bit signed integer PCM
32-bit signed integer PCM (default)
8-bit mu-law
8-bit a-law
16-bit a-law
24-bit a-law
32-bit a-law
"amr-nb"
Bitrate ranging from 4.75 kbit/s to 12.2 kbit/s. Default: 4.75 kbit/s
"gsm"
Lossy Speech Compression, CPU intensive.
"htk"
Uses a default single-channel 16-bit PCM format.
Note
To save into formats that libsox does not handle natively (such as "mp3", "flac", "ogg" and "vorbis"), your installation of torchaudio has to be linked to libsox and the corresponding codec libraries, such as libmad or libmp3lame.
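The compression defaults and ranges documented for this function can be collected in a small pre-check before saving. This is a sketch under stated assumptions: effective_compression and save_flac are hypothetical helpers that only restate the values listed above, and the save call assumes a torchaudio release that matches this page with FLAC support linked in.

```python
from typing import Optional

# Defaults as documented above for the compression argument.
_DEFAULT_COMPRESSION = {"mp3": -4.5, "flac": 8, "ogg": 3, "vorbis": 3}


def effective_compression(format: str, compression: Optional[float] = None):
    """Return the value to use, falling back to the documented default."""
    if compression is None:
        return _DEFAULT_COMPRESSION.get(format)
    if format == "flac" and not (float(compression).is_integer() and 0 <= compression <= 8):
        raise ValueError("flac compression must be a whole number from 0 to 8")
    if format in ("ogg", "vorbis") and not -1 <= compression <= 10:
        raise ValueError("ogg/vorbis compression must be between -1 and 10")
    return compression


def save_flac(path: str, src, sample_rate: int, compression: Optional[float] = None) -> None:
    """Save a 2D [channel, time] Tensor as FLAC with a validated compression level."""
    import torchaudio  # imported lazily; the helper above does not need it

    torchaudio.save(
        path,
        src,
        sample_rate,
        format="flac",
        compression=effective_compression("flac", compression),
    )
```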
Soundfile Backend¶
The "soundfile" backend is available when SoundFile is installed. This backend is the default on Windows.
You can switch from another backend to the "soundfile" backend with the following:
torchaudio.set_audio_backend("soundfile")
info¶
torchaudio.backend.soundfile_backend.info(filepath: str, format: Optional[str] = None) → torchaudio.backend.common.AudioMetaData
Get signal information of an audio file.
Note
The filepath argument is intentionally annotated as str only, even though it accepts pathlib.Path objects as well. This is for consistency with the "sox_io" backend, which has a restriction on type annotation due to TorchScript compiler compatibility.
- Parameters
filepath (path-like object or file-like object) – Source of audio data.
format (str or None, optional) – Not used. PySoundFile does not accept format hint.
- Returns
Metadata of the given audio.
- Return type
torchaudio.backend.common.AudioMetaData
load¶
torchaudio.backend.soundfile_backend.load(filepath: str, frame_offset: int = 0, num_frames: int = -1, normalize: bool = True, channels_first: bool = True, format: Optional[str] = None) → Tuple[torch.Tensor, int]
Load audio data from file.
Note
The formats this function can handle depend on the soundfile installation. This function is tested on the following formats:
WAV: 32-bit floating-point, 32-bit signed integer, 16-bit signed integer, 8-bit unsigned integer
FLAC
OGG/VORBIS
SPHERE
By default (normalize=True, channels_first=True), this function returns a Tensor with float32 dtype and the shape [channel, time].
Warning
The normalize argument does not perform volume normalization. It only converts the sample type from the native sample type to torch.float32.
When the input format is WAV with an integer type, such as 32-bit signed integer, 16-bit signed integer, 24-bit signed integer, or 8-bit unsigned integer, providing normalize=False makes this function return an integer Tensor, where the samples are expressed within the whole range of the corresponding dtype: an int32 tensor for 32-bit signed PCM, int16 for 16-bit signed PCM, and uint8 for 8-bit unsigned PCM. Since torch does not support an int24 dtype, 24-bit signed PCM is converted to int32 tensors.
The normalize argument has no effect on 32-bit floating-point WAV and other formats, such as flac and mp3. For these formats, this function always returns a float32 Tensor with values normalized to [-1.0, 1.0].
Note
The filepath argument is intentionally annotated as str only, even though it accepts pathlib.Path objects as well. This is for consistency with the "sox_io" backend, which has a restriction on type annotation due to TorchScript compiler compatibility.
- Parameters
filepath (path-like object or file-like object) – Source of audio data.
frame_offset (int, optional) – Number of frames to skip before starting to read data.
num_frames (int, optional) – Maximum number of frames to read. -1 reads all the remaining samples, starting from frame_offset. This function may return fewer frames if the given file does not contain enough frames.
normalize (bool, optional) – When True, this function converts the native sample type to float32. Default: True. If the input file is integer WAV, giving False changes the resulting Tensor type to an integer type. This argument has no effect for formats other than integer WAV.
channels_first (bool, optional) – When True, the returned Tensor has dimension [channel, time]. Otherwise, the returned Tensor’s dimension is [time, channel].
format (str or None, optional) – Not used. PySoundFile does not accept format hint.
- Returns
Resulting Tensor and sample rate. If the input file has integer WAV format and normalization is off, the Tensor has an integer type; otherwise it has float32 type. If channels_first=True, it has shape [channel, time]; otherwise [time, channel].
- Return type
(torch.Tensor, int)
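The effect of normalize on the returned dtype, as described for this function, can be summarised in a lookup. This is a sketch that only restates the rules on this page for WAV input: loaded_dtype is a hypothetical helper, and dtypes are given as plain strings so the example runs without torch.

```python
def loaded_dtype(encoding: str, bits_per_sample: int, normalize: bool = True) -> str:
    """Name of the dtype documented for a WAV file with the given sample type."""
    if normalize or encoding not in ("PCM_S", "PCM_U"):
        # normalize=True always yields float32, and normalize=False has no
        # effect on non-integer sample types.
        return "float32"
    if encoding == "PCM_U" and bits_per_sample == 8:
        return "uint8"
    if encoding == "PCM_S" and bits_per_sample == 16:
        return "int16"
    if encoding == "PCM_S" and bits_per_sample in (24, 32):
        # There is no int24 dtype, so 24-bit PCM is widened to int32.
        return "int32"
    raise ValueError("combination not covered by the documented rules")


print(loaded_dtype("PCM_S", 16))                   # float32
print(loaded_dtype("PCM_S", 24, normalize=False))  # int32
```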
save¶
torchaudio.backend.soundfile_backend.save(filepath: str, src: torch.Tensor, sample_rate: int, channels_first: bool = True, compression: Optional[float] = None, format: Optional[str] = None, encoding: Optional[str] = None, bits_per_sample: Optional[int] = None)
Save audio data to file.
Note
The formats this function can handle depend on the soundfile installation. This function is tested on the following formats:
WAV: 32-bit floating-point, 32-bit signed integer, 16-bit signed integer, 8-bit unsigned integer
FLAC
OGG/VORBIS
SPHERE
Note
The filepath argument is intentionally annotated as str only, even though it accepts pathlib.Path objects as well. This is for consistency with the "sox_io" backend, which has a restriction on type annotation due to TorchScript compiler compatibility.
- Parameters
filepath (str or pathlib.Path) – Path to audio file.
src (torch.Tensor) – Audio data to save. Must be a 2D tensor.
sample_rate (int) – Sampling rate.
channels_first (bool, optional) – If True, the given tensor is interpreted as [channel, time], otherwise [time, channel].
compression (float or None, optional) – Not used. It is here only for interface compatibility with the “sox_io” backend.
format (str or None, optional) – Override the audio format. When the filepath argument is a path-like object, the audio format is inferred from the file extension. If the file extension is missing or different, you can specify the correct format with this argument. When the filepath argument is a file-like object, this argument is required. Valid values are "wav", "ogg", "vorbis", "flac" and "sph".
encoding (str or None, optional) – Changes the encoding for supported formats. This argument is effective only for supported formats, such as "wav", "flac" and "sph". Valid values are:
"PCM_S" (signed integer Linear PCM)
"PCM_U" (unsigned integer Linear PCM)
"PCM_F" (floating point PCM)
"ULAW" (mu-law)
"ALAW" (a-law)
bits_per_sample (int or None, optional) – Changes the bit depth for the supported formats. When format is one of "wav", "flac" or "sph", you can change the bit depth. Valid values are 8, 16, 24, 32 and 64.
Supported formats/encodings/bit depth/compression are:
"wav"
32-bit floating-point PCM
32-bit signed integer PCM
24-bit signed integer PCM
16-bit signed integer PCM
8-bit unsigned integer PCM
8-bit mu-law
8-bit a-law
- Note:
Default encoding/bit depth is determined by the dtype of the input Tensor.
"flac"
8-bit
16-bit (default)
24-bit
"ogg", "vorbis"
Doesn’t accept changing configuration.
"sph"
8-bit signed integer PCM
16-bit signed integer PCM
24-bit signed integer PCM
32-bit signed integer PCM (default)
8-bit mu-law
8-bit a-law
16-bit a-law
24-bit a-law
32-bit a-law
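The rule for the format argument (inferred from the extension for paths, required for file-like objects) can be sketched as follows. resolve_format is a hypothetical helper that restates the behaviour described on this page with the valid values listed for this backend; it is not how the backend itself is implemented.

```python
import os
from typing import Optional

# Valid values listed above for the "soundfile" backend.
SOUNDFILE_FORMATS = ("wav", "ogg", "vorbis", "flac", "sph")


def resolve_format(filepath, format: Optional[str] = None) -> str:
    """Return the format to save with, inferring it from the path if needed."""
    if format is None:
        if hasattr(filepath, "write"):
            # A file-like object has no extension to inspect.
            raise ValueError("format is required for file-like objects")
        # Path-like object: use the file extension, lower-cased, without the dot.
        format = os.path.splitext(os.fspath(filepath))[1].lstrip(".").lower()
    if format not in SOUNDFILE_FORMATS:
        raise ValueError(f"unsupported format: {format!r}")
    return format


print(resolve_format("speech.wav"))            # wav
print(resolve_format("speech", format="ogg"))  # ogg
```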