torchaudio.models
The models subpackage contains definitions of models for addressing common audio tasks.

ConvTasNet

class torchaudio.models.ConvTasNet(num_sources: int = 2, enc_kernel_size: int = 16, enc_num_feats: int = 512, msk_kernel_size: int = 3, msk_num_feats: int = 128, msk_num_hidden_feats: int = 512, msk_num_layers: int = 8, msk_num_stacks: int = 3)[source]

Conv-TasNet: a fully-convolutional time-domain audio separation network [1].
- Parameters
num_sources (int) – The number of sources to split.
enc_kernel_size (int) – The convolution kernel size of the encoder/decoder, <L>.
enc_num_feats (int) – The feature dimensions passed to mask generator, <N>.
msk_kernel_size (int) – The convolution kernel size of the mask generator, <P>.
msk_num_feats (int) – The input/output feature dimension of conv block in the mask generator, <B, Sc>.
msk_num_hidden_feats (int) – The internal feature dimension of conv block of the mask generator, <H>.
msk_num_layers (int) – The number of layers in one conv block of the mask generator, <X>.
msk_num_stacks (int) – The number of conv blocks of the mask generator, <R>.
Note
This implementation corresponds to the “non-causal” setting in the paper.
forward(input: torch.Tensor) → torch.Tensor[source]

Perform source separation. Generate audio source waveforms.

- Parameters
input (torch.Tensor) – 3D Tensor with shape [batch, channel==1, frames]
- Returns
3D Tensor with shape [batch, channel==num_sources, frames]
- Return type
Tensor
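
- Example (a minimal usage sketch; the batch size and the one-second, 8 kHz input below are illustrative choices, not requirements)
>>> import torch
>>> from torchaudio.models import ConvTasNet
>>>
>>> model = ConvTasNet(num_sources=2)
>>> # A batch of 3 mono mixtures, shape (batch, channel==1, frames)
>>> mixture = torch.randn(3, 1, 8000)
>>> separated = model(mixture)
>>> # separated has shape (3, 2, 8000), i.e. (batch, num_sources, frames)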

DeepSpeech

class torchaudio.models.DeepSpeech(n_feature: int, n_hidden: int = 2048, n_class: int = 40, dropout: float = 0.0)[source]

DeepSpeech model architecture from [2].
- Parameters
n_feature – Number of input features.
n_hidden – Internal hidden unit size.
n_class – Number of output classes.
dropout – Dropout probability. (Default: 0.0)
forward(x: torch.Tensor) → torch.Tensor[source]

- Parameters
x (torch.Tensor) – Tensor of dimension (batch, channel, time, feature).
- Returns
Predictor tensor of dimension (batch, time, class).
- Return type
Tensor
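
- Example (a minimal sketch; the batch size, 100 time frames, and n_feature=64 are illustrative)
>>> import torch
>>> from torchaudio.models import DeepSpeech
>>>
>>> model = DeepSpeech(n_feature=64)
>>> # Input of dimension (batch, channel, time, feature)
>>> x = torch.randn(8, 1, 100, 64)
>>> out = model(x)
>>> # out has shape (8, 100, 40), i.e. (batch, time, class) with the default n_class=40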

Wav2Letter

class torchaudio.models.Wav2Letter(num_classes: int = 40, input_type: str = 'waveform', num_features: int = 1)[source]

Wav2Letter model architecture from [3].
\(\text{padding} = \frac{\text{ceil}(\text{kernel} - \text{stride})}{2}\)
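For instance, a hypothetical layer with kernel size 48 and stride 2 would use \(\text{padding} = \frac{\text{ceil}(48 - 2)}{2} = 23\).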
- Parameters
num_classes (int, optional) – Number of classes to be classified. (Default: 40)
input_type (str, optional) – Type of input accepted by the network: waveform, power_spectrum or mfcc. (Default: 'waveform')
num_features (int, optional) – Number of input features that the network will receive. (Default: 1)

forward(x: torch.Tensor) → torch.Tensor[source]

- Parameters
x (torch.Tensor) – Tensor of dimension (batch_size, num_features, input_length).
- Returns
Predictor tensor of dimension (batch_size, number_of_classes, input_length).
- Return type
Tensor
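
- Example (a minimal sketch; the batch size and waveform length are illustrative)
>>> import torch
>>> from torchaudio.models import Wav2Letter
>>>
>>> model = Wav2Letter(num_classes=40, input_type="waveform", num_features=1)
>>> # Input of dimension (batch_size, num_features, input_length)
>>> x = torch.randn(4, 1, 16000)
>>> out = model(x)
>>> # out is the (batch_size, number_of_classes, time) predictor tensor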

Wav2Vec2.0

Wav2Vec2Model

class torchaudio.models.Wav2Vec2Model(feature_extractor: torch.nn.modules.module.Module, encoder: torch.nn.modules.module.Module)[source]

Encoder model used in [4].
Note
To build the model, please use one of the factory functions.
- Parameters
feature_extractor (torch.nn.Module) – Feature extractor that extracts feature vectors from raw audio Tensor.
encoder (torch.nn.Module) – Encoder that converts the audio features into the sequence of probability distribution (in negative log-likelihood) over labels.
extract_features(waveforms: torch.Tensor, lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Extract feature vectors from raw waveforms.

- Parameters
waveforms (Tensor) – Audio tensor of shape (batch, frames).
lengths (Tensor, optional) – Indicates the valid length of each audio sample in the batch. Shape: (batch, ).
- Returns
Tensor: Feature vectors. Shape: (batch, frames, feature dimension).
Tensor, optional: Indicates the valid length of each feature in the batch, computed based on the given lengths argument. Shape: (batch, ).
- Return type
Tuple[Tensor, Optional[Tensor]]
forward(waveforms: torch.Tensor, lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Compute the sequence of probability distribution over labels.

- Parameters
waveforms (Tensor) – Audio tensor of shape (batch, frames).
lengths (Tensor, optional) – Indicates the valid length of each audio sample in the batch. Shape: (batch, ).
- Returns
Tensor: The sequences of probability distribution (in logit) over labels. Shape: (batch, frames, num labels).
Tensor, optional: Indicates the valid length of each feature in the batch, computed based on the given lengths argument. Shape: (batch, ).
- Return type
Tuple[Tensor, Optional[Tensor]]
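
- Example (a minimal sketch; num_out=32, the batch size, and the waveform lengths are illustrative; the model is built with the wav2vec2_base factory function documented below)
>>> import torch
>>> from torchaudio.models import wav2vec2_base
>>>
>>> model = wav2vec2_base(num_out=32)
>>> # Two mono waveforms zero-padded to a common length, with their valid lengths
>>> waveforms = torch.randn(2, 16000)
>>> lengths = torch.tensor([16000, 12000])
>>> features, feature_lengths = model.extract_features(waveforms, lengths)
>>> emissions, emission_lengths = model(waveforms, lengths)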

Factory Functions

wav2vec2_base

torchaudio.models.wav2vec2_base(num_out: int) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build wav2vec2.0 model with “Base” configuration from [4].

- Parameters
num_out (int) – The number of output labels.
- Returns
The resulting model.
- Return type
Wav2Vec2Model
- Example - Reload fine-tuned model from Hugging Face:
>>> # Session 1 - Convert pretrained model from Hugging Face and save the parameters.
>>> import torch
>>> from transformers import Wav2Vec2ForCTC
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = import_huggingface_model(original)
>>> torch.save(model.state_dict(), "wav2vec2-base-960h.pt")
>>>
>>> # Session 2 - Load model and the parameters
>>> from torchaudio.models import wav2vec2_base
>>>
>>> model = wav2vec2_base(num_out=32)
>>> model.load_state_dict(torch.load("wav2vec2-base-960h.pt"))

wav2vec2_large

torchaudio.models.wav2vec2_large(num_out: int) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build wav2vec2.0 model with “Large” configuration from [4].

- Parameters
num_out (int) – The number of output labels.
- Returns
The resulting model.
- Return type
Wav2Vec2Model
- Example - Reload fine-tuned model from Hugging Face:
>>> # Session 1 - Convert pretrained model from Hugging Face and save the parameters.
>>> import torch
>>> from transformers import Wav2Vec2ForCTC
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
>>> model = import_huggingface_model(original)
>>> torch.save(model.state_dict(), "wav2vec2-large-960h.pt")
>>>
>>> # Session 2 - Load model and the parameters
>>> from torchaudio.models import wav2vec2_large
>>>
>>> model = wav2vec2_large(num_out=32)
>>> model.load_state_dict(torch.load("wav2vec2-large-960h.pt"))

wav2vec2_large_lv60k

torchaudio.models.wav2vec2_large_lv60k(num_out: int) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build wav2vec2.0 model with “Large LV-60k” configuration from [4].

- Parameters
num_out (int) – The number of output labels.
- Returns
The resulting model.
- Return type
Wav2Vec2Model
- Example - Reload fine-tuned model from Hugging Face:
>>> # Session 1 - Convert pretrained model from Hugging Face and save the parameters.
>>> import torch
>>> from transformers import Wav2Vec2ForCTC
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
>>> model = import_huggingface_model(original)
>>> torch.save(model.state_dict(), "wav2vec2-large-960h-lv60-self.pt")
>>>
>>> # Session 2 - Load model and the parameters
>>> from torchaudio.models import wav2vec2_large_lv60k
>>>
>>> model = wav2vec2_large_lv60k(num_out=32)
>>> model.load_state_dict(torch.load("wav2vec2-large-960h-lv60-self.pt"))

Utility Functions

import_huggingface_model

torchaudio.models.wav2vec2.utils.import_huggingface_model(original: torch.nn.modules.module.Module) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Import wav2vec2 model from Hugging Face’s Transformers.

- Parameters
original (torch.nn.Module) – An instance of Wav2Vec2ForCTC from transformers.
- Returns
Imported model.
- Return type
Wav2Vec2Model
- Example
>>> import torchaudio
>>> from transformers import Wav2Vec2ForCTC
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = import_huggingface_model(original)
>>>
>>> waveforms, _ = torchaudio.load("audio.wav")
>>> logits, _ = model(waveforms)

import_fairseq_model

torchaudio.models.wav2vec2.utils.import_fairseq_model(original: torch.nn.modules.module.Module, num_out: Optional[int] = None) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build Wav2Vec2Model from pretrained parameters published by fairseq.

- Parameters
original (torch.nn.Module) – An instance of fairseq’s Wav2Vec2.0 model class. Either fairseq.models.wav2vec.wav2vec2_asr.Wav2VecEncoder or fairseq.models.wav2vec.wav2vec2.Wav2Vec2Model.
num_out (int, optional) – The number of output labels. Required only when the original model is an instance of fairseq.models.wav2vec.wav2vec2.Wav2Vec2Model.
- Returns
Imported model.
- Return type
Wav2Vec2Model
- Example - Loading pretrain-only model
>>> from torchaudio.models.wav2vec2.utils import import_fairseq_model
>>>
>>> # Load model using fairseq
>>> model_file = 'wav2vec_small.pt'
>>> model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_file])
>>> original = model[0]
>>> imported = import_fairseq_model(original, num_out=28)
>>>
>>> # Perform feature extraction
>>> waveform, _ = torchaudio.load('audio.wav')
>>> features, _ = imported.extract_features(waveform)
>>>
>>> # Compare result with the original model from fairseq
>>> reference = original.feature_extractor(waveform).transpose(1, 2)
>>> torch.testing.assert_allclose(features, reference)
- Example - Fine-tuned model
>>> from torchaudio.models.wav2vec2.utils import import_fairseq_model
>>>
>>> # Load model using fairseq
>>> model_file = 'wav2vec_small_960h.pt'
>>> model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_file])
>>> original = model[0]
>>> imported = import_fairseq_model(original.w2v_encoder)
>>>
>>> # Perform encoding
>>> waveform, _ = torchaudio.load('audio.wav')
>>> emission, _ = imported(waveform)
>>>
>>> # Compare result with the original model from fairseq
>>> mask = torch.zeros_like(waveform)
>>> reference = original(waveform, mask)['encoder_out'].transpose(0, 1)
>>> torch.testing.assert_allclose(emission, reference)

WaveRNN

class torchaudio.models.WaveRNN(upsample_scales: List[int], n_classes: int, hop_length: int, n_res_block: int = 10, n_rnn: int = 512, n_fc: int = 512, kernel_size: int = 5, n_freq: int = 128, n_hidden: int = 128, n_output: int = 128)[source]

WaveRNN model based on the implementation from fatchord. The original implementation was introduced in [5]. The input channels of waveform and spectrogram have to be 1. The product of upsample_scales must equal hop_length.
- Parameters
upsample_scales – the list of upsample scales.
n_classes – the number of output classes.
hop_length – the number of samples between the starts of consecutive frames.
n_res_block – the number of ResBlocks in the stack. (Default: 10)
n_rnn – the dimension of the RNN layer. (Default: 512)
n_fc – the dimension of the fully connected layer. (Default: 512)
kernel_size – the kernel size of the first Conv1d layer. (Default: 5)
n_freq – the number of bins in a spectrogram. (Default: 128)
n_hidden – the number of hidden dimensions of resblock. (Default: 128)
n_output – the number of output dimensions of melresnet. (Default: 128)
- Example
>>> wavernn = WaveRNN(upsample_scales=[5, 5, 8], n_classes=512, hop_length=200)
>>> waveform, sample_rate = torchaudio.load(file)
>>> # waveform shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length)
>>> specgram = MelSpectrogram(sample_rate)(waveform)  # shape: (n_batch, n_channel, n_freq, n_time)
>>> output = wavernn(waveform, specgram)
>>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes)
forward(waveform: torch.Tensor, specgram: torch.Tensor) → torch.Tensor[source]

Pass the input through the WaveRNN model.

- Parameters
waveform – the input waveform to the WaveRNN layer. Shape: (n_batch, 1, (n_time - kernel_size + 1) * hop_length)
specgram – the input spectrogram to the WaveRNN layer. Shape: (n_batch, 1, n_freq, n_time)
- Returns
Output tensor of shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes)
- Return type
Tensor

References

[1] Yi Luo and Nima Mesgarani. Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, Aug 2019. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167, doi:10.1109/taslp.2019.2915167.

[2] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: scaling up end-to-end speech recognition. 2014. arXiv:1412.5567.

[3] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. 2016. arXiv:1609.03193.

[4] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020. arXiv:2006.11477.

[5] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. 2018. arXiv:1802.08435.