torchaudio.models

The models subpackage contains definitions of models for addressing common audio tasks.

ConvTasNet

class torchaudio.models.ConvTasNet(num_sources: int = 2, enc_kernel_size: int = 16, enc_num_feats: int = 512, msk_kernel_size: int = 3, msk_num_feats: int = 128, msk_num_hidden_feats: int = 512, msk_num_layers: int = 8, msk_num_stacks: int = 3)[source]

Conv-TasNet: a fully-convolutional time-domain audio separation network [1].

Parameters
  • num_sources (int) – The number of sources to split.

  • enc_kernel_size (int) – The convolution kernel size of the encoder/decoder, <L>.

  • enc_num_feats (int) – The feature dimensions passed to mask generator, <N>.

  • msk_kernel_size (int) – The convolution kernel size of the mask generator, <P>.

  • msk_num_feats (int) – The input/output feature dimension of conv block in the mask generator, <B, Sc>.

  • msk_num_hidden_feats (int) – The internal feature dimension of conv block of the mask generator, <H>.

  • msk_num_layers (int) – The number of layers in one conv block of the mask generator, <X>.

  • msk_num_stacks (int) – The number of conv blocks of the mask generator, <R>.

Note

This implementation corresponds to the “non-causal” setting in the paper.

forward(input: torch.Tensor) → torch.Tensor[source]

Perform source separation and generate audio source waveforms.

Parameters

input (torch.Tensor) – 3D Tensor with shape [batch, channel==1, frames]

Returns

3D Tensor with shape [batch, channel==num_sources, frames]

Return type

torch.Tensor
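
Example

A minimal usage sketch; the batch size and frame count are arbitrary illustrative values, and only the input shape [batch, channel==1, frames] is required.
>>> import torch
>>> from torchaudio.models import ConvTasNet
>>>
>>> convtasnet = ConvTasNet(num_sources=2)
>>> # Mixture waveform: [batch, channel==1, frames]
>>> mixture = torch.randn(4, 1, 32000)
>>> separated = convtasnet(mixture)
>>> # separated: [batch, num_sources, frames] == [4, 2, 32000]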

DeepSpeech

class torchaudio.models.DeepSpeech(n_feature: int, n_hidden: int = 2048, n_class: int = 40, dropout: float = 0.0)[source]

DeepSpeech model architecture from [2].

Parameters
  • n_feature – Number of input features.

  • n_hidden – Internal hidden unit size. (Default: 2048)

  • n_class – Number of output classes. (Default: 40)

  • dropout – Dropout probability. (Default: 0.0)

forward(x: torch.Tensor) → torch.Tensor[source]
Parameters

x (torch.Tensor) – Tensor of dimension (batch, channel, time, feature).

Returns

Predictor tensor of dimension (batch, time, class).

Return type

Tensor
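
Example

A minimal usage sketch; the batch size and time length are arbitrary illustrative values, and n_feature must match the last dimension of the input.
>>> import torch
>>> from torchaudio.models import DeepSpeech
>>>
>>> deepspeech = DeepSpeech(n_feature=128, n_class=40)
>>> # Input features: (batch, channel, time, feature)
>>> x = torch.randn(4, 1, 100, 128)
>>> out = deepspeech(x)
>>> # out: (batch, time, class) == (4, 100, 40)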

Wav2Letter

class torchaudio.models.Wav2Letter(num_classes: int = 40, input_type: str = 'waveform', num_features: int = 1)[source]

Wav2Letter model architecture from [3].

\(\text{padding} = \frac{\text{ceil}(\text{kernel} - \text{stride})}{2}\)

Parameters
  • num_classes (int, optional) – Number of classes to be classified. (Default: 40)

  • input_type (str, optional) – Wav2Letter can use as input: waveform, power_spectrum or mfcc (Default: waveform).

  • num_features (int, optional) – Number of input features that the network will receive (Default: 1).

forward(x: torch.Tensor) → torch.Tensor[source]
Parameters

x (torch.Tensor) – Tensor of dimension (batch_size, num_features, input_length).

Returns

Predictor tensor of dimension (batch_size, number_of_classes, input_length).

Return type

Tensor
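
Example

A minimal usage sketch for the default waveform input type; the batch size and input length are arbitrary illustrative values.
>>> import torch
>>> from torchaudio.models import Wav2Letter
>>>
>>> wav2letter = Wav2Letter(num_classes=40, input_type="waveform", num_features=1)
>>> # Raw waveform: (batch_size, num_features, input_length)
>>> x = torch.randn(4, 1, 16000)
>>> out = wav2letter(x)
>>> # out: predictor tensor of shape (batch_size, num_classes, time)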

Wav2Vec2.0

Wav2Vec2Model

class torchaudio.models.Wav2Vec2Model(feature_extractor: torch.nn.modules.module.Module, encoder: torch.nn.modules.module.Module)[source]

Encoder model used in [4].

Note

To build the model, please use one of the factory functions.

Parameters
  • feature_extractor (torch.nn.Module) – Feature extractor that extracts feature vectors from raw audio Tensor.

  • encoder (torch.nn.Module) – Encoder that converts the audio features into the sequence of probability distribution (in negative log-likelihood) over labels.

extract_features(waveforms: torch.Tensor, lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Extract feature vectors from raw waveforms.

Parameters
  • waveforms (Tensor) – Audio tensor of shape (batch, frames).

  • lengths (Tensor, optional) – Indicates the valid length of each audio sample in the batch. Shape: (batch, ).

Returns

Tensor: Feature vectors. Shape: (batch, frames, feature dimension).

Tensor, optional: Indicates the valid length of each feature in the batch, computed based on the given lengths argument. Shape: (batch, ).

Return type

Tuple[Tensor, Optional[Tensor]]

forward(waveforms: torch.Tensor, lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Compute the sequence of probability distribution over labels.

Parameters
  • waveforms (Tensor) – Audio tensor of shape (batch, frames).

  • lengths (Tensor, optional) – Indicates the valid length of each audio sample in the batch. Shape: (batch, ).

Returns

Tensor: The sequences of probability distribution (in logits) over labels. Shape: (batch, frames, num labels).

Tensor, optional: Indicates the valid length of each feature in the batch, computed based on the given lengths argument. Shape: (batch, ).

Return type

Tuple[Tensor, Optional[Tensor]]
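
Example

A minimal usage sketch, building the model with the wav2vec2_base factory function documented below; the number of labels, batch size, and sample counts are arbitrary illustrative values.
>>> import torch
>>> from torchaudio.models import wav2vec2_base
>>>
>>> model = wav2vec2_base(num_out=32)
>>> # Batch of raw waveforms: (batch, frames); lengths gives the valid samples per item
>>> waveforms = torch.randn(2, 16000)
>>> lengths = torch.tensor([16000, 8000])
>>> logits, output_lengths = model(waveforms, lengths)
>>> features, feature_lengths = model.extract_features(waveforms, lengths)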

Factory Functions

wav2vec2_base

torchaudio.models.wav2vec2_base(num_out: int) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build wav2vec2.0 model with “Base” configuration from [4].

Parameters

num_out (int) – The number of output labels.

Returns

The resulting model.

Return type

Wav2Vec2Model

Example - Reload fine-tuned model from Hugging Face:
>>> # Session 1 - Convert pretrained model from Hugging Face and save the parameters.
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = import_huggingface_model(original)
>>> torch.save(model.state_dict(), "wav2vec2-base-960h.pt")
>>>
>>> # Session 2 - Load model and the parameters
>>> model = wav2vec2_base(num_out=32)
>>> model.load_state_dict(torch.load("wav2vec2-base-960h.pt"))

wav2vec2_large

torchaudio.models.wav2vec2_large(num_out: int) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build wav2vec2.0 model with “Large” configuration from [4].

Parameters

num_out (int) – The number of output labels.

Returns

The resulting model.

Return type

Wav2Vec2Model

Example - Reload fine-tuned model from Hugging Face:
>>> # Session 1 - Convert pretrained model from Hugging Face and save the parameters.
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
>>> model = import_huggingface_model(original)
>>> torch.save(model.state_dict(), "wav2vec2-large-960h.pt")
>>>
>>> # Session 2 - Load model and the parameters
>>> model = wav2vec2_large(num_out=32)
>>> model.load_state_dict(torch.load("wav2vec2-large-960h.pt"))

wav2vec2_large_lv60k

torchaudio.models.wav2vec2_large_lv60k(num_out: int) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build wav2vec2.0 model with “Large LV-60k” configuration from [4].

Parameters

num_out (int) – The number of output labels.

Returns

The resulting model.

Return type

Wav2Vec2Model

Example - Reload fine-tuned model from Hugging Face:
>>> # Session 1 - Convert pretrained model from Hugging Face and save the parameters.
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
>>> model = import_huggingface_model(original)
>>> torch.save(model.state_dict(), "wav2vec2-large-960h-lv60-self.pt")
>>>
>>> # Session 2 - Load model and the parameters
>>> model = wav2vec2_large_lv60k(num_out=32)
>>> model.load_state_dict(torch.load("wav2vec2-large-960h-lv60-self.pt"))

Utility Functions

import_huggingface_model

torchaudio.models.wav2vec2.utils.import_huggingface_model(original: torch.nn.modules.module.Module) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Import wav2vec2 model from Hugging Face’s Transformers.

Parameters

original (torch.nn.Module) – An instance of Wav2Vec2ForCTC from transformers.

Returns

Imported model.

Return type

Wav2Vec2Model

Example
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = import_huggingface_model(original)
>>>
>>> waveforms, _ = torchaudio.load("audio.wav")
>>> logits, _ = model(waveforms)

import_fairseq_model

torchaudio.models.wav2vec2.utils.import_fairseq_model(original: torch.nn.modules.module.Module, num_out: Optional[int] = None) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build Wav2Vec2Model from pretrained parameters published by fairseq.

Parameters
  • original (torch.nn.Module) – An instance of fairseq’s Wav2Vec2.0 model class. Either fairseq.models.wav2vec.wav2vec2_asr.Wav2VecEncoder or fairseq.models.wav2vec.wav2vec2.Wav2Vec2Model.

  • num_out (int, optional) – The number of output labels. Required only when the original model is an instance of fairseq.models.wav2vec.wav2vec2.Wav2Vec2Model.

Returns

Imported model.

Return type

Wav2Vec2Model

Example - Loading pretrain-only model
>>> from torchaudio.models.wav2vec2.utils import import_fairseq_model
>>>
>>> # Load model using fairseq
>>> model_file = 'wav2vec_small.pt'
>>> model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_file])
>>> original = model[0]
>>> imported = import_fairseq_model(original, num_out=28)
>>>
>>> # Perform feature extraction
>>> waveform, _ = torchaudio.load('audio.wav')
>>> features, _ = imported.extract_features(waveform)
>>>
>>> # Compare result with the original model from fairseq
>>> reference = original.feature_extractor(waveform).transpose(1, 2)
>>> torch.testing.assert_allclose(features, reference)
Example - Fine-tuned model
>>> from torchaudio.models.wav2vec2.utils import import_fairseq_model
>>>
>>> # Load model using fairseq
>>> model_file = 'wav2vec_small_960h.pt'
>>> model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_file])
>>> original = model[0]
>>> imported = import_fairseq_model(original.w2v_encoder)
>>>
>>> # Perform encoding
>>> waveform, _ = torchaudio.load('audio.wav')
>>> emission, _ = imported(waveform)
>>>
>>> # Compare result with the original model from fairseq
>>> mask = torch.zeros_like(waveform)
>>> reference = original(waveform, mask)['encoder_out'].transpose(0, 1)
>>> torch.testing.assert_allclose(emission, reference)

WaveRNN

class torchaudio.models.WaveRNN(upsample_scales: List[int], n_classes: int, hop_length: int, n_res_block: int = 10, n_rnn: int = 512, n_fc: int = 512, kernel_size: int = 5, n_freq: int = 128, n_hidden: int = 128, n_output: int = 128)[source]

WaveRNN model based on the implementation from fatchord.

The original implementation was introduced in [5]. The input channels of waveform and spectrogram have to be 1. The product of upsample_scales must equal hop_length.

Parameters
  • upsample_scales – the list of upsample scales.

  • n_classes – the number of output classes.

  • hop_length – the number of samples between the starts of consecutive frames.

  • n_res_block – the number of ResBlocks in the stack. (Default: 10)

  • n_rnn – the dimension of the RNN layers. (Default: 512)

  • n_fc – the dimension of the fully connected layers. (Default: 512)

  • kernel_size – the kernel size of the first Conv1d layer. (Default: 5)

  • n_freq – the number of bins in a spectrogram. (Default: 128)

  • n_hidden – the number of hidden dimensions of the ResBlock. (Default: 128)

  • n_output – the number of output dimensions of the MelResNet. (Default: 128)

Example
>>> wavernn = WaveRNN(upsample_scales=[5,5,8], n_classes=512, hop_length=200)
>>> waveform, sample_rate = torchaudio.load(file)
>>> # waveform shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length)
>>> specgram = MelSpectrogram(sample_rate)(waveform)  # shape: (n_batch, n_channel, n_freq, n_time)
>>> output = wavernn(waveform, specgram)
>>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes)
forward(waveform: torch.Tensor, specgram: torch.Tensor) → torch.Tensor[source]

Pass the input through the WaveRNN model.

Parameters
  • waveform – the input waveform to the WaveRNN layer, of shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length).

  • specgram – the input spectrogram to the WaveRNN layer, of shape (n_batch, 1, n_freq, n_time).

Returns

Output tensor of shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes)

Return type

Tensor

References

[1]

Yi Luo and Nima Mesgarani. Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, Aug 2019. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167, doi:10.1109/taslp.2019.2915167.

[2]

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: scaling up end-to-end speech recognition. 2014. arXiv:1412.5567.

[3]

Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. 2016. arXiv:1609.03193.

[4]

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020. arXiv:2006.11477.

[5]

Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. 2018. arXiv:1802.08435.
