torchaudio.models
The models subpackage contains definitions of models for addressing common audio tasks.

ConvTasNet

class torchaudio.models.ConvTasNet(num_sources: int = 2, enc_kernel_size: int = 16, enc_num_feats: int = 512, msk_kernel_size: int = 3, msk_num_feats: int = 128, msk_num_hidden_feats: int = 512, msk_num_layers: int = 8, msk_num_stacks: int = 3)[source]

Conv-TasNet: a fully-convolutional time-domain audio separation network [1].
- Parameters
num_sources (int) – The number of sources to split.
enc_kernel_size (int) – The convolution kernel size of the encoder/decoder, <L>.
enc_num_feats (int) – The feature dimensions passed to mask generator, <N>.
msk_kernel_size (int) – The convolution kernel size of the mask generator, <P>.
msk_num_feats (int) – The input/output feature dimension of conv block in the mask generator, <B, Sc>.
msk_num_hidden_feats (int) – The internal feature dimension of conv block of the mask generator, <H>.
msk_num_layers (int) – The number of layers in one conv block of the mask generator, <X>.
msk_num_stacks (int) – The number of conv blocks of the mask generator, <R>.
Note
This implementation corresponds to the “non-causal” setting in the paper.
forward(input: torch.Tensor) → torch.Tensor[source]

Perform source separation. Generate audio source waveforms.

- Parameters
input (torch.Tensor) – 3D Tensor with shape [batch, channel==1, frames]
- Returns
3D Tensor with shape [batch, channel==num_sources, frames]
- Return type
Tensor
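
- Example (a minimal usage sketch; the batch size and the one-second, 8 kHz input below are illustrative choices, not requirements)
>>> import torch
>>> from torchaudio.models import ConvTasNet
>>>
>>> model = ConvTasNet(num_sources=2)
>>> # A batch of 3 mono mixtures, shape (batch, channel==1, frames)
>>> mixture = torch.randn(3, 1, 8000)
>>> separated = model(mixture)
>>> # separated has shape (3, 2, 8000), i.e. (batch, num_sources, frames)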

DeepSpeech

class torchaudio.models.DeepSpeech(n_feature: int, n_hidden: int = 2048, n_class: int = 40, dropout: float = 0.0)[source]

DeepSpeech model architecture from [2].
- Parameters
n_feature – Number of input features.
n_hidden – Internal hidden unit size.
n_class – Number of output classes.
dropout – Dropout probability. (Default: 0.0)
forward(x: torch.Tensor) → torch.Tensor[source]

- Parameters
x (torch.Tensor) – Tensor of dimension (batch, channel, time, feature).
- Returns
Predictor tensor of dimension (batch, time, class).
- Return type
Tensor
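
- Example (a minimal sketch; the batch size, 100 time frames, and n_feature=64 are illustrative)
>>> import torch
>>> from torchaudio.models import DeepSpeech
>>>
>>> model = DeepSpeech(n_feature=64)
>>> # Input of dimension (batch, channel, time, feature)
>>> x = torch.randn(8, 1, 100, 64)
>>> out = model(x)
>>> # out has shape (8, 100, 40), i.e. (batch, time, class) with the default n_class=40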

Wav2Letter

class torchaudio.models.Wav2Letter(num_classes: int = 40, input_type: str = 'waveform', num_features: int = 1)[source]

Wav2Letter model architecture from [3].
\(\text{padding} = \frac{\text{ceil}(\text{kernel} - \text{stride})}{2}\)
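For instance, a hypothetical layer with kernel size 48 and stride 2 would use \(\text{padding} = \frac{\text{ceil}(48 - 2)}{2} = 23\).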
- Parameters
num_classes (int, optional) – Number of classes to be classified. (Default: 40)
input_type (str, optional) – Type of input accepted by the network: waveform, power_spectrum or mfcc. (Default: 'waveform')
num_features (int, optional) – Number of input features that the network will receive. (Default: 1)

forward(x: torch.Tensor) → torch.Tensor[source]

- Parameters
x (torch.Tensor) – Tensor of dimension (batch_size, num_features, input_length).
- Returns
Predictor tensor of dimension (batch_size, number_of_classes, input_length).
- Return type
Tensor
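
- Example (a minimal sketch; the batch size and waveform length are illustrative)
>>> import torch
>>> from torchaudio.models import Wav2Letter
>>>
>>> model = Wav2Letter(num_classes=40, input_type="waveform", num_features=1)
>>> # Input of dimension (batch_size, num_features, input_length)
>>> x = torch.randn(4, 1, 16000)
>>> out = model(x)
>>> # out is the (batch_size, number_of_classes, time) predictor tensor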

Wav2Vec2.0

Wav2Vec2Model

class torchaudio.models.Wav2Vec2Model(feature_extractor: torch.nn.modules.module.Module, encoder: torch.nn.modules.module.Module)[source]

Encoder model used in [4].
Note
To build the model, please use one of the factory functions.
- Parameters
feature_extractor (torch.nn.Module) – Feature extractor that extracts feature vectors from raw audio Tensor.
encoder (torch.nn.Module) – Encoder that converts the audio features into the sequence of probability distribution (in negative log-likelihood) over labels.
extract_features(waveforms: torch.Tensor, lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Extract feature vectors from raw waveforms.

- Parameters
waveforms (Tensor) – Audio tensor of shape (batch, frames).
lengths (Tensor, optional) – Indicates the valid length of each audio sample in the batch. Shape: (batch, ).
- Returns
Tensor: Feature vectors. Shape: (batch, frames, feature dimension).
Tensor, optional: Indicates the valid length of each feature in the batch, computed based on the given lengths argument. Shape: (batch, ).
- Return type
Tuple[Tensor, Optional[Tensor]]
forward(waveforms: torch.Tensor, lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]][source]

Compute the sequence of probability distribution over labels.

- Parameters
waveforms (Tensor) – Audio tensor of shape (batch, frames).
lengths (Tensor, optional) – Indicates the valid length of each audio sample in the batch. Shape: (batch, ).
- Returns
Tensor: The sequences of probability distribution (in logit) over labels. Shape: (batch, frames, num labels).
Tensor, optional: Indicates the valid length of each feature in the batch, computed based on the given lengths argument. Shape: (batch, ).
- Return type
Tuple[Tensor, Optional[Tensor]]
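
- Example (a minimal sketch; num_out=32, the batch size, and the waveform lengths are illustrative; the model is built with the wav2vec2_base factory function documented below)
>>> import torch
>>> from torchaudio.models import wav2vec2_base
>>>
>>> model = wav2vec2_base(num_out=32)
>>> # Two mono waveforms zero-padded to a common length, with their valid lengths
>>> waveforms = torch.randn(2, 16000)
>>> lengths = torch.tensor([16000, 12000])
>>> features, feature_lengths = model.extract_features(waveforms, lengths)
>>> emissions, emission_lengths = model(waveforms, lengths)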

Factory Functions

wav2vec2_base

torchaudio.models.wav2vec2_base(num_out: int) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build wav2vec2.0 model with “Base” configuration from [4].

- Parameters
num_out (int) – The number of output labels.
- Returns
The resulting model.
- Return type
Wav2Vec2Model
- Example - Reload fine-tuned model from Hugging Face:
>>> # Session 1 - Convert pretrained model from Hugging Face and save the parameters.
>>> import torch
>>> from transformers import Wav2Vec2ForCTC
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = import_huggingface_model(original)
>>> torch.save(model.state_dict(), "wav2vec2-base-960h.pt")
>>>
>>> # Session 2 - Load model and the parameters
>>> from torchaudio.models import wav2vec2_base
>>>
>>> model = wav2vec2_base(num_out=32)
>>> model.load_state_dict(torch.load("wav2vec2-base-960h.pt"))

wav2vec2_large

torchaudio.models.wav2vec2_large(num_out: int) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build wav2vec2.0 model with “Large” configuration from [4].

- Parameters
num_out (int) – The number of output labels.
- Returns
The resulting model.
- Return type
Wav2Vec2Model
- Example - Reload fine-tuned model from Hugging Face:
>>> # Session 1 - Convert pretrained model from Hugging Face and save the parameters.
>>> import torch
>>> from transformers import Wav2Vec2ForCTC
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
>>> model = import_huggingface_model(original)
>>> torch.save(model.state_dict(), "wav2vec2-large-960h.pt")
>>>
>>> # Session 2 - Load model and the parameters
>>> from torchaudio.models import wav2vec2_large
>>>
>>> model = wav2vec2_large(num_out=32)
>>> model.load_state_dict(torch.load("wav2vec2-large-960h.pt"))

wav2vec2_large_lv60k

torchaudio.models.wav2vec2_large_lv60k(num_out: int) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build wav2vec2.0 model with “Large LV-60k” configuration from [4].

- Parameters
num_out (int) – The number of output labels.
- Returns
The resulting model.
- Return type
Wav2Vec2Model
- Example - Reload fine-tuned model from Hugging Face:
>>> # Session 1 - Convert pretrained model from Hugging Face and save the parameters.
>>> import torch
>>> from transformers import Wav2Vec2ForCTC
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
>>> model = import_huggingface_model(original)
>>> torch.save(model.state_dict(), "wav2vec2-large-960h-lv60-self.pt")
>>>
>>> # Session 2 - Load model and the parameters
>>> from torchaudio.models import wav2vec2_large_lv60k
>>>
>>> model = wav2vec2_large_lv60k(num_out=32)
>>> model.load_state_dict(torch.load("wav2vec2-large-960h-lv60-self.pt"))

Utility Functions

import_huggingface_model

torchaudio.models.wav2vec2.utils.import_huggingface_model(original: torch.nn.modules.module.Module) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Import wav2vec2 model from Hugging Face’s Transformers.

- Parameters
original (torch.nn.Module) – An instance of Wav2Vec2ForCTC from transformers.
- Returns
Imported model.
- Return type
Wav2Vec2Model
- Example
>>> import torchaudio
>>> from transformers import Wav2Vec2ForCTC
>>> from torchaudio.models.wav2vec2.utils import import_huggingface_model
>>>
>>> original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
>>> model = import_huggingface_model(original)
>>>
>>> waveforms, _ = torchaudio.load("audio.wav")
>>> logits, _ = model(waveforms)

import_fairseq_model

torchaudio.models.wav2vec2.utils.import_fairseq_model(original: torch.nn.modules.module.Module, num_out: Optional[int] = None) → torchaudio.models.wav2vec2.model.Wav2Vec2Model[source]

Build Wav2Vec2Model from pretrained parameters published by fairseq.

- Parameters
original (torch.nn.Module) – An instance of fairseq’s Wav2Vec2.0 model class. Either fairseq.models.wav2vec.wav2vec2_asr.Wav2VecEncoder or fairseq.models.wav2vec.wav2vec2.Wav2Vec2Model.
num_out (int, optional) – The number of output labels. Required only when the original model is an instance of fairseq.models.wav2vec.wav2vec2.Wav2Vec2Model.
- Returns
Imported model.
- Return type
Wav2Vec2Model
- Example - Loading pretrain-only model
>>> from torchaudio.models.wav2vec2.utils import import_fairseq_model
>>>
>>> # Load model using fairseq
>>> model_file = 'wav2vec_small.pt'
>>> model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_file])
>>> original = model[0]
>>> imported = import_fairseq_model(original, num_out=28)
>>>
>>> # Perform feature extraction
>>> waveform, _ = torchaudio.load('audio.wav')
>>> features, _ = imported.extract_features(waveform)
>>>
>>> # Compare result with the original model from fairseq
>>> reference = original.feature_extractor(waveform).transpose(1, 2)
>>> torch.testing.assert_allclose(features, reference)
- Example - Fine-tuned model
>>> from torchaudio.models.wav2vec2.utils import import_fairseq_model
>>>
>>> # Load model using fairseq
>>> model_file = 'wav2vec_small_960h.pt'
>>> model, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_file])
>>> original = model[0]
>>> imported = import_fairseq_model(original.w2v_encoder)
>>>
>>> # Perform encoding
>>> waveform, _ = torchaudio.load('audio.wav')
>>> emission, _ = imported(waveform)
>>>
>>> # Compare result with the original model from fairseq
>>> mask = torch.zeros_like(waveform)
>>> reference = original(waveform, mask)['encoder_out'].transpose(0, 1)
>>> torch.testing.assert_allclose(emission, reference)

WaveRNN

class torchaudio.models.WaveRNN(upsample_scales: List[int], n_classes: int, hop_length: int, n_res_block: int = 10, n_rnn: int = 512, n_fc: int = 512, kernel_size: int = 5, n_freq: int = 128, n_hidden: int = 128, n_output: int = 128)[source]

WaveRNN model based on the implementation from fatchord. The original implementation was introduced in [5]. The input channels of waveform and spectrogram have to be 1. The product of upsample_scales must equal hop_length.
- Parameters
upsample_scales – the list of upsample scales.
n_classes – the number of output classes.
hop_length – the number of samples between the starts of consecutive frames.
n_res_block – the number of ResBlocks in the stack. (Default: 10)
n_rnn – the dimension of the RNN layer. (Default: 512)
n_fc – the dimension of the fully connected layer. (Default: 512)
kernel_size – the kernel size of the first Conv1d layer. (Default: 5)
n_freq – the number of bins in a spectrogram. (Default: 128)
n_hidden – the number of hidden dimensions of resblock. (Default: 128)
n_output – the number of output dimensions of melresnet. (Default: 128)
- Example
>>> wavernn = WaveRNN(upsample_scales=[5, 5, 8], n_classes=512, hop_length=200)
>>> waveform, sample_rate = torchaudio.load(file)
>>> # waveform shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length)
>>> specgram = MelSpectrogram(sample_rate)(waveform)  # shape: (n_batch, n_channel, n_freq, n_time)
>>> output = wavernn(waveform, specgram)
>>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes)
forward(waveform: torch.Tensor, specgram: torch.Tensor) → torch.Tensor[source]

Pass the input through the WaveRNN model.

- Parameters
waveform – the input waveform to the WaveRNN layer. Shape: (n_batch, 1, (n_time - kernel_size + 1) * hop_length)
specgram – the input spectrogram to the WaveRNN layer. Shape: (n_batch, 1, n_freq, n_time)
- Returns
Output tensor of shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes)
- Return type
Tensor

References

[1] Yi Luo and Nima Mesgarani. Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, Aug 2019. URL: http://dx.doi.org/10.1109/TASLP.2019.2915167, doi:10.1109/taslp.2019.2915167.

[2] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: scaling up end-to-end speech recognition. 2014. arXiv:1412.5567.

[3] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. Wav2letter: an end-to-end convnet-based speech recognition system. 2016. arXiv:1609.03193.

[4] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. Wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020. arXiv:2006.11477.

[5] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. 2018. arXiv:1802.08435.