The models subpackage contains definitions of models for addressing common audio tasks.


class torchaudio.models.ConvTasNet(num_sources: int = 2, enc_kernel_size: int = 16, enc_num_feats: int = 512, msk_kernel_size: int = 3, msk_num_feats: int = 128, msk_num_hidden_feats: int = 512, msk_num_layers: int = 8, msk_num_stacks: int = 3)[source]

Conv-TasNet: a fully-convolutional time-domain audio separation network

  • num_sources (int) – The number of sources to split.

  • enc_kernel_size (int) – The convolution kernel size of the encoder/decoder, <L>.

  • enc_num_feats (int) – The feature dimensions passed to mask generator, <N>.

  • msk_kernel_size (int) – The convolution kernel size of the mask generator, <P>.

  • msk_num_feats (int) – The input/output feature dimension of conv block in the mask generator, <B, Sc>.

  • msk_num_hidden_feats (int) – The internal feature dimension of conv block of the mask generator, <H>.

  • msk_num_layers (int) – The number of layers in one conv block of the mask generator, <X>.

  • msk_num_stacks (int) – The numbr of conv blocks of the mask generator, <R>.


This implementation corresponds to the “non-causal” setting in the paper.

forward(input: torch.Tensor) → torch.Tensor[source]

Perform source separation. Generate audio source waveforms.


input (torch.Tensor) – 3D Tensor with shape [batch, channel==1, frames]


3D Tensor with shape [batch, channel==num_sources, frames]

Return type



class torchaudio.models.Wav2Letter(num_classes: int = 40, input_type: str = 'waveform', num_features: int = 1)[source]

Wav2Letter model architecture from the Wav2Letter an End-to-End ConvNet-based Speech Recognition System.

\(\text{padding} = \frac{\text{ceil}(\text{kernel} - \text{stride})}{2}\)

  • num_classes (int, optional) – Number of classes to be classified. (Default: 40)

  • input_type (str, optional) – Wav2Letter can use as input: waveform, power_spectrum or mfcc (Default: waveform).

  • num_features (int, optional) – Number of input features that the network will receive (Default: 1).

forward(x: torch.Tensor) → torch.Tensor[source]

x (torch.Tensor) – Tensor of dimension (batch_size, num_features, input_length).


Predictor tensor of dimension (batch_size, number_of_classes, input_length).

Return type



class torchaudio.models.WaveRNN(upsample_scales: List[int], n_classes: int, hop_length: int, n_res_block: int = 10, n_rnn: int = 512, n_fc: int = 512, kernel_size: int = 5, n_freq: int = 128, n_hidden: int = 128, n_output: int = 128)[source]

WaveRNN model based on the implementation from fatchord.

The original implementation was introduced in “Efficient Neural Audio Synthesis”. The input channels of waveform and spectrogram have to be 1. The product of upsample_scales must equal hop_length.

  • upsample_scales – the list of upsample scales.

  • n_classes – the number of output classes.

  • hop_length – the number of samples between the starts of consecutive frames.

  • n_res_block – the number of ResBlock in stack. (Default: 10)

  • n_rnn – the dimension of RNN layer. (Default: 512)

  • n_fc – the dimension of fully connected layer. (Default: 512)

  • kernel_size – the number of kernel size in the first Conv1d layer. (Default: 5)

  • n_freq – the number of bins in a spectrogram. (Default: 128)

  • n_hidden – the number of hidden dimensions of resblock. (Default: 128)

  • n_output – the number of output dimensions of melresnet. (Default: 128)

>>> wavernn = WaveRNN(upsample_scales=[5,5,8], n_classes=512, hop_length=200)
>>> waveform, sample_rate = torchaudio.load(file)
>>> # waveform shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length)
>>> specgram = MelSpectrogram(sample_rate)(waveform)  # shape: (n_batch, n_channel, n_freq, n_time)
>>> output = wavernn(waveform, specgram)
>>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes)
forward(waveform: torch.Tensor, specgram: torch.Tensor) → torch.Tensor[source]

Pass the input through the WaveRNN model.

  • waveform – the input waveform to the WaveRNN layer (n_batch, 1, (n_time - kernel_size + 1) * hop_length)

  • specgram – the input spectrogram to the WaveRNN layer (n_batch, 1, n_freq, n_time)


(n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes)

Return type

Tensor shape


Access comprehensive developer documentation for PyTorch

View Docs


Get in-depth tutorials for beginners and advanced developers

View Tutorials


Find development resources and get your questions answered

View Resources