torchaudio.models

The models subpackage contains definitions of models for addressing common audio tasks.

ConvTasNet

class torchaudio.models.ConvTasNet(num_sources: int = 2, enc_kernel_size: int = 16, enc_num_feats: int = 512, msk_kernel_size: int = 3, msk_num_feats: int = 128, msk_num_hidden_feats: int = 512, msk_num_layers: int = 8, msk_num_stacks: int = 3)[source]

Conv-TasNet: a fully-convolutional time-domain audio separation network

Parameters
  • num_sources (int) – The number of sources to split.

  • enc_kernel_size (int) – The convolution kernel size of the encoder/decoder, <L>.

  • enc_num_feats (int) – The feature dimensions passed to mask generator, <N>.

  • msk_kernel_size (int) – The convolution kernel size of the mask generator, <P>.

  • msk_num_feats (int) – The input/output feature dimension of conv block in the mask generator, <B, Sc>.

  • msk_num_hidden_feats (int) – The internal feature dimension of conv block of the mask generator, <H>.

  • msk_num_layers (int) – The number of layers in one conv block of the mask generator, <X>.

  • msk_num_stacks (int) – The number of conv blocks of the mask generator, <R>.

Note

This implementation corresponds to the “non-causal” setting in the paper.

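Reference: Luo, Yi and Mesgarani, Nima. "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation."
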
forward(input: torch.Tensor) → torch.Tensor[source]

Perform source separation. Generate audio source waveforms.

Parameters

input (torch.Tensor) – 3D Tensor with shape [batch, channel==1, frames]

Returns

3D Tensor with shape [batch, channel==num_sources, frames]

Return type

torch.Tensor

Wav2Letter

class torchaudio.models.Wav2Letter(num_classes: int = 40, input_type: str = 'waveform', num_features: int = 1)[source]

Wav2Letter model architecture from the paper "Wav2Letter: an End-to-End ConvNet-based Speech Recognition System".

\(\text{padding} = \left\lceil \frac{\text{kernel} - \text{stride}}{2} \right\rceil\)

Parameters
  • num_classes (int, optional) – Number of classes to be classified. (Default: 40)

  • input_type (str, optional) – Type of input the model receives: waveform, power_spectrum or mfcc (Default: waveform).

  • num_features (int, optional) – Number of input features that the network will receive (Default: 1).

forward(x: torch.Tensor) → torch.Tensor[source]
Parameters

x (torch.Tensor) – Tensor of dimension (batch_size, num_features, input_length).

Returns

Predictor tensor of dimension (batch_size, num_classes, input_length).

Return type

torch.Tensor
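
The padding rule stated for Wav2Letter can be checked with a little arithmetic. The sketch below reads the rule as the ceiling of (kernel - stride) / 2 and combines it with the standard 1D convolution output-length formula (dilation 1). The helper names and the (kernel, stride) pairs are hypothetical, chosen only for illustration; they are not taken from the torchaudio source.

```python
import math


def same_style_padding(kernel: int, stride: int) -> int:
    """Padding read as ceil((kernel - stride) / 2)."""
    return math.ceil((kernel - stride) / 2)


def conv1d_output_length(length: int, kernel: int, stride: int, padding: int) -> int:
    """Standard 1D convolution output length, assuming dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


# Hypothetical (kernel, stride) pairs, for illustration only.
for kernel, stride in [(250, 160), (48, 2), (7, 1)]:
    pad = same_style_padding(kernel, stride)
    out = conv1d_output_length(16000, kernel, stride, pad)
    print(f"kernel={kernel} stride={stride} padding={pad} output_length={out}")
```

With this padding, a layer with stride s produces roughly length / s output frames, so the time dimension is preserved only by layers whose stride is 1.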

WaveRNN

class torchaudio.models.WaveRNN(upsample_scales: List[int], n_classes: int, hop_length: int, n_res_block: int = 10, n_rnn: int = 512, n_fc: int = 512, kernel_size: int = 5, n_freq: int = 128, n_hidden: int = 128, n_output: int = 128)[source]

WaveRNN model based on the implementation from fatchord.

The original implementation was introduced in “Efficient Neural Audio Synthesis”. The input channels of waveform and spectrogram have to be 1. The product of upsample_scales must equal hop_length.

Parameters
  • upsample_scales – the list of upsample scales.

  • n_classes – the number of output classes.

  • hop_length – the number of samples between the starts of consecutive frames.

  • n_res_block – the number of ResBlocks in the stack. (Default: 10)

  • n_rnn – the dimension of RNN layer. (Default: 512)

  • n_fc – the dimension of fully connected layer. (Default: 512)

  • kernel_size – the kernel size of the first Conv1d layer. (Default: 5)

  • n_freq – the number of bins in a spectrogram. (Default: 128)

  • n_hidden – the number of hidden dimensions of resblock. (Default: 128)

  • n_output – the number of output dimensions of melresnet. (Default: 128)

Example
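>>> import torchaudio
>>> from torchaudio.models import WaveRNN
>>> from torchaudio.transforms import MelSpectrogram
>>> wavernn = WaveRNN(upsample_scales=[5, 5, 8], n_classes=512, hop_length=200)
>>> waveform, sample_rate = torchaudio.load(file)  # file: path to a mono audio file
>>> specgram = MelSpectrogram(sample_rate)(waveform).unsqueeze(0)  # shape: (n_batch, n_channel, n_freq, n_time)
>>> n_time = specgram.size(-1)
>>> # trim the waveform to (n_time - kernel_size + 1) * hop_length samples (kernel_size=5, hop_length=200)
>>> waveform = waveform[:, :(n_time - 5 + 1) * 200].unsqueeze(0)  # shape: (n_batch, n_channel, n_samples)
>>> output = wavernn(waveform, specgram)
>>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes)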
forward(waveform: torch.Tensor, specgram: torch.Tensor) → torch.Tensor[source]

Pass the input through the WaveRNN model.

Parameters
  • waveform – the input waveform to the WaveRNN layer (n_batch, 1, (n_time - kernel_size + 1) * hop_length)

  • specgram – the input spectrogram to the WaveRNN layer (n_batch, 1, n_freq, n_time)

Returns

Tensor of shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes)

Return type

torch.Tensor
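
The shape constraints documented for WaveRNN can be worked through without running the model. The sketch below is a plain-Python check, not part of torchaudio: the helper wavernn_shapes is hypothetical, and it only encodes the two rules stated above, namely that the product of upsample_scales must equal hop_length and that the waveform length is (n_time - kernel_size + 1) * hop_length.

```python
import math
from typing import List, Tuple

Shape = Tuple[int, ...]


def wavernn_shapes(upsample_scales: List[int], hop_length: int, kernel_size: int,
                   n_classes: int, n_batch: int, n_freq: int,
                   n_time: int) -> Tuple[Shape, Shape, Shape]:
    """Hypothetical helper: return the (waveform, specgram, output) shapes."""
    # Rule 1: the upsampling factors must multiply out to the hop length.
    if math.prod(upsample_scales) != hop_length:
        raise ValueError("product of upsample_scales must equal hop_length")
    if n_time < kernel_size:
        raise ValueError("n_time must be at least kernel_size")
    # Rule 2: number of waveform samples matching n_time spectrogram frames.
    n_samples = (n_time - kernel_size + 1) * hop_length
    waveform = (n_batch, 1, n_samples)
    specgram = (n_batch, 1, n_freq, n_time)
    output = (n_batch, 1, n_samples, n_classes)
    return waveform, specgram, output


waveform_shape, specgram_shape, output_shape = wavernn_shapes(
    upsample_scales=[5, 5, 8], hop_length=200, kernel_size=5,
    n_classes=512, n_batch=4, n_freq=128, n_time=10)
print(waveform_shape)   # (4, 1, 1200)
print(specgram_shape)   # (4, 1, 128, 10)
print(output_shape)     # (4, 1, 1200, 512)
```

Here 5 * 5 * 8 = 200 matches hop_length, and 10 spectrogram frames with kernel_size 5 correspond to (10 - 5 + 1) * 200 = 1200 waveform samples.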
