torchaudio.models

The models subpackage contains definitions of models for addressing common audio tasks.

ConvTasNet

class torchaudio.models.ConvTasNet(num_sources: int = 2, enc_kernel_size: int = 16, enc_num_feats: int = 512, msk_kernel_size: int = 3, msk_num_feats: int = 128, msk_num_hidden_feats: int = 512, msk_num_layers: int = 8, msk_num_stacks: int = 3)

Conv-TasNet: a fully-convolutional time-domain audio separation network

Parameters

num_sources (int) – The number of sources to split.
enc_kernel_size (int) – The convolution kernel size of the encoder/decoder, <L>.
enc_num_feats (int) – The feature dimensions passed to mask generator, <N>.
msk_kernel_size (int) – The convolution kernel size of the mask generator, <P>.
msk_num_feats (int) – The input/output feature dimension of conv block in the mask generator, <B, Sc>.
msk_num_hidden_feats (int) – The internal feature dimension of conv block of the mask generator, <H>.
msk_num_layers (int) – The number of layers in one conv block of the mask generator, <X>.
msk_num_stacks (int) – The number of conv blocks of the mask generator, <R>.

Note

This implementation corresponds to the “non-causal” setting in the paper.

Reference:

Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation

Luo, Yi and Mesgarani, Nima

https://arxiv.org/abs/1809.07454

forward(input: torch.Tensor) → torch.Tensor

Perform source separation. Generate audio source waveforms.

Parameters: input (torch.Tensor) – 3D Tensor with shape [batch, channel==1, frames]
Returns: 3D Tensor with shape [batch, channel==num_sources, frames]
Return type: torch.Tensor

Wav2Letter

class torchaudio.models.Wav2Letter(num_classes: int = 40, input_type: str = 'waveform', num_features: int = 1)

Wav2Letter model architecture from the paper “Wav2Letter: an End-to-End ConvNet-based Speech Recognition System”.

$\text{padding} = \left\lceil \frac{\text{kernel} - \text{stride}}{2} \right\rceil$
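As a quick check of the padding rule, the sketch below reads it as the ceiling of half the difference between kernel size and stride, which is an assumption about the intended grouping (an integer padding is needed when the difference is odd). The helper name conv_padding and the kernel/stride pairs are hypothetical, chosen only to exercise the formula.

```python
import math


def conv_padding(kernel: int, stride: int) -> int:
    """Padding per side, read as ceil((kernel - stride) / 2)."""
    return math.ceil((kernel - stride) / 2)


# Illustrative (kernel, stride) pairs; not taken from this page.
examples = [(48, 2), (7, 1), (32, 1), (250, 160)]
paddings = {pair: conv_padding(*pair) for pair in examples}

for (kernel, stride), pad in paddings.items():
    print(f"kernel={kernel}, stride={stride} -> padding={pad}")
```

The pair (32, 1) shows why the ceiling matters: half the difference is 15.5, which rounds up to a padding of 16.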

Parameters

num_classes (int, optional) – Number of classes to be classified. (Default: 40)
input_type (str, optional) – Type of input the model receives: waveform, power_spectrum or mfcc. (Default: waveform)
num_features (int, optional) – Number of input features that the network will receive (Default: 1).

forward(x: torch.Tensor) → torch.Tensor

Parameters: x (torch.Tensor) – Tensor of dimension (batch_size, num_features, input_length).
Returns: Predictor tensor of dimension (batch_size, number_of_classes, input_length).
Return type: Tensor

WaveRNN

class torchaudio.models.WaveRNN(upsample_scales: List[int], n_classes: int, hop_length: int, n_res_block: int = 10, n_rnn: int = 512, n_fc: int = 512, kernel_size: int = 5, n_freq: int = 128, n_hidden: int = 128, n_output: int = 128)

WaveRNN model based on the implementation from fatchord.

The original implementation was introduced in “Efficient Neural Audio Synthesis”. The input channels of waveform and spectrogram have to be 1. The product of upsample_scales must equal hop_length.
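The constraint between upsample_scales and hop_length can be verified before constructing the model. The following is a minimal sketch in plain Python; the helper name check_upsample_scales is hypothetical and is not part of torchaudio.

```python
import math
from typing import List


def check_upsample_scales(upsample_scales: List[int], hop_length: int) -> int:
    """Return the total upsampling factor, or raise if it differs from hop_length."""
    total = math.prod(upsample_scales)
    if total != hop_length:
        raise ValueError(
            f"product of upsample_scales is {total}, expected hop_length={hop_length}"
        )
    return total


# 5 * 5 * 8 == 200, so these scales are compatible with hop_length=200.
total = check_upsample_scales([5, 5, 8], hop_length=200)
print(total)
```

Any factorization of hop_length satisfies the constraint as stated; a mismatched list such as [4, 4, 8] with hop_length=200 is rejected by this check.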

Parameters

upsample_scales – the list of upsample scales.
n_classes – the number of output classes.
hop_length – the number of samples between the starts of consecutive frames.
n_res_block – the number of ResBlocks in the stack. (Default: 10)
n_rnn – the dimension of RNN layer. (Default: 512)
n_fc – the dimension of fully connected layer. (Default: 512)
kernel_size – the kernel size of the first Conv1d layer. (Default: 5)
n_freq – the number of bins in a spectrogram. (Default: 128)
n_hidden – the number of hidden dimensions of resblock. (Default: 128)
n_output – the number of output dimensions of melresnet. (Default: 128)

Example

>>> import torchaudio
>>> from torchaudio.models import WaveRNN
>>> from torchaudio.transforms import MelSpectrogram
>>> wavernn = WaveRNN(upsample_scales=[5, 5, 8], n_classes=512, hop_length=200)
>>> waveform, sample_rate = torchaudio.load(file)  # `file` is the path to an audio file
>>> # waveform must have shape (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length)
>>> specgram = MelSpectrogram(sample_rate)(waveform)  # shape: (n_batch, n_channel, n_freq, n_time)
>>> output = wavernn(waveform, specgram)
>>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes)

forward(waveform: torch.Tensor, specgram: torch.Tensor) → torch.Tensor

Pass the input through the WaveRNN model.

Parameters

waveform – the input waveform to the WaveRNN layer (n_batch, 1, (n_time - kernel_size + 1) * hop_length)
specgram – the input spectrogram to the WaveRNN layer (n_batch, 1, n_freq, n_time)

Returns

Tensor of shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes)

Return type

Tensor
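The shape relations above can be worked through with concrete numbers. This is a minimal sketch in plain Python; the helper name expected_shapes is hypothetical, the defaults kernel_size=5 and n_freq=128 are taken from the constructor signature, and n_time=10 is an arbitrary illustrative value.

```python
from typing import Dict, Tuple


def expected_shapes(
    n_batch: int,
    n_time: int,
    hop_length: int,
    n_classes: int,
    kernel_size: int = 5,
    n_freq: int = 128,
) -> Dict[str, Tuple[int, ...]]:
    """Shapes that forward expects and returns, following the formulas above."""
    # The first Conv1d consumes kernel_size - 1 spectrogram frames.
    n_samples = (n_time - kernel_size + 1) * hop_length
    return {
        "waveform": (n_batch, 1, n_samples),
        "specgram": (n_batch, 1, n_freq, n_time),
        "output": (n_batch, 1, n_samples, n_classes),
    }


shapes = expected_shapes(n_batch=2, n_time=10, hop_length=200, n_classes=512)
for name, shape in shapes.items():
    print(name, shape)
```

With 10 spectrogram frames and hop_length=200, the waveform must contain (10 - 5 + 1) * 200 = 1200 samples, and the output adds a trailing dimension of size n_classes.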
