torchaudio.models
The models subpackage contains definitions of models for addressing common audio tasks.
ConvTasNet
class torchaudio.models.ConvTasNet(num_sources: int = 2, enc_kernel_size: int = 16, enc_num_feats: int = 512, msk_kernel_size: int = 3, msk_num_feats: int = 128, msk_num_hidden_feats: int = 512, msk_num_layers: int = 8, msk_num_stacks: int = 3)
Conv-TasNet: a fully-convolutional time-domain audio separation network.
- Parameters
num_sources (int) – The number of sources to split.
enc_kernel_size (int) – The convolution kernel size of the encoder/decoder, <L>.
enc_num_feats (int) – The feature dimensions passed to mask generator, <N>.
msk_kernel_size (int) – The convolution kernel size of the mask generator, <P>.
msk_num_feats (int) – The input/output feature dimension of conv block in the mask generator, <B, Sc>.
msk_num_hidden_feats (int) – The internal feature dimension of conv block of the mask generator, <H>.
msk_num_layers (int) – The number of layers in one conv block of the mask generator, <X>.
msk_num_stacks (int) – The number of conv blocks of the mask generator, <R>.
Note
This implementation corresponds to the “non-causal” setting in the paper.
- Reference:
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
Luo, Yi and Mesgarani, Nima
forward(input: torch.Tensor) → torch.Tensor
Perform source separation. Generate audio source waveforms.
- Parameters
input (torch.Tensor) – 3D Tensor with shape [batch, channel==1, frames]
- Returns
3D Tensor with shape [batch, channel==num_sources, frames]
- Return type
torch.Tensor
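The mask generator parameters combine multiplicatively: `msk_num_layers` layers per conv block times `msk_num_stacks` blocks gives the total layer count. The following is a minimal sketch in plain Python, not torchaudio code. `MaskGeneratorConfig` is a hypothetical helper, and the dilation schedule (doubling with each layer inside a block) is an assumption taken from the Conv-TasNet paper rather than from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskGeneratorConfig:
    """Hypothetical container mirroring the documented ConvTasNet defaults."""

    msk_kernel_size: int = 3  # <P>
    msk_num_layers: int = 8   # <X>, layers in one conv block
    msk_num_stacks: int = 3   # <R>, number of conv blocks

    def total_layers(self) -> int:
        # Every conv block holds the same number of layers.
        return self.msk_num_layers * self.msk_num_stacks

    def receptive_field(self) -> int:
        # Assumption (from the paper): layer x of each block uses dilation 2**x.
        # A layer with kernel P and dilation d widens the field by (P - 1) * d.
        per_block = sum(
            (self.msk_kernel_size - 1) * 2 ** x
            for x in range(self.msk_num_layers)
        )
        return 1 + self.msk_num_stacks * per_block


cfg = MaskGeneratorConfig()
print(cfg.total_layers())     # 24
print(cfg.receptive_field())  # 1531, counted in encoder output frames
```

Under these assumptions, raising `msk_num_layers` grows the receptive field exponentially, while raising `msk_num_stacks` grows it linearly.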
Wav2Letter
class torchaudio.models.Wav2Letter(num_classes: int = 40, input_type: str = 'waveform', num_features: int = 1)
Wav2Letter model architecture from “Wav2Letter: an End-to-End ConvNet-based Speech Recognition System”.
\(\text{padding} = \frac{\text{ceil}(\text{kernel} - \text{stride})}{2}\)
- Parameters
forward(x: torch.Tensor) → torch.Tensor
- Parameters
x (torch.Tensor) – Tensor of dimension (batch_size, num_features, input_length).
- Returns
Predictor tensor of dimension (batch_size, number_of_classes, input_length).
- Return type
Tensor
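The padding formula in the Wav2Letter class description can be checked with a small calculation. The sketch below evaluates the formula as written and then applies the standard output-length formula for a 1D convolution. The kernel size 48 and stride 2 are illustrative values chosen here, not taken from this page, and both helper functions are hypothetical.

```python
import math


def wav2letter_padding(kernel: int, stride: int) -> int:
    # Formula as documented: ceil(kernel - stride) / 2.
    # Using integer division is an assumption for odd differences.
    return math.ceil(kernel - stride) // 2


def conv1d_output_length(length: int, kernel: int, stride: int, padding: int) -> int:
    # Standard output length of a 1D convolution with dilation 1.
    return (length + 2 * padding - kernel) // stride + 1


padding = wav2letter_padding(kernel=48, stride=2)
print(padding)  # 23
print(conv1d_output_length(16000, kernel=48, stride=2, padding=padding))  # 8000
```

With this padding, a layer of stride 2 maps an even input length to exactly half as many frames.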
WaveRNN
class torchaudio.models.WaveRNN(upsample_scales: List[int], n_classes: int, hop_length: int, n_res_block: int = 10, n_rnn: int = 512, n_fc: int = 512, kernel_size: int = 5, n_freq: int = 128, n_hidden: int = 128, n_output: int = 128)
WaveRNN model based on the implementation from fatchord.
The original implementation was introduced in “Efficient Neural Audio Synthesis”. The input channels of waveform and spectrogram have to be 1. The product of upsample_scales must equal hop_length.
- Parameters
upsample_scales – the list of upsample scales.
n_classes – the number of output classes.
hop_length – the number of samples between the starts of consecutive frames.
n_res_block – the number of ResBlock in stack. (Default: 10)
n_rnn – the dimension of RNN layer. (Default: 512)
n_fc – the dimension of fully connected layer. (Default: 512)
kernel_size – the kernel size of the first Conv1d layer. (Default: 5)
n_freq – the number of bins in a spectrogram. (Default: 128)
n_hidden – the number of hidden dimensions of resblock. (Default: 128)
n_output – the number of output dimensions of melresnet. (Default: 128)
- Example
>>> import torchaudio
>>> from torchaudio.models import WaveRNN
>>> from torchaudio.transforms import MelSpectrogram
>>> wavernn = WaveRNN(upsample_scales=[5,5,8], n_classes=512, hop_length=200)
>>> waveform, sample_rate = torchaudio.load(file)
>>> # waveform shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length)
>>> specgram = MelSpectrogram(sample_rate)(waveform)  # shape: (n_batch, n_channel, n_freq, n_time)
>>> output = wavernn(waveform, specgram)
>>> # output shape: (n_batch, n_channel, (n_time - kernel_size + 1) * hop_length, n_classes)
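The constraint that the product of upsample_scales must equal hop_length can be verified before constructing the model. The following is a minimal sketch in plain Python; `check_upsample_scales` is a hypothetical helper and is not part of torchaudio.

```python
import math
from typing import List


def check_upsample_scales(upsample_scales: List[int], hop_length: int) -> None:
    """Raise ValueError if the product of the scales differs from hop_length."""
    product = math.prod(upsample_scales)
    if product != hop_length:
        raise ValueError(
            f"product of upsample_scales is {product}, "
            f"expected hop_length={hop_length}"
        )


# The values from the example above satisfy the constraint: 5 * 5 * 8 == 200.
check_upsample_scales([5, 5, 8], hop_length=200)
```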
forward(waveform: torch.Tensor, specgram: torch.Tensor) → torch.Tensor
Pass the input through the WaveRNN model.
- Parameters
waveform – the input waveform to the WaveRNN layer (n_batch, 1, (n_time - kernel_size + 1) * hop_length)
specgram – the input spectrogram to the WaveRNN layer (n_batch, 1, n_freq, n_time)
- Returns
Tensor of shape (n_batch, 1, (n_time - kernel_size + 1) * hop_length, n_classes)
- Return type
Tensor
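The shape relationships documented for forward can be worked through numerically. The sketch below is plain Python with a hypothetical helper, `wavernn_shapes`, which only restates the documented shape formulas and does not call torchaudio. Its defaults reuse the documented `kernel_size` and `n_freq` defaults and the `hop_length` and `n_classes` values from the class example.

```python
from typing import Dict, Tuple


def wavernn_shapes(
    n_batch: int,
    n_time: int,
    n_freq: int = 128,
    kernel_size: int = 5,
    hop_length: int = 200,
    n_classes: int = 512,
) -> Dict[str, Tuple[int, ...]]:
    # Number of waveform samples implied by the spectrogram length.
    n_samples = (n_time - kernel_size + 1) * hop_length
    return {
        "waveform": (n_batch, 1, n_samples),
        "specgram": (n_batch, 1, n_freq, n_time),
        "output": (n_batch, 1, n_samples, n_classes),
    }


shapes = wavernn_shapes(n_batch=2, n_time=10)
print(shapes["waveform"])  # (2, 1, 1200)
print(shapes["output"])    # (2, 1, 1200, 512)
```

A spectrogram with 10 frames therefore pairs with a waveform of (10 - 5 + 1) * 200 = 1200 samples.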