torchaudio.prototype.models¶
conformer_rnnt_model¶
- torchaudio.prototype.models.conformer_rnnt_model(*, input_dim: int, encoding_dim: int, time_reduction_stride: int, conformer_input_dim: int, conformer_ffn_dim: int, conformer_num_layers: int, conformer_num_heads: int, conformer_depthwise_conv_kernel_size: int, conformer_dropout: float, num_symbols: int, symbol_embedding_dim: int, num_lstm_layers: int, lstm_hidden_dim: int, lstm_layer_norm: int, lstm_layer_norm_epsilon: int, lstm_dropout: int, joiner_activation: str) RNNT [source]¶
Builds a Conformer-based recurrent neural network transducer (RNN-T) model.
- Parameters:
input_dim (int) – dimension of input sequence frames passed to transcription network.
encoding_dim (int) – dimension of transcription- and prediction-network-generated encodings passed to joint network.
time_reduction_stride (int) – factor by which to reduce length of input sequence.
conformer_input_dim (int) – dimension of Conformer input.
conformer_ffn_dim (int) – hidden layer dimension of each Conformer layer’s feedforward network.
conformer_num_layers (int) – number of Conformer layers to instantiate.
conformer_num_heads (int) – number of attention heads in each Conformer layer.
conformer_depthwise_conv_kernel_size (int) – kernel size of each Conformer layer’s depthwise convolution layer.
conformer_dropout (float) – Conformer dropout probability.
num_symbols (int) – cardinality of set of target tokens.
symbol_embedding_dim (int) – dimension of each target token embedding.
num_lstm_layers (int) – number of LSTM layers to instantiate.
lstm_hidden_dim (int) – output dimension of each LSTM layer.
lstm_layer_norm (bool) – if True, enables layer normalization for LSTM layers.
lstm_layer_norm_epsilon (float) – value of epsilon to use in LSTM layer normalization layers.
lstm_dropout (float) – LSTM dropout probability.
joiner_activation (str) – activation function to use in the joiner. Must be one of (“relu”, “tanh”). (Default: “relu”)
- Returns:
Conformer RNN-T model.
- Return type:
RNNT
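Example (a minimal construction and forward-pass sketch; the hyperparameter values and tensor shapes below are illustrative only, not library defaults):
>>> import torch
>>> from torchaudio.prototype.models import conformer_rnnt_model
>>> rnnt = conformer_rnnt_model(
...     input_dim=80,
...     encoding_dim=1024,
...     time_reduction_stride=4,
...     conformer_input_dim=256,
...     conformer_ffn_dim=1024,
...     conformer_num_layers=16,
...     conformer_num_heads=4,
...     conformer_depthwise_conv_kernel_size=31,
...     conformer_dropout=0.1,
...     num_symbols=1024,
...     symbol_embedding_dim=256,
...     num_lstm_layers=2,
...     lstm_hidden_dim=512,
...     lstm_layer_norm=True,
...     lstm_layer_norm_epsilon=1e-5,
...     lstm_dropout=0.3,
...     joiner_activation="tanh",
... )
>>> # Joint training forward pass over dummy features and token targets.
>>> sources = torch.rand(2, 200, 80)                            # (batch, time, input_dim)
>>> source_lengths = torch.full((2,), 200, dtype=torch.int32)
>>> targets = torch.randint(0, 1024, (2, 20), dtype=torch.int32)
>>> target_lengths = torch.full((2,), 20, dtype=torch.int32)
>>> output, src_lengths, tgt_lengths, _ = rnnt(sources, source_lengths, targets, target_lengths)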
conformer_rnnt_base¶
emformer_hubert_model¶
- torchaudio.prototype.models.emformer_hubert_model(extractor_input_dim: int, extractor_output_dim: int, extractor_use_bias: bool, extractor_stride: int, encoder_input_dim: int, encoder_output_dim: int, encoder_num_heads: int, encoder_ffn_dim: int, encoder_num_layers: int, encoder_segment_length: int, encoder_left_context_length: int, encoder_right_context_length: int, encoder_dropout: float, encoder_activation: str, encoder_max_memory_size: int, encoder_weight_init_scale_strategy: Optional[str], encoder_tanh_on_mem: bool, aux_num_out: Optional[int]) Wav2Vec2Model [source]¶
Build a custom Emformer HuBERT model.
- Parameters:
extractor_input_dim (int) – The input dimension for feature extractor.
extractor_output_dim (int) – The output dimension after feature extractor.
extractor_use_bias (bool) – If True, enables the bias parameter in the linear layer of the feature extractor.
extractor_stride (int) – Number of frames to merge for the output frame in feature extractor.
encoder_input_dim (int) – The input dimension for Emformer layer.
encoder_output_dim (int) – The output dimension after EmformerEncoder.
encoder_num_heads (int) – Number of attention heads in each Emformer layer.
encoder_ffn_dim (int) – Hidden layer dimension of feedforward network in Emformer.
encoder_num_layers (int) – Number of Emformer layers to instantiate.
encoder_segment_length (int) – Length of each input segment.
encoder_left_context_length (int) – Length of left context.
encoder_right_context_length (int) – Length of right context.
encoder_dropout (float) – Dropout probability.
encoder_activation (str) – Activation function to use in each Emformer layer’s feedforward network. Must be one of (“relu”, “gelu”, “silu”).
encoder_max_memory_size (int) – Maximum number of memory elements to use.
encoder_weight_init_scale_strategy (str or None) – Per-layer weight initialization scaling strategy. Must be one of (“depthwise”, “constant”, None).
encoder_tanh_on_mem (bool) – If True, applies tanh to memory elements.
aux_num_out (int or None) – When provided, attaches an extra linear layer on top of the encoder, which can be used for fine-tuning.
- Returns:
The resulting torchaudio.models.Wav2Vec2Model model with a torchaudio.models.Emformer encoder.
- Return type:
Wav2Vec2Model
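Example (a construction sketch; the values below are illustrative, loosely following the base configuration, and not library defaults):
>>> from torchaudio.prototype.models import emformer_hubert_model
>>> model = emformer_hubert_model(
...     extractor_input_dim=80,
...     extractor_output_dim=128,
...     extractor_use_bias=False,
...     extractor_stride=4,
...     encoder_input_dim=512,
...     encoder_output_dim=1024,
...     encoder_num_heads=8,
...     encoder_ffn_dim=2048,
...     encoder_num_layers=20,
...     encoder_segment_length=4,
...     encoder_left_context_length=30,
...     encoder_right_context_length=1,
...     encoder_dropout=0.1,
...     encoder_activation="gelu",
...     encoder_max_memory_size=0,
...     encoder_weight_init_scale_strategy="depthwise",
...     encoder_tanh_on_mem=True,
...     aux_num_out=None,
... )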
emformer_hubert_base¶
- torchaudio.prototype.models.emformer_hubert_base(extractor_input_dim: int = 80, extractor_output_dim: int = 128, encoder_dropout: float = 0.1, aux_num_out: Optional[int] = None) Wav2Vec2Model [source]¶
Build Emformer HuBERT Model with 20 Emformer layers.
- Parameters:
extractor_input_dim (int, optional) – The input dimension for feature extractor. (Default: 80)
extractor_output_dim (int, optional) – The output dimension after feature extractor. (Default: 128)
encoder_dropout (float, optional) – Dropout probability in Emformer. (Default: 0.1)
aux_num_out (int or None, optional) – Output dimension of aux layer for fine-tuning. (Default: None)
- Returns:
The resulting torchaudio.models.Wav2Vec2Model model with a torchaudio.models.Emformer encoder.
- Return type:
Wav2Vec2Model
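Example (a usage sketch; it assumes the model consumes precomputed 80-dimensional feature frames of shape (batch, frames, extractor_input_dim) rather than raw waveforms, and the aux_num_out value is illustrative):
>>> import torch
>>> from torchaudio.prototype.models import emformer_hubert_base
>>> model = emformer_hubert_base(aux_num_out=500)   # 500-way output head for fine-tuning
>>> features = torch.rand(2, 400, 80)               # assumed (batch, frames, extractor_input_dim)
>>> lengths = torch.tensor([400, 360])
>>> logits, out_lengths = model(features, lengths)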
ConvEmformer¶
- class torchaudio.prototype.models.ConvEmformer(input_dim: int, num_heads: int, ffn_dim: int, num_layers: int, segment_length: int, kernel_size: int, dropout: float = 0.0, ffn_activation: str = 'relu', left_context_length: int = 0, right_context_length: int = 0, max_memory_size: int = 0, weight_init_scale_strategy: Optional[str] = 'depthwise', tanh_on_mem: bool = False, negative_inf: float = -100000000.0, conv_activation: str = 'silu')[source]¶
Implements the convolution-augmented streaming transformer architecture introduced in Streaming Transformer Transducer based Speech Recognition Using Non-Causal Convolution [Shi et al., 2022].
- Parameters:
input_dim (int) – input dimension.
num_heads (int) – number of attention heads in each ConvEmformer layer.
ffn_dim (int) – hidden layer dimension of each ConvEmformer layer’s feedforward network.
num_layers (int) – number of ConvEmformer layers to instantiate.
segment_length (int) – length of each input segment.
kernel_size (int) – size of kernel to use in convolution modules.
dropout (float, optional) – dropout probability. (Default: 0.0)
ffn_activation (str, optional) – activation function to use in feedforward networks. Must be one of (“relu”, “gelu”, “silu”). (Default: “relu”)
left_context_length (int, optional) – length of left context. (Default: 0)
right_context_length (int, optional) – length of right context. (Default: 0)
max_memory_size (int, optional) – maximum number of memory elements to use. (Default: 0)
weight_init_scale_strategy (str or None, optional) – per-layer weight initialization scaling strategy. Must be one of (“depthwise”, “constant”, None). (Default: “depthwise”)
tanh_on_mem (bool, optional) – if True, applies tanh to memory elements. (Default: False)
negative_inf (float, optional) – value to use for negative infinity in attention weights. (Default: -1e8)
conv_activation (str, optional) – activation function to use in convolution modules. Must be one of (“relu”, “gelu”, “silu”). (Default: “silu”)
Examples
>>> conv_emformer = ConvEmformer(80, 4, 1024, 12, 16, 8, right_context_length=4)
>>> input = torch.rand(10, 200, 80)
>>> lengths = torch.randint(1, 200, (10,))
>>> output, lengths = conv_emformer(input, lengths)
>>> input = torch.rand(4, 20, 80)
>>> lengths = torch.ones(4) * 20
>>> output, lengths, states = conv_emformer.infer(input, lengths, None)
- forward(input: Tensor, lengths: Tensor) Tuple[Tensor, Tensor] ¶
Forward pass for training and non-streaming inference.
B: batch size; T: max number of input frames in batch; D: feature dimension of each frame.
- Parameters:
input (torch.Tensor) – utterance frames right-padded with right context frames, with shape (B, T + right_context_length, D).
lengths (torch.Tensor) – with shape (B,) and i-th element representing number of valid utterance frames for i-th batch element in input.
- Returns:
- Tensor
output frames, with shape (B, T, D).
- Tensor
output lengths, with shape (B,) and i-th element representing number of valid frames for i-th batch element in output frames.
- Return type:
(Tensor, Tensor)
- infer(input: Tensor, lengths: Tensor, states: Optional[List[List[Tensor]]] = None) Tuple[Tensor, Tensor, List[List[Tensor]]] ¶
Forward pass for streaming inference.
B: batch size; D: feature dimension of each frame.
- Parameters:
input (torch.Tensor) – utterance frames right-padded with right context frames, with shape (B, segment_length + right_context_length, D).
lengths (torch.Tensor) – with shape (B,) and i-th element representing number of valid frames for i-th batch element in input.
states (List[List[torch.Tensor]] or None, optional) – list of lists of tensors representing internal state generated in preceding invocation of infer. (Default: None)
- Returns:
- Tensor
output frames, with shape (B, segment_length, D).
- Tensor
output lengths, with shape (B,) and i-th element representing number of valid frames for i-th batch element in output frames.
- List[List[Tensor]]
output states; list of lists of tensors representing internal state generated in current invocation of infer.
- Return type:
(Tensor, Tensor, List[List[Tensor]])
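A sketch of chunked streaming inference with infer, using the same configuration as the class example above: each step consumes segment_length + right_context_length frames and threads the returned states into the next call. Only full chunks are shown; in practice a shorter final chunk would need right-padding.
>>> import torch
>>> from torchaudio.prototype.models import ConvEmformer
>>> segment_length, right_context_length = 16, 4
>>> model = ConvEmformer(80, 4, 1024, 12, segment_length, 8, right_context_length=right_context_length)
>>> model = model.eval()
>>> stream = torch.rand(1, 200, 80)                 # full utterance, consumed chunk by chunk
>>> chunk_size = segment_length + right_context_length
>>> states, outputs = None, []
>>> with torch.no_grad():
...     for start in range(0, stream.size(1) - chunk_size + 1, segment_length):
...         chunk = stream[:, start : start + chunk_size]
...         lengths = torch.full((1,), chunk_size)
...         out, out_lengths, states = model.infer(chunk, lengths, states)
...         outputs.append(out)
>>> streamed = torch.cat(outputs, dim=1)            # (1, num_chunks * segment_length, 80)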
ConformerWav2Vec2PretrainModel¶
- class torchaudio.prototype.models.ConformerWav2Vec2PretrainModel(wav2vec2: Wav2Vec2Model, mask_generator: Module, negative_sampler: Module)[source]¶
Conformer Wav2Vec2 pre-train model for training from scratch.
Note
To build the model, please use one of the factory functions: conformer_wav2vec2_base() or conformer_wav2vec2_large().
- Parameters:
wav2vec2 (nn.Module) – Conformer based Wav2Vec2 model, including feature extractor and conformer encoder components.
mask_generator (nn.Module) – Mask generator that generates the mask for masked prediction during training.
negative_sampler (nn.Module) – Negative sampler to apply after masking.
- forward(features: Tensor, audio_lengths: Optional[Tensor] = None) Tuple[Tensor, Optional[Tensor], Tensor, Tensor] [source]¶
- Parameters:
features (Tensor) – Tensor of audio features of shape (batch, frame, dim).
audio_lengths (Tensor or None, optional) – Tensor of valid lengths of each audio sample in the batch, with shape (batch,). (Default: None)
- Returns:
- Tensor
The masked sequences of probability distribution, of shape (batch, frame, dim).
- Tensor or None
If the lengths argument was provided, a Tensor of shape (batch,) representing valid lengths in the time axis is returned.
- Tensor
The mask indices.
- Tensor
The targets, prior to negative sampling.
- Tensor
The negative samples.
- Tensor
The indices of the negative samples.
- Return type:
(Tensor, Optional[Tensor], Tensor, Tensor, Tensor, Tensor)
conformer_wav2vec2_model¶
- torchaudio.prototype.models.conformer_wav2vec2_model(extractor_input_dim: int, extractor_output_dim: int, extractor_stride: int, encoder_embed_dim: int, encoder_projection_dropout: float, encoder_num_layers: int, encoder_num_heads: int, encoder_ff_interm_features: int, encoder_depthwise_conv_kernel_size: Union[int, List[int]], encoder_dropout: float, encoder_convolution_first: bool, encoder_use_group_norm: bool) Wav2Vec2Model [source]¶
Build a custom Conformer Wav2Vec2Model
- Parameters:
extractor_input_dim (int) – Input dimension of the features.
extractor_output_dim (int) – Output dimension after feature extraction.
extractor_stride (int) – Stride used in time reduction layer of feature extraction.
encoder_embed_dim (int) – The dimension of the embedding in the feature projection.
encoder_projection_dropout (float) – The dropout probability applied after the input feature is projected to embed_dim.
encoder_num_layers (int) – Number of Conformer layers in the encoder.
encoder_num_heads (int) – Number of heads in each Conformer layer.
encoder_ff_interm_features (int) – Hidden layer dimension of the feedforward network in each Conformer layer.
encoder_depthwise_conv_kernel_size (int or List[int]) – List of kernel sizes corresponding to each of the Conformer layers. If int is provided, all layers will have the same kernel size.
encoder_dropout (float) – Dropout probability in each Conformer layer.
encoder_convolution_first (bool) – Whether to apply the convolution module ahead of the attention module in each Conformer layer.
encoder_use_group_norm (bool) – Whether to use GroupNorm rather than BatchNorm1d in the convolution module in each Conformer layer.
- Returns:
The resulting wav2vec2 model with a conformer encoder.
- Return type:
Wav2Vec2Model
conformer_wav2vec2_base¶
- torchaudio.prototype.models.conformer_wav2vec2_base(extractor_input_dim: int = 64, extractor_output_dim: int = 256, encoder_projection_dropout: float = 0.0) Wav2Vec2Model [source]¶
Build Conformer Wav2Vec2 Model with “small” architecture from Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [Srivastava et al., 2022]
- Parameters:
extractor_input_dim (int, optional) – Input dimension of the features. (Default: 64)
extractor_output_dim (int, optional) – Output dimension after feature extraction. (Default: 256)
encoder_projection_dropout (float, optional) – The dropout probability applied after the input feature is projected to embed_dim. (Default: 0.0)
- Returns:
The resulting wav2vec2 model with a conformer encoder and base configuration.
- Return type:
Wav2Vec2Model
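Example (a usage sketch; it assumes the model consumes feature frames of shape (batch, frames, extractor_input_dim), i.e. 64-dimensional features under the default configuration):
>>> import torch
>>> from torchaudio.prototype.models import conformer_wav2vec2_base
>>> model = conformer_wav2vec2_base()
>>> features = torch.rand(2, 400, 64)     # assumed (batch, frames, extractor_input_dim)
>>> lengths = torch.tensor([400, 320])
>>> encoded, out_lengths = model(features, lengths)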
conformer_wav2vec2_pretrain_model¶
- torchaudio.prototype.models.conformer_wav2vec2_pretrain_model(extractor_input_dim: int, extractor_output_dim: int, extractor_stride: int, encoder_embed_dim: int, encoder_projection_dropout: float, encoder_num_layers: int, encoder_num_heads: int, encoder_ff_interm_features: int, encoder_depthwise_conv_kernel_size: int, encoder_dropout: float, encoder_convolution_first: bool, encoder_use_group_norm: bool, mask_prob: float, mask_selection: str, mask_other: float, mask_length: int, no_mask_overlap: bool, mask_min_space: int, mask_channel_prob: float, mask_channel_selection: str, mask_channel_other: float, mask_channel_length: int, no_mask_channel_overlap: bool, mask_channel_min_space: int, num_negatives: int, cross_sample_negatives: int) ConformerWav2Vec2PretrainModel [source]¶
Build a custom Conformer Wav2Vec2 Model for pre-training
- Parameters:
extractor_input_dim (int) – Input dimension of the features.
extractor_output_dim (int) – Output dimension after feature extraction.
extractor_stride (int) – Stride used in time reduction layer of feature extraction.
encoder_embed_dim (int) – The dimension of the embedding in the feature projection.
encoder_projection_dropout (float) – The dropout probability applied after the input feature is projected to embed_dim.
encoder_num_layers (int) – Number of Conformer layers in the encoder.
encoder_num_heads (int) – Number of heads in each Conformer layer.
encoder_ff_interm_features (int) – Hidden layer dimension of the feedforward network in each Conformer layer.
encoder_depthwise_conv_kernel_size (int or List[int]) – List of kernel sizes corresponding to each of the Conformer layers. If int is provided, all layers will have the same kernel size.
encoder_dropout (float) – Dropout probability in each Conformer layer.
encoder_convolution_first (bool) – Whether to apply the convolution module ahead of the attention module in each Conformer layer.
encoder_use_group_norm (bool) – Whether to use GroupNorm rather than BatchNorm1d in the convolution module in each Conformer layer.
mask_prob (float) – Probability for each token to be chosen as start of the span to be masked.
mask_selection (str) – How to choose the mask length. Options: [static, uniform, normal, poisson].
mask_other (float) – Secondary mask argument (used for more complex distributions).
mask_length (int) – The lengths of the mask.
no_mask_overlap (bool) – Whether to allow masks to overlap.
mask_min_space (int) – Minimum space between spans (if no overlap is enabled).
mask_channel_prob (float) – The probability of replacing a feature with 0.
mask_channel_selection (str) – How to choose the mask length for channel masking. Options: [static, uniform, normal, poisson].
mask_channel_other (float) – Secondary mask argument for channel masking (used for more complex distributions).
mask_channel_length (int) – The length of the mask for channel masking.
no_mask_channel_overlap (bool) – Whether to allow channel masks to overlap.
mask_channel_min_space (int) – Minimum space between spans for channel masking (if no overlap is enabled).
num_negatives (int) – Number of negatives to sample.
cross_sample_negatives (int) – Number of cross sampled negatives.
- Returns:
The resulting model.
- Return type:
ConformerWav2Vec2PretrainModel
conformer_wav2vec2_pretrain_base¶
- torchaudio.prototype.models.conformer_wav2vec2_pretrain_base(extractor_input_dim: int = 64, extractor_output_dim: int = 256, encoder_projection_dropout: float = 0.0, mask_prob: float = 0.3, mask_length: int = 3, num_negatives: int = 100, cross_sample_negatives: int = 0) ConformerWav2Vec2PretrainModel [source]¶
Build Conformer Wav2Vec2 Model for pre-training with “small” architecture from Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [Srivastava et al., 2022]
- Parameters:
extractor_input_dim (int, optional) – Input dimension of the features. (Default: 64)
extractor_output_dim (int, optional) – Output dimension after feature extraction. (Default: 256)
encoder_projection_dropout (float, optional) – The dropout probability applied after the input feature is projected to embed_dim. (Default: 0.0)
mask_prob (float, optional) – Probability for each token to be chosen as start of the span to be masked. (Default: 0.3)
mask_length (int, optional) – The lengths of the mask. (Default: 3)
num_negatives (int, optional) – Number of sampled negatives. (Default: 100)
cross_sample_negatives (int, optional) – Number of cross sampled negatives. (Default: 0)
- Returns:
The resulting model.
- Return type:
ConformerWav2Vec2PretrainModel
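Example (a pre-training forward-pass sketch; the input shape (batch, frames, extractor_input_dim) is assumed, and the returned tuple contains the masked encodings, valid lengths, mask indices, targets, and negatives as described under ConformerWav2Vec2PretrainModel.forward() above):
>>> import torch
>>> from torchaudio.prototype.models import conformer_wav2vec2_pretrain_base
>>> model = conformer_wav2vec2_pretrain_base(mask_prob=0.3, mask_length=3)
>>> features = torch.rand(2, 400, 64)     # assumed (batch, frames, extractor_input_dim)
>>> lengths = torch.tensor([400, 320])
>>> outputs = model(features, lengths)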
conformer_wav2vec2_pretrain_large¶
- torchaudio.prototype.models.conformer_wav2vec2_pretrain_large(extractor_input_dim: int = 64, extractor_output_dim: int = 256, encoder_projection_dropout: float = 0.0, mask_prob: float = 0.3, mask_length: int = 3, num_negatives: int = 100, cross_sample_negatives: int = 0) ConformerWav2Vec2PretrainModel [source]¶
Build Conformer Wav2Vec2 Model for pre-training with “large” architecture from Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [Srivastava et al., 2022]
- Parameters:
extractor_input_dim (int, optional) – Input dimension of the features. (Default: 64)
extractor_output_dim (int, optional) – Output dimension after feature extraction. (Default: 256)
encoder_projection_dropout (float, optional) – The dropout probability applied after the input feature is projected to embed_dim. (Default: 0.0)
mask_prob (float, optional) – Probability for each token to be chosen as start of the span to be masked. (Default: 0.3)
mask_length (int, optional) – The lengths of the mask. (Default: 3)
num_negatives (int, optional) – Number of sampled negatives. (Default: 100)
cross_sample_negatives (int, optional) – Number of cross sampled negatives. (Default: 0)
- Returns:
The resulting model.
- Return type:
ConformerWav2Vec2PretrainModel
HiFiGANVocoder¶
- class torchaudio.prototype.models.HiFiGANVocoder(in_channels: int, upsample_rates: Tuple[int, ...], upsample_initial_channel: int, upsample_kernel_sizes: Tuple[int, ...], resblock_kernel_sizes: Tuple[int, ...], resblock_dilation_sizes: Tuple[Tuple[int, ...], ...], resblock_type: int, lrelu_slope: float)[source]¶
Generator part of HiFi GAN [Kong et al., 2020]. Source: https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/models.py#L75
Note
To build the model, please use one of the factory functions: hifigan_vocoder(), hifigan_vocoder_v1(), hifigan_vocoder_v2(), or hifigan_vocoder_v3().
- Parameters:
in_channels (int) – Number of channels in the input features.
upsample_rates (tuple of int) – Factors by which each upsampling layer increases the time dimension.
upsample_initial_channel (int) – Number of channels in the input feature tensor.
upsample_kernel_sizes (tuple of int) – Kernel size for each upsampling layer.
resblock_kernel_sizes (tuple of int) – Kernel size for each residual block.
resblock_dilation_sizes (tuple of tuples of int) – Dilation sizes for each 1D convolutional layer in each residual block. For resblock type 1 the inner tuples should have length 3, because there are 3 convolutions in each layer; for resblock type 2 they should have length 2.
resblock_type (int, 1 or 2) – Determines whether ResBlock1 or ResBlock2 will be used.
lrelu_slope (float) – Slope of leaky ReLUs in activations.
hifigan_vocoder¶
- torchaudio.prototype.models.hifigan_vocoder(in_channels: int, upsample_rates: Tuple[int, ...], upsample_initial_channel: int, upsample_kernel_sizes: Tuple[int, ...], resblock_kernel_sizes: Tuple[int, ...], resblock_dilation_sizes: Tuple[Tuple[int, ...], ...], resblock_type: int, lrelu_slope: float) HiFiGANVocoder [source]¶
Builds HiFi GAN Vocoder [Kong et al., 2020].
- Parameters:
in_channels (int) – See HiFiGANVocoder.
upsample_rates (tuple of int) – See HiFiGANVocoder.
upsample_initial_channel (int) – See HiFiGANVocoder.
upsample_kernel_sizes (tuple of int) – See HiFiGANVocoder.
resblock_kernel_sizes (tuple of int) – See HiFiGANVocoder.
resblock_dilation_sizes (tuple of tuples of int) – See HiFiGANVocoder.
resblock_type (int, 1 or 2) – See HiFiGANVocoder.
lrelu_slope (float) – See HiFiGANVocoder.
- Returns:
generated model.
- Return type:
HiFiGANVocoder
hifigan_vocoder_v1¶
- torchaudio.prototype.models.hifigan_vocoder_v1() HiFiGANVocoder [source]¶
Builds HiFiGAN Vocoder with V1 architecture [Kong et al., 2020].
- Returns:
generated model.
- Return type:
HiFiGANVocoder
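Example (a usage sketch; it assumes the V1 configuration takes an 80-band mel-spectrogram of shape (batch, channels, frames) and returns an upsampled waveform of shape (batch, 1, time)):
>>> import torch
>>> from torchaudio.prototype.models import hifigan_vocoder_v1
>>> vocoder = hifigan_vocoder_v1().eval()
>>> mel = torch.rand(1, 80, 200)          # assumed (batch, in_channels, frames)
>>> with torch.no_grad():
...     waveform = vocoder(mel)           # (1, 1, frames * prod(upsample_rates))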
hifigan_vocoder_v2¶
- torchaudio.prototype.models.hifigan_vocoder_v2() HiFiGANVocoder [source]¶
Builds HiFiGAN Vocoder with V2 architecture [Kong et al., 2020].
- Returns:
generated model.
- Return type:
HiFiGANVocoder
hifigan_vocoder_v3¶
- torchaudio.prototype.models.hifigan_vocoder_v3() HiFiGANVocoder [source]¶
Builds HiFiGAN Vocoder with V3 architecture [Kong et al., 2020].
- Returns:
generated model.
- Return type:
HiFiGANVocoder