
torchaudio.prototype.models

conformer_rnnt_model

torchaudio.prototype.models.conformer_rnnt_model(*, input_dim: int, encoding_dim: int, time_reduction_stride: int, conformer_input_dim: int, conformer_ffn_dim: int, conformer_num_layers: int, conformer_num_heads: int, conformer_depthwise_conv_kernel_size: int, conformer_dropout: float, num_symbols: int, symbol_embedding_dim: int, num_lstm_layers: int, lstm_hidden_dim: int, lstm_layer_norm: bool, lstm_layer_norm_epsilon: float, lstm_dropout: float, joiner_activation: str) RNNT[source]

Builds a Conformer-based recurrent neural network transducer (RNN-T) model.

Parameters:
  • input_dim (int) – dimension of input sequence frames passed to transcription network.

  • encoding_dim (int) – dimension of transcription- and prediction-network-generated encodings passed to joint network.

  • time_reduction_stride (int) – factor by which to reduce length of input sequence.

  • conformer_input_dim (int) – dimension of Conformer input.

  • conformer_ffn_dim (int) – hidden layer dimension of each Conformer layer’s feedforward network.

  • conformer_num_layers (int) – number of Conformer layers to instantiate.

  • conformer_num_heads (int) – number of attention heads in each Conformer layer.

  • conformer_depthwise_conv_kernel_size (int) – kernel size of each Conformer layer’s depthwise convolution layer.

  • conformer_dropout (float) – Conformer dropout probability.

  • num_symbols (int) – cardinality of set of target tokens.

  • symbol_embedding_dim (int) – dimension of each target token embedding.

  • num_lstm_layers (int) – number of LSTM layers to instantiate.

  • lstm_hidden_dim (int) – output dimension of each LSTM layer.

  • lstm_layer_norm (bool) – if True, enables layer normalization for LSTM layers.

  • lstm_layer_norm_epsilon (float) – value of epsilon to use in LSTM layer normalization layers.

  • lstm_dropout (float) – LSTM dropout probability.

  • joiner_activation (str) – activation function to use in the joiner. Must be one of (“relu”, “tanh”). (Default: “relu”)

Returns:

Conformer RNN-T model.

Return type:

RNNT
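
Example (illustrative): a minimal construction sketch. All arguments are keyword-only; the values below are placeholders chosen for demonstration, not a recommended or validated configuration.

>>> from torchaudio.prototype.models import conformer_rnnt_model
>>> # Placeholder hyperparameters; tune these for your task.
>>> rnnt = conformer_rnnt_model(
...     input_dim=80,
...     encoding_dim=1024,
...     time_reduction_stride=4,
...     conformer_input_dim=256,
...     conformer_ffn_dim=1024,
...     conformer_num_layers=4,
...     conformer_num_heads=4,
...     conformer_depthwise_conv_kernel_size=31,
...     conformer_dropout=0.1,
...     num_symbols=1024,
...     symbol_embedding_dim=256,
...     num_lstm_layers=2,
...     lstm_hidden_dim=512,
...     lstm_layer_norm=True,
...     lstm_layer_norm_epsilon=1e-5,
...     lstm_dropout=0.3,
...     joiner_activation="tanh",
... )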

conformer_rnnt_base

torchaudio.prototype.models.conformer_rnnt_base() RNNT[source]

Builds a basic version of the Conformer RNN-T model.

Returns:

Conformer RNN-T model.

Return type:

RNNT
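
Example (illustrative): a training-style forward pass. It assumes the base configuration consumes 80-dimensional feature frames and uses a vocabulary of 1024 symbols; verify these against the source before relying on them.

>>> import torch
>>> from torchaudio.prototype.models import conformer_rnnt_base
>>> rnnt = conformer_rnnt_base()
>>> sources = torch.rand(2, 300, 80)          # (B, T, input_dim); 80-dim features assumed
>>> source_lengths = torch.tensor([300, 280])
>>> targets = torch.randint(0, 1024, (2, 20), dtype=torch.int32)   # (B, U); 1024-symbol vocabulary assumed
>>> target_lengths = torch.tensor([20, 17], dtype=torch.int32)
>>> outputs = rnnt(sources, source_lengths, targets, target_lengths)
>>> outputs[0].shape   # joint-network output of shape (B, T_out, U_out, num_symbols)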

emformer_hubert_model

torchaudio.prototype.models.emformer_hubert_model(extractor_input_dim: int, extractor_output_dim: int, extractor_use_bias: bool, extractor_stride: int, encoder_input_dim: int, encoder_output_dim: int, encoder_num_heads: int, encoder_ffn_dim: int, encoder_num_layers: int, encoder_segment_length: int, encoder_left_context_length: int, encoder_right_context_length: int, encoder_dropout: float, encoder_activation: str, encoder_max_memory_size: int, encoder_weight_init_scale_strategy: Optional[str], encoder_tanh_on_mem: bool, aux_num_out: Optional[int]) Wav2Vec2Model[source]

Build a custom Emformer HuBERT model.

Parameters:
  • extractor_input_dim (int) – The input dimension for feature extractor.

  • extractor_output_dim (int) – The output dimension after feature extractor.

  • extractor_use_bias (bool) – If True, enable bias parameter in the linear layer of feature extractor.

  • extractor_stride (int) – Number of frames to merge for the output frame in feature extractor.

  • encoder_input_dim (int) – The input dimension for Emformer layer.

  • encoder_output_dim (int) – The output dimension after EmformerEncoder.

  • encoder_num_heads (int) – Number of attention heads in each Emformer layer.

  • encoder_ffn_dim (int) – Hidden layer dimension of feedforward network in Emformer.

  • encoder_num_layers (int) – Number of Emformer layers to instantiate.

  • encoder_segment_length (int) – Length of each input segment.

  • encoder_left_context_length (int) – Length of left context.

  • encoder_right_context_length (int) – Length of right context.

  • encoder_dropout (float) – Dropout probability.

  • encoder_activation (str) – Activation function to use in each Emformer layer’s feedforward network. Must be one of (“relu”, “gelu”, “silu”).

  • encoder_max_memory_size (int) – Maximum number of memory elements to use.

  • encoder_weight_init_scale_strategy (str or None) – Per-layer weight initialization scaling strategy. Must be one of (“depthwise”, “constant”, None).

  • encoder_tanh_on_mem (bool) – If True, applies tanh to memory elements.

  • aux_num_out (int or None) – When provided, attach an extra linear layer on top of encoder, which can be used for fine-tuning.

Returns:

The resulting torchaudio.models.Wav2Vec2Model model with a torchaudio.models.Emformer encoder.

Return type:

Wav2Vec2Model
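
Example (illustrative): a construction sketch with placeholder values. It assumes the Emformer input dimension should equal extractor_output_dim * extractor_stride, mirroring the base configuration; treat the numbers as placeholders rather than a tested recipe.

>>> from torchaudio.prototype.models import emformer_hubert_model
>>> model = emformer_hubert_model(
...     extractor_input_dim=80,
...     extractor_output_dim=128,
...     extractor_use_bias=False,
...     extractor_stride=4,
...     encoder_input_dim=512,        # assumed: extractor_output_dim * extractor_stride
...     encoder_output_dim=1024,
...     encoder_num_heads=8,
...     encoder_ffn_dim=2048,
...     encoder_num_layers=12,
...     encoder_segment_length=32,
...     encoder_left_context_length=32,
...     encoder_right_context_length=4,
...     encoder_dropout=0.1,
...     encoder_activation="gelu",
...     encoder_max_memory_size=0,
...     encoder_weight_init_scale_strategy="depthwise",
...     encoder_tanh_on_mem=True,
...     aux_num_out=None,
... )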

emformer_hubert_base

torchaudio.prototype.models.emformer_hubert_base(extractor_input_dim: int = 80, extractor_output_dim: int = 128, encoder_dropout: float = 0.1, aux_num_out: Optional[int] = None) Wav2Vec2Model[source]

Build Emformer HuBERT Model with 20 Emformer layers.

Parameters:
  • extractor_input_dim (int, optional) – The input dimension for feature extractor. (Default: 80)

  • extractor_output_dim (int, optional) – The output dimension after feature extractor. (Default: 128)

  • encoder_dropout (float, optional) – Dropout probability in Emformer. (Default: 0.1)

  • aux_num_out (int or None, optional) – Output dimension of aux layer for fine-tuning. (Default: None)

Returns:

The resulting torchaudio.models.Wav2Vec2Model model with a torchaudio.models.Emformer encoder.

Return type:

Wav2Vec2Model
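
Example (illustrative): building the base model for pre-training (no auxiliary head) and for fine-tuning. The 500-class output size is an arbitrary placeholder, not a value prescribed by the library.

>>> from torchaudio.prototype.models import emformer_hubert_base
>>> encoder_only = emformer_hubert_base()            # no aux layer: encoder features only
>>> model = emformer_hubert_base(aux_num_out=500)    # fine-tuning style: linear head with 500 outputs (placeholder)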

ConvEmformer

class torchaudio.prototype.models.ConvEmformer(input_dim: int, num_heads: int, ffn_dim: int, num_layers: int, segment_length: int, kernel_size: int, dropout: float = 0.0, ffn_activation: str = 'relu', left_context_length: int = 0, right_context_length: int = 0, max_memory_size: int = 0, weight_init_scale_strategy: Optional[str] = 'depthwise', tanh_on_mem: bool = False, negative_inf: float = -100000000.0, conv_activation: str = 'silu')[source]

Implements the convolution-augmented streaming transformer architecture introduced in Streaming Transformer Transducer based Speech Recognition Using Non-Causal Convolution [Shi et al., 2022].

Parameters:
  • input_dim (int) – input dimension.

  • num_heads (int) – number of attention heads in each ConvEmformer layer.

  • ffn_dim (int) – hidden layer dimension of each ConvEmformer layer’s feedforward network.

  • num_layers (int) – number of ConvEmformer layers to instantiate.

  • segment_length (int) – length of each input segment.

  • kernel_size (int) – size of kernel to use in convolution modules.

  • dropout (float, optional) – dropout probability. (Default: 0.0)

  • ffn_activation (str, optional) – activation function to use in feedforward networks. Must be one of (“relu”, “gelu”, “silu”). (Default: “relu”)

  • left_context_length (int, optional) – length of left context. (Default: 0)

  • right_context_length (int, optional) – length of right context. (Default: 0)

  • max_memory_size (int, optional) – maximum number of memory elements to use. (Default: 0)

  • weight_init_scale_strategy (str or None, optional) – per-layer weight initialization scaling strategy. Must be one of (“depthwise”, “constant”, None). (Default: “depthwise”)

  • tanh_on_mem (bool, optional) – if True, applies tanh to memory elements. (Default: False)

  • negative_inf (float, optional) – value to use for negative infinity in attention weights. (Default: -1e8)

  • conv_activation (str, optional) – activation function to use in convolution modules. Must be one of (“relu”, “gelu”, “silu”). (Default: “silu”)

Examples

>>> conv_emformer = ConvEmformer(80, 4, 1024, 12, 16, 8, right_context_length=4)
>>> input = torch.rand(10, 200, 80)
>>> lengths = torch.randint(1, 200, (10,))
>>> output, lengths = conv_emformer(input, lengths)
>>> input = torch.rand(4, 20, 80)
>>> lengths = torch.ones(4) * 20
>>> output, lengths, states = conv_emformer.infer(input, lengths, None)
forward(input: Tensor, lengths: Tensor) Tuple[Tensor, Tensor]

Forward pass for training and non-streaming inference.

B: batch size; T: max number of input frames in batch; D: feature dimension of each frame.

Parameters:
  • input (torch.Tensor) – utterance frames right-padded with right context frames, with shape (B, T + right_context_length, D).

  • lengths (torch.Tensor) – with shape (B,) and i-th element representing number of valid utterance frames for i-th batch element in input.

Returns:

Tensor

output frames, with shape (B, T, D).

Tensor

output lengths, with shape (B,) and i-th element representing number of valid frames for i-th batch element in output frames.

Return type:

(Tensor, Tensor)

infer(input: Tensor, lengths: Tensor, states: Optional[List[List[Tensor]]] = None) Tuple[Tensor, Tensor, List[List[Tensor]]]

Forward pass for streaming inference.

B: batch size; D: feature dimension of each frame.

Parameters:
  • input (torch.Tensor) – utterance frames right-padded with right context frames, with shape (B, segment_length + right_context_length, D).

  • lengths (torch.Tensor) – with shape (B,) and i-th element representing number of valid frames for i-th batch element in input.

  • states (List[List[torch.Tensor]] or None, optional) – list of lists of tensors representing internal state generated in preceding invocation of infer. (Default: None)

Returns:

Tensor

output frames, with shape (B, segment_length, D).

Tensor

output lengths, with shape (B,) and i-th element representing number of valid frames for i-th batch element in output frames.

List[List[Tensor]]

output states; list of lists of tensors representing internal state generated in current invocation of infer.

Return type:

(Tensor, Tensor, List[List[Tensor]])
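
The snippet below sketches state-carrying streaming inference with infer(): each call consumes segment_length + right_context_length frames and the returned states are fed back into the next call. The configuration mirrors the class example above; chunk count and batch size are arbitrary.

>>> import torch
>>> from torchaudio.prototype.models import ConvEmformer
>>> conv_emformer = ConvEmformer(80, 4, 1024, 12, 16, 8, right_context_length=4)
>>> segment_length, right_context_length = 16, 4
>>> states = None
>>> for _ in range(5):
...     chunk = torch.rand(1, segment_length + right_context_length, 80)
...     lengths = torch.full((1,), segment_length + right_context_length)
...     output, out_lengths, states = conv_emformer.infer(chunk, lengths, states)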

ConformerWav2Vec2PretrainModel

class torchaudio.prototype.models.ConformerWav2Vec2PretrainModel(wav2vec2: Wav2Vec2Model, mask_generator: Module, negative_sampler: Module)[source]

Conformer Wav2Vec2 pre-train model for training from scratch.

Note

To build the model, please use one of the factory functions: conformer_wav2vec2_base() or conformer_wav2vec2_large().

Parameters:
  • wav2vec2 (nn.Module) – Conformer based Wav2Vec2 model, including feature extractor and conformer encoder components.

  • mask_generator (nn.Module) – Mask generator that generates the mask for masked prediction during training.

  • negative_sampler (nn.Module) – Negative sampler to apply after masking.

forward(features: Tensor, audio_lengths: Optional[Tensor] = None) Tuple[Tensor, Optional[Tensor], Tensor, Tensor][source]
Parameters:
  • features (Tensor) – Tensor of audio features of shape (batch, frame, dim).

  • audio_lengths (Tensor or None, optional) – Tensor of valid lengths of each audio in the batch, with shape (batch,). (Default: None)

Returns:

Tensor

The masked sequences of probability distribution, of shape (batch, frame, dim).

Tensor or None

If the lengths argument was provided, a Tensor of shape (batch,) representing the valid length in the time axis is returned.

Tensor

The mask indices.

Tensor

The targets, prior to negative sampling.

Tensor

The negative samples.

Tensor

The indices of the negative samples.

Return type:

(Tensor, Optional[Tensor], Tensor, Tensor, Tensor, Tensor)

conformer_wav2vec2_model

torchaudio.prototype.models.conformer_wav2vec2_model(extractor_input_dim: int, extractor_output_dim: int, extractor_stride: int, encoder_embed_dim: int, encoder_projection_dropout: float, encoder_num_layers: int, encoder_num_heads: int, encoder_ff_interm_features: int, encoder_depthwise_conv_kernel_size: Union[int, List[int]], encoder_dropout: float, encoder_convolution_first: bool, encoder_use_group_norm: bool) Wav2Vec2Model[source]

Build a custom Conformer Wav2Vec2Model.

Parameters:
  • extractor_input_dim (int) – Input dimension of the features.

  • extractor_output_dim (int) – Output dimension after feature extraction.

  • extractor_stride (int) – Stride used in time reduction layer of feature extraction.

  • encoder_embed_dim (int) – The dimension of the embedding in the feature projection.

  • encoder_projection_dropout (float) – The dropout probability applied after the input feature is projected to embed_dim.

  • encoder_num_layers (int) – Number of Conformer layers in the encoder.

  • encoder_num_heads (int) – Number of heads in each Conformer layer.

  • encoder_ff_interm_features (int) – Hidden layer dimension of the feedforward network in each Conformer layer.

  • encoder_depthwise_conv_kernel_size (int or List[int]) – List of kernel sizes corresponding to each of the Conformer layers. If int is provided, all layers will have the same kernel size.

  • encoder_dropout (float) – Dropout probability in each Conformer layer.

  • encoder_convolution_first (bool) – Whether to apply the convolution module ahead of the attention module in each Conformer layer.

  • encoder_use_group_norm (bool) – Whether to use GroupNorm rather than BatchNorm1d in the convolution module in each Conformer layer.

Returns:

The resulting wav2vec2 model with a conformer encoder.

Return type:

Wav2Vec2Model

conformer_wav2vec2_base

torchaudio.prototype.models.conformer_wav2vec2_base(extractor_input_dim: int = 64, extractor_output_dim: int = 256, encoder_projection_dropout: float = 0.0) Wav2Vec2Model[source]

Build Conformer Wav2Vec2 Model with the “small” architecture from Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [Srivastava et al., 2022].

Parameters:
  • extractor_input_dim (int, optional) – Input dimension of feature extractor. (Default: 64)

  • extractor_output_dim (int, optional) – Output dimension of feature extractor. (Default: 256)

  • encoder_projection_dropout (float, optional) – Dropout probability applied after feature projection. (Default: 0.0)

Returns:

The resulting wav2vec2 model with a conformer encoder and base configuration.

Return type:

Wav2Vec2Model
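
Example (illustrative): a feature-encoding forward pass. It assumes the model consumes pre-computed feature frames of dimension extractor_input_dim (64 by default) and returns frame-level encodings together with their valid lengths.

>>> import torch
>>> from torchaudio.prototype.models import conformer_wav2vec2_base
>>> model = conformer_wav2vec2_base()
>>> features = torch.rand(2, 400, 64)      # (batch, frames, extractor_input_dim); 64-dim input assumed
>>> lengths = torch.tensor([400, 350])
>>> encodings, out_lengths = model(features, lengths)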

conformer_wav2vec2_pretrain_model

torchaudio.prototype.models.conformer_wav2vec2_pretrain_model(extractor_input_dim: int, extractor_output_dim: int, extractor_stride: int, encoder_embed_dim: int, encoder_projection_dropout: float, encoder_num_layers: int, encoder_num_heads: int, encoder_ff_interm_features: int, encoder_depthwise_conv_kernel_size: int, encoder_dropout: float, encoder_convolution_first: bool, encoder_use_group_norm: bool, mask_prob: float, mask_selection: str, mask_other: float, mask_length: int, no_mask_overlap: bool, mask_min_space: int, mask_channel_prob: float, mask_channel_selection: str, mask_channel_other: float, mask_channel_length: int, no_mask_channel_overlap: bool, mask_channel_min_space: int, num_negatives: int, cross_sample_negatives: int) ConformerWav2Vec2PretrainModel[source]

Build a custom Conformer Wav2Vec2 Model for pre-training.

Parameters:
  • extractor_input_dim (int) – Input dimension of the features.

  • extractor_output_dim (int) – Output dimension after feature extraction.

  • extractor_stride (int) – Stride used in time reduction layer of feature extraction.

  • encoder_embed_dim (int) – The dimension of the embedding in the feature projection.

  • encoder_projection_dropout (float) – The dropout probability applied after the input feature is projected to embed_dim.

  • encoder_num_layers (int) – Number of Conformer layers in the encoder.

  • encoder_num_heads (int) – Number of heads in each Conformer layer.

  • encoder_ff_interm_features (int) – Hidden layer dimension of the feedforward network in each Conformer layer.

  • encoder_depthwise_conv_kernel_size (int or List[int]) – List of kernel sizes corresponding to each of the Conformer layers. If int is provided, all layers will have the same kernel size.

  • encoder_dropout (float) – Dropout probability in each Conformer layer.

  • encoder_convolution_first (bool) – Whether to apply the convolution module ahead of the attention module in each Conformer layer.

  • encoder_use_group_norm (bool) – Whether to use GroupNorm rather than BatchNorm1d in the convolution module in each Conformer layer.

  • mask_prob (float) – Probability for each token to be chosen as start of the span to be masked.

  • mask_selection (str) – How to choose the mask length. Options: [static, uniform, normal, poisson].

  • mask_other (float) – Secondary mask argument (used for more complex distributions).

  • mask_length (int) – The lengths of the mask.

  • no_mask_overlap (bool) – Whether to allow masks to overlap.

  • mask_min_space (int) – Minimum space between spans (if no overlap is enabled).

  • mask_channel_prob (float) – The probability of replacing a feature with 0.

  • mask_channel_selection (str) – How to choose the mask length for channel masking. Options: [static, uniform, normal, poisson].

  • mask_channel_other (float) – Secondary mask argument for channel masking (used for more complex distributions).

  • mask_channel_length (int) – The length of the mask for channel masking.

  • no_mask_channel_overlap (bool) – Whether to allow channel masks to overlap.

  • mask_channel_min_space (int) – Minimum space between spans for channel masking (if no overlap is enabled).

  • num_negatives (int) – Number of negatives to sample.

  • cross_sample_negatives (int) – Number of cross sampled negatives.

Returns:

The resulting model.

Return type:

ConformerWav2Vec2PretrainModel

conformer_wav2vec2_pretrain_base

torchaudio.prototype.models.conformer_wav2vec2_pretrain_base(extractor_input_dim: int = 64, extractor_output_dim: int = 256, encoder_projection_dropout: float = 0.0, mask_prob: float = 0.3, mask_length: int = 3, num_negatives: int = 100, cross_sample_negatives: int = 0) ConformerWav2Vec2PretrainModel[source]

Build Conformer Wav2Vec2 Model for pre-training with the “small” architecture from Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [Srivastava et al., 2022].

Parameters:
  • extractor_input_dim (int, optional) – Input dimension of the features. (Default: 64)

  • extractor_output_dim (int, optional) – Output dimension after feature extraction. (Default: 256)

  • encoder_projection_dropout (float, optional) – The dropout probability applied after the input feature is projected to embed_dim. (Default: 0.0)

  • mask_prob (float, optional) – Probability for each token to be chosen as start of the span to be masked. (Default: 0.3)

  • mask_length (int, optional) – The lengths of the mask. (Default: 3)

  • num_negatives (int, optional) – Number of sampled negatives. (Default: 100)

  • cross_sample_negatives (int, optional) – Number of cross sampled negatives. (Default: 0)

Returns:

The resulting model.

Return type:

ConformerWav2Vec2PretrainModel
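
Example (illustrative): a pre-training forward pass. The input is assumed to be feature frames of dimension extractor_input_dim (64 by default); the outputs are kept as a single tuple here, since the individual tensors are described under ConformerWav2Vec2PretrainModel.forward() above.

>>> import torch
>>> from torchaudio.prototype.models import conformer_wav2vec2_pretrain_base
>>> model = conformer_wav2vec2_pretrain_base()
>>> features = torch.rand(2, 400, 64)          # (batch, frames, extractor_input_dim); 64-dim input assumed
>>> audio_lengths = torch.tensor([400, 360])
>>> outputs = model(features, audio_lengths)   # masked predictions, lengths, mask indices, targets, ...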

conformer_wav2vec2_pretrain_large

torchaudio.prototype.models.conformer_wav2vec2_pretrain_large(extractor_input_dim: int = 64, extractor_output_dim: int = 256, encoder_projection_dropout: float = 0.0, mask_prob: float = 0.3, mask_length: int = 3, num_negatives: int = 100, cross_sample_negatives: int = 0) ConformerWav2Vec2PretrainModel[source]

Build Conformer Wav2Vec2 Model for pre-training with the “large” architecture from Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [Srivastava et al., 2022].

Parameters:
  • extractor_input_dim (int, optional) – Input dimension of the features. (Default: 64)

  • extractor_output_dim (int, optional) – Output dimension after feature extraction. (Default: 256)

  • encoder_projection_dropout (float, optional) – The dropout probability applied after the input feature is projected to embed_dim. (Default: 0.0)

  • mask_prob (float, optional) – Probability for each token to be chosen as start of the span to be masked. (Default: 0.3)

  • mask_length (int, optional) – The lengths of the mask. (Default: 3)

  • num_negatives (int, optional) – Number of sampled negatives. (Default: 100)

  • cross_sample_negatives (int, optional) – Number of cross sampled negatives. (Default: 0)

Returns:

The resulting model.

Return type:

ConformerWav2Vec2PretrainModel

HiFiGANVocoder

class torchaudio.prototype.models.HiFiGANVocoder(in_channels: int, upsample_rates: Tuple[int, ...], upsample_initial_channel: int, upsample_kernel_sizes: Tuple[int, ...], resblock_kernel_sizes: Tuple[int, ...], resblock_dilation_sizes: Tuple[Tuple[int, ...], ...], resblock_type: int, lrelu_slope: float)[source]

Generator part of HiFi GAN [Kong et al., 2020]. Source: https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/models.py#L75

Note

To build the model, please use one of the factory functions: hifigan_vocoder(), hifigan_vocoder_v1(), hifigan_vocoder_v2(), hifigan_vocoder_v3().

Parameters:
  • in_channels (int) – Number of channels in the input features.

  • upsample_rates (tuple of int) – Factors by which each upsampling layer increases the time dimension.

  • upsample_initial_channel (int) – Number of channels in the input feature tensor.

  • upsample_kernel_sizes (tuple of int) – Kernel size for each upsampling layer.

  • resblock_kernel_sizes (tuple of int) – Kernel size for each residual block.

  • resblock_dilation_sizes (tuple of tuples of int) – Dilation sizes for each 1D convolutional layer in each residual block. For resblock type 1 inner tuples should have length 3, because there are 3 convolutions in each layer. For resblock type 2 they should have length 2.

  • resblock_type (int, 1 or 2) – Determines whether ResBlock1 or ResBlock2 will be used.

  • lrelu_slope (float) – Slope of leaky ReLUs in activations.

forward(x: Tensor) Tensor[source]
Parameters:

x (Tensor) – Feature input tensor of shape (batch_size, num_channels, time_length).

Returns:

Tensor of shape (batch_size, 1, time_length * upsample_rate), where upsample_rate is the product of upsample rates for all layers.

hifigan_vocoder

torchaudio.prototype.models.hifigan_vocoder(in_channels: int, upsample_rates: Tuple[int, ...], upsample_initial_channel: int, upsample_kernel_sizes: Tuple[int, ...], resblock_kernel_sizes: Tuple[int, ...], resblock_dilation_sizes: Tuple[Tuple[int, ...], ...], resblock_type: int, lrelu_slope: float) HiFiGANVocoder[source]

Builds HiFi GAN Vocoder [Kong et al., 2020].

Parameters:

See HiFiGANVocoder above; this function accepts the same arguments as the HiFiGANVocoder constructor.

Returns:

generated model.

Return type:

HiFiGANVocoder

hifigan_vocoder_v1

torchaudio.prototype.models.hifigan_vocoder_v1() HiFiGANVocoder[source]

Builds HiFiGAN Vocoder with V1 architecture [Kong et al., 2020].

Returns:

generated model.

Return type:

HiFiGANVocoder
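
Example (illustrative): vocoding with the V1 configuration. It assumes an 80-channel mel-spectrogram input; the output length equals the number of input frames multiplied by the product of the configuration's upsample rates.

>>> import torch
>>> from torchaudio.prototype.models import hifigan_vocoder_v1
>>> vocoder = hifigan_vocoder_v1()
>>> mel = torch.rand(1, 80, 200)   # (batch, channels, frames); 80 mel channels assumed
>>> waveform = vocoder(mel)        # (batch, 1, frames * total_upsample_rate)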

hifigan_vocoder_v2

torchaudio.prototype.models.hifigan_vocoder_v2() HiFiGANVocoder[source]

Builds HiFiGAN Vocoder with V2 architecture [Kong et al., 2020].

Returns:

generated model.

Return type:

HiFiGANVocoder

hifigan_vocoder_v3

torchaudio.prototype.models.hifigan_vocoder_v3() HiFiGANVocoder[source]

Builds HiFiGAN Vocoder with V3 architecture [Kong et al., 2020].

Returns:

generated model.

Return type:

HiFiGANVocoder
