
torchaudio.prototype.models

conformer_rnnt_model

torchaudio.prototype.models.conformer_rnnt_model(*, input_dim: int, encoding_dim: int, time_reduction_stride: int, conformer_input_dim: int, conformer_ffn_dim: int, conformer_num_layers: int, conformer_num_heads: int, conformer_depthwise_conv_kernel_size: int, conformer_dropout: float, num_symbols: int, symbol_embedding_dim: int, num_lstm_layers: int, lstm_hidden_dim: int, lstm_layer_norm: bool, lstm_layer_norm_epsilon: float, lstm_dropout: float, joiner_activation: str) → RNNT[source]

Builds Conformer-based recurrent neural network transducer (RNN-T) model.

Parameters:
  • input_dim (int) – dimension of input sequence frames passed to transcription network.

  • encoding_dim (int) – dimension of transcription- and prediction-network-generated encodings passed to joint network.

  • time_reduction_stride (int) – factor by which to reduce length of input sequence.

  • conformer_input_dim (int) – dimension of Conformer input.

  • conformer_ffn_dim (int) – hidden layer dimension of each Conformer layer’s feedforward network.

  • conformer_num_layers (int) – number of Conformer layers to instantiate.

  • conformer_num_heads (int) – number of attention heads in each Conformer layer.

  • conformer_depthwise_conv_kernel_size (int) – kernel size of each Conformer layer’s depthwise convolution layer.

  • conformer_dropout (float) – Conformer dropout probability.

  • num_symbols (int) – cardinality of set of target tokens.

  • symbol_embedding_dim (int) – dimension of each target token embedding.

  • num_lstm_layers (int) – number of LSTM layers to instantiate.

  • lstm_hidden_dim (int) – output dimension of each LSTM layer.

  • lstm_layer_norm (bool) – if True, enables layer normalization for LSTM layers.

  • lstm_layer_norm_epsilon (float) – value of epsilon to use in LSTM layer normalization layers.

  • lstm_dropout (float) – LSTM dropout probability.

  • joiner_activation (str) – activation function to use in the joiner. Must be one of (“relu”, “tanh”). (Default: “relu”)

Returns:

Conformer RNN-T model.

Return type:

RNNT
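
Example

A minimal construction sketch; the dimension values below are illustrative only (chosen so that conformer_input_dim is divisible by conformer_num_heads), not a recommended or pretrained configuration.

>>> from torchaudio.prototype.models import conformer_rnnt_model
>>> rnnt = conformer_rnnt_model(
...     input_dim=80,
...     encoding_dim=512,
...     time_reduction_stride=4,
...     conformer_input_dim=256,
...     conformer_ffn_dim=1024,
...     conformer_num_layers=4,
...     conformer_num_heads=4,
...     conformer_depthwise_conv_kernel_size=31,
...     conformer_dropout=0.1,
...     num_symbols=1024,
...     symbol_embedding_dim=256,
...     num_lstm_layers=2,
...     lstm_hidden_dim=512,
...     lstm_layer_norm=True,
...     lstm_layer_norm_epsilon=1e-5,
...     lstm_dropout=0.3,
...     joiner_activation="tanh",
... )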

conformer_rnnt_base

torchaudio.prototype.models.conformer_rnnt_base() → RNNT[source]

Builds basic version of Conformer RNN-T model.

Returns:

Conformer RNN-T model.

Return type:

RNNT
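
Example

A minimal usage sketch, assuming 80-dimensional input features for the base configuration; tensor sizes are arbitrary and for illustration only.

>>> import torch
>>> from torchaudio.prototype.models import conformer_rnnt_base
>>> rnnt = conformer_rnnt_base()
>>> sources = torch.rand(2, 100, 80)  # (batch, num_frames, feature_dim)
>>> source_lengths = torch.tensor([100, 80])
>>> # Run only the transcription network; full RNN-T training additionally
>>> # requires targets and target lengths for forward().
>>> encodings, out_lengths = rnnt.transcribe(sources, source_lengths)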

emformer_hubert_model

torchaudio.prototype.models.emformer_hubert_model(extractor_input_dim: int, extractor_output_dim: int, extractor_use_bias: bool, extractor_stride: int, encoder_input_dim: int, encoder_output_dim: int, encoder_num_heads: int, encoder_ffn_dim: int, encoder_num_layers: int, encoder_segment_length: int, encoder_left_context_length: int, encoder_right_context_length: int, encoder_dropout: float, encoder_activation: str, encoder_max_memory_size: int, encoder_weight_init_scale_strategy: Optional[str], encoder_tanh_on_mem: bool) → Wav2Vec2Model[source]

Build a custom Emformer HuBERT model.

Parameters:
  • extractor_input_dim (int) – The input dimension for feature extractor.

  • extractor_output_dim (int) – The output dimension after feature extractor.

  • extractor_use_bias (bool) – If True, enable bias parameter in the linear layer of feature extractor.

  • extractor_stride (int) – Number of frames to merge for the output frame in feature extractor.

  • encoder_input_dim (int) – The input dimension for Emformer layer.

  • encoder_output_dim (int) – The output dimension after EmformerEncoder.

  • encoder_num_heads (int) – Number of attention heads in each Emformer layer.

  • encoder_ffn_dim (int) – Hidden layer dimension of feedforward network in Emformer.

  • encoder_num_layers (int) – Number of Emformer layers to instantiate.

  • encoder_segment_length (int) – Length of each input segment.

  • encoder_left_context_length (int) – Length of left context.

  • encoder_right_context_length (int) – Length of right context.

  • encoder_dropout (float) – Dropout probability.

  • encoder_activation (str) – Activation function to use in each Emformer layer’s feedforward network. Must be one of (“relu”, “gelu”, “silu”).

  • encoder_max_memory_size (int) – Maximum number of memory elements to use.

  • encoder_weight_init_scale_strategy (str or None) – Per-layer weight initialization scaling strategy. Must be one of (“depthwise”, “constant”, None).

  • encoder_tanh_on_mem (bool) – If True, applies tanh to memory elements.

Returns:

The resulting torchaudio.models.Wav2Vec2Model model with a torchaudio.models.Emformer encoder.

Return type:

Wav2Vec2Model
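
Example

A construction sketch with illustrative values only, not a published configuration; the dimensions are chosen to be mutually consistent, on the assumption that the extractor stacks extractor_stride frames so that the encoder input dimension equals extractor_output_dim multiplied by extractor_stride.

>>> from torchaudio.prototype.models import emformer_hubert_model
>>> model = emformer_hubert_model(
...     extractor_input_dim=80,
...     extractor_output_dim=128,
...     extractor_use_bias=False,
...     extractor_stride=4,
...     encoder_input_dim=512,   # assumed extractor_output_dim * extractor_stride
...     encoder_output_dim=1024,
...     encoder_num_heads=8,
...     encoder_ffn_dim=2048,
...     encoder_num_layers=20,
...     encoder_segment_length=4,
...     encoder_left_context_length=30,
...     encoder_right_context_length=0,
...     encoder_dropout=0.1,
...     encoder_activation="gelu",
...     encoder_max_memory_size=0,
...     encoder_weight_init_scale_strategy="depthwise",
...     encoder_tanh_on_mem=True,
... )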

emformer_hubert_base

torchaudio.prototype.models.emformer_hubert_base(extractor_input_dim: int = 80, extractor_output_dim: int = 128, encoder_dropout: float = 0.1) → Wav2Vec2Model[source]

Build Emformer HuBERT Model with 20 Emformer layers.

Parameters:
  • extractor_input_dim (int, optional) – The input dimension for feature extractor. (Default: 80)

  • extractor_output_dim (int, optional) – The output dimension after feature extractor. (Default: 128)

  • encoder_dropout (float, optional) – Dropout probability in Emformer. (Default: 0.1)

Returns:

The resulting torchaudio.models.Wav2Vec2Model model with a torchaudio.models.Emformer encoder.

Return type:

Wav2Vec2Model
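
Example

A minimal usage sketch; the input is a batch of 80-dimensional feature frames (the default extractor_input_dim), with arbitrary lengths chosen for illustration.

>>> import torch
>>> from torchaudio.prototype.models import emformer_hubert_base
>>> model = emformer_hubert_base()
>>> features = torch.rand(2, 400, 80)  # (batch, num_frames, extractor_input_dim)
>>> lengths = torch.tensor([400, 320])
>>> out, out_lengths = model(features, lengths)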

ConvEmformer

class torchaudio.prototype.models.ConvEmformer(input_dim: int, num_heads: int, ffn_dim: int, num_layers: int, segment_length: int, kernel_size: int, dropout: float = 0.0, ffn_activation: str = 'relu', left_context_length: int = 0, right_context_length: int = 0, max_memory_size: int = 0, weight_init_scale_strategy: Optional[str] = 'depthwise', tanh_on_mem: bool = False, negative_inf: float = -100000000.0, conv_activation: str = 'silu')[source]

Implements the convolution-augmented streaming transformer architecture introduced in Streaming Transformer Transducer based Speech Recognition Using Non-Causal Convolution [Shi et al., 2022].

Parameters:
  • input_dim (int) – input dimension.

  • num_heads (int) – number of attention heads in each ConvEmformer layer.

  • ffn_dim (int) – hidden layer dimension of each ConvEmformer layer’s feedforward network.

  • num_layers (int) – number of ConvEmformer layers to instantiate.

  • segment_length (int) – length of each input segment.

  • kernel_size (int) – size of kernel to use in convolution modules.

  • dropout (float, optional) – dropout probability. (Default: 0.0)

  • ffn_activation (str, optional) – activation function to use in feedforward networks. Must be one of (“relu”, “gelu”, “silu”). (Default: “relu”)

  • left_context_length (int, optional) – length of left context. (Default: 0)

  • right_context_length (int, optional) – length of right context. (Default: 0)

  • max_memory_size (int, optional) – maximum number of memory elements to use. (Default: 0)

  • weight_init_scale_strategy (str or None, optional) – per-layer weight initialization scaling strategy. Must be one of (“depthwise”, “constant”, None). (Default: “depthwise”)

  • tanh_on_mem (bool, optional) – if True, applies tanh to memory elements. (Default: False)

  • negative_inf (float, optional) – value to use for negative infinity in attention weights. (Default: -1e8)

  • conv_activation (str, optional) – activation function to use in convolution modules. Must be one of (“relu”, “gelu”, “silu”). (Default: “silu”)

Examples

>>> conv_emformer = ConvEmformer(80, 4, 1024, 12, 16, 8, right_context_length=4)
>>> input = torch.rand(10, 200, 80)  # (batch, num_frames, input_dim)
>>> lengths = torch.randint(1, 200, (10,))  # valid frames per batch element
>>> output, lengths = conv_emformer(input, lengths)  # non-streaming forward
>>> input = torch.rand(4, 20, 80)  # segment_length + right_context_length frames
>>> lengths = torch.ones(4) * 20
>>> output, lengths, states = conv_emformer.infer(input, lengths, None)  # streaming inference
forward(input: Tensor, lengths: Tensor) → Tuple[Tensor, Tensor]

Forward pass for training and non-streaming inference.

B: batch size; T: max number of input frames in batch; D: feature dimension of each frame.

Parameters:
  • input (torch.Tensor) – utterance frames right-padded with right context frames, with shape (B, T + right_context_length, D).

  • lengths (torch.Tensor) – with shape (B,) and i-th element representing number of valid utterance frames for i-th batch element in input.

Returns:
  • Tensor – output frames, with shape (B, T, D).

  • Tensor – output lengths, with shape (B,) and i-th element representing number of valid frames for i-th batch element in output frames.

Return type:

(Tensor, Tensor)

infer(input: Tensor, lengths: Tensor, states: Optional[List[List[Tensor]]] = None) → Tuple[Tensor, Tensor, List[List[Tensor]]]

Forward pass for streaming inference.

B: batch size; D: feature dimension of each frame.

Parameters:
  • input (torch.Tensor) – utterance frames right-padded with right context frames, with shape (B, segment_length + right_context_length, D).

  • lengths (torch.Tensor) – with shape (B,) and i-th element representing number of valid frames for i-th batch element in input.

  • states (List[List[torch.Tensor]] or None, optional) – list of lists of tensors representing internal state generated in preceding invocation of infer. (Default: None)

Returns:
  • Tensor – output frames, with shape (B, segment_length, D).

  • Tensor – output lengths, with shape (B,) and i-th element representing number of valid frames for i-th batch element in output frames.

  • List[List[Tensor]] – output states; list of lists of tensors representing internal state generated in current invocation of infer.

Return type:

(Tensor, Tensor, List[List[Tensor]])
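
For orientation, here is a hedged streaming sketch (not part of the library documentation): a padded utterance is processed one segment at a time, each chunk carrying segment_length + right_context_length frames, with the returned states threaded into the next infer call.

>>> import torch
>>> from torchaudio.prototype.models import ConvEmformer
>>> conv_emformer = ConvEmformer(80, 4, 1024, 12, 16, 8, right_context_length=4)
>>> segment_length, right_context_length = 16, 4
>>> utterance = torch.rand(1, 160, 80)  # 10 segments of 16 frames
>>> padded = torch.nn.functional.pad(utterance, (0, 0, 0, right_context_length))
>>> states, outputs = None, []
>>> for start in range(0, utterance.size(1), segment_length):
...     chunk = padded[:, start:start + segment_length + right_context_length]
...     chunk_lengths = torch.tensor([chunk.size(1)])
...     out, out_lengths, states = conv_emformer.infer(chunk, chunk_lengths, states)
...     outputs.append(out)
>>> streamed = torch.cat(outputs, dim=1)  # (1, 160, 80)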

conformer_wav2vec2_model

torchaudio.prototype.models.conformer_wav2vec2_model(extractor_input_dim: int, extractor_output_dim: int, extractor_stride: int, encoder_embed_dim: int, encoder_projection_dropout: float, encoder_num_layers: int, encoder_num_heads: int, encoder_ff_interm_features: int, encoder_depthwise_conv_kernel_size: Union[int, List[int]], encoder_dropout: float, encoder_convolution_first: bool, encoder_use_group_norm: bool) → Wav2Vec2Model[source]

Build a custom Conformer Wav2Vec2Model.

Parameters:
  • extractor_input_dim (int) – Input dimension of the features.

  • extractor_output_dim (int) – Output dimension after feature extraction.

  • extractor_stride (int) – Stride used in time reduction layer of feature extraction.

  • encoder_embed_dim (int) – The dimension of the embedding in the feature projection.

  • encoder_projection_dropout (float) – The dropout probability applied after the input feature is projected to embed_dim.

  • encoder_num_layers (int) – Number of Conformer layers in the encoder.

  • encoder_num_heads (int) – Number of heads in each Conformer layer.

  • encoder_ff_interm_features (int) – Hidden layer dimension of the feedforward network in each Conformer layer.

  • encoder_depthwise_conv_kernel_size (int or List[int]) – List of kernel sizes corresponding to each of the Conformer layers. If int is provided, all layers will have the same kernel size.

  • encoder_dropout (float) – Dropout probability in each Conformer layer.

  • encoder_convolution_first (bool) – Whether to apply the convolution module ahead of the attention module in each Conformer layer.

  • encoder_use_group_norm (bool) – Whether to use GroupNorm rather than BatchNorm1d in the convolution module in each Conformer layer.

Returns:

The resulting wav2vec2 model with a conformer encoder.

Return type:

Wav2Vec2Model
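
Example

A construction sketch with illustrative values only; the encoder dimensions are chosen to be internally consistent (encoder_embed_dim divisible by encoder_num_heads), not to reproduce a published configuration.

>>> from torchaudio.prototype.models import conformer_wav2vec2_model
>>> model = conformer_wav2vec2_model(
...     extractor_input_dim=64,
...     extractor_output_dim=256,
...     extractor_stride=4,
...     encoder_embed_dim=256,
...     encoder_projection_dropout=0.0,
...     encoder_num_layers=12,
...     encoder_num_heads=8,
...     encoder_ff_interm_features=1024,
...     encoder_depthwise_conv_kernel_size=31,
...     encoder_dropout=0.1,
...     encoder_convolution_first=True,
...     encoder_use_group_norm=True,
... )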

conformer_wav2vec2_base

torchaudio.prototype.models.conformer_wav2vec2_base(extractor_input_dim: int = 64, extractor_output_dim: int = 256, encoder_projection_dropout: float = 0.0) → Wav2Vec2Model[source]

Build Conformer Wav2Vec2 Model with the “small” architecture from Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [Srivastava et al., 2022].

Parameters:
  • extractor_input_dim (int, optional) – Input dimension of feature extractor. (Default: 64)

  • extractor_output_dim (int, optional) – Output dimension of feature extractor. (Default: 256)

  • encoder_projection_dropout (float, optional) – Dropout probability applied after feature projection. (Default: 0.0)

Returns:

The resulting wav2vec2 model with a conformer encoder and base configuration.

Return type:

Wav2Vec2Model
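
Example

A minimal usage sketch, assuming 64-dimensional input features (the default extractor_input_dim); tensor sizes are arbitrary and for illustration only.

>>> import torch
>>> from torchaudio.prototype.models import conformer_wav2vec2_base
>>> model = conformer_wav2vec2_base()
>>> features = torch.rand(2, 200, 64)  # (batch, num_frames, extractor_input_dim)
>>> lengths = torch.tensor([200, 160])
>>> out, out_lengths = model(features, lengths)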
