torchaudio.models.emformer_rnnt_model(*, input_dim: int, encoding_dim: int, num_symbols: int, segment_length: int, right_context_length: int, time_reduction_input_dim: int, time_reduction_stride: int, transformer_num_heads: int, transformer_ffn_dim: int, transformer_num_layers: int, transformer_dropout: float, transformer_activation: str, transformer_left_context_length: int, transformer_max_memory_size: int, transformer_weight_init_scale_strategy: str, transformer_tanh_on_mem: bool, symbol_embedding_dim: int, num_lstm_layers: int, lstm_layer_norm: bool, lstm_layer_norm_epsilon: float, lstm_dropout: float) → RNNT

Builds Emformer-based RNNT.


For non-streaming inference, the expectation is for transcribe to be called on input sequences right-concatenated with right_context_length frames.

For streaming inference, the expectation is for transcribe_streaming to be called on input chunks comprising segment_length frames right-concatenated with right_context_length frames.

Parameters:

  • input_dim (int) – dimension of input sequence frames passed to transcription network.

  • encoding_dim (int) – dimension of transcription- and prediction-network-generated encodings passed to joint network.

  • num_symbols (int) – cardinality of set of target tokens.

  • segment_length (int) – length of input segment expressed as number of frames.

  • right_context_length (int) – length of right context expressed as number of frames.

  • time_reduction_input_dim (int) – dimension to scale each element in input sequences to prior to applying time reduction block.

  • time_reduction_stride (int) – factor by which to reduce length of input sequence.

  • transformer_num_heads (int) – number of attention heads in each Emformer layer.

  • transformer_ffn_dim (int) – hidden layer dimension of each Emformer layer’s feedforward network.

  • transformer_num_layers (int) – number of Emformer layers to instantiate.

  • transformer_left_context_length (int) – length of left context considered by Emformer.

  • transformer_dropout (float) – Emformer dropout probability.

  • transformer_activation (str) – activation function to use in each Emformer layer’s feedforward network. Must be one of (“relu”, “gelu”, “silu”).

  • transformer_max_memory_size (int) – maximum number of memory elements to use.

  • transformer_weight_init_scale_strategy (str) – per-layer weight initialization scaling strategy. Must be one of (“depthwise”, “constant”, None).

  • transformer_tanh_on_mem (bool) – if True, applies tanh to memory elements.

  • symbol_embedding_dim (int) – dimension of each target token embedding.

  • num_lstm_layers (int) – number of LSTM layers to instantiate.

  • lstm_layer_norm (bool) – if True, enables layer normalization for LSTM layers.

  • lstm_layer_norm_epsilon (float) – value of epsilon to use in LSTM layer normalization layers.

  • lstm_dropout (float) – LSTM dropout probability.


Returns:

    Emformer RNN-T model.

Return type:

    RNNT
