Conformer
- class torchaudio.models.Conformer(input_dim: int, num_heads: int, ffn_dim: int, num_layers: int, depthwise_conv_kernel_size: int, dropout: float = 0.0, use_group_norm: bool = False, convolution_first: bool = False)
Conformer architecture introduced in "Conformer: Convolution-augmented Transformer for Speech Recognition" [Gulati et al., 2020].
- Parameters:
input_dim (int) – input dimension.
num_heads (int) – number of attention heads in each Conformer layer.
ffn_dim (int) – hidden layer dimension of feedforward networks.
num_layers (int) – number of Conformer layers to instantiate.
depthwise_conv_kernel_size (int) – kernel size of each Conformer layer’s depthwise convolution layer.
dropout (float, optional) – dropout probability. (Default: 0.0)
use_group_norm (bool, optional) – use GroupNorm rather than BatchNorm1d in the convolution module. (Default: False)
convolution_first (bool, optional) – apply the convolution module ahead of the attention module. (Default: False)
Examples
>>> import torch
>>> from torchaudio.models import Conformer
>>> input_dim = 80
>>> conformer = Conformer(
...     input_dim=input_dim,
...     num_heads=4,
...     ffn_dim=128,
...     num_layers=4,
...     depthwise_conv_kernel_size=31,
... )
>>> lengths = torch.randint(1, 400, (10,))  # (batch,)
>>> input = torch.rand(10, int(lengths.max()), input_dim)  # (batch, num_frames, input_dim)
>>> output, output_lengths = conformer(input, lengths)
forward
- Conformer.forward(input: Tensor, lengths: Tensor) → Tuple[Tensor, Tensor]
- Parameters:
input (torch.Tensor) – with shape (B, T, input_dim).
lengths (torch.Tensor) – with shape (B,) and i-th element representing the number of valid frames for the i-th batch element in input.
- Returns:
(torch.Tensor, torch.Tensor)
torch.Tensor – output frames, with shape (B, T, input_dim).
torch.Tensor – output lengths, with shape (B,) and i-th element representing the number of valid frames for the i-th batch element in the output frames.
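Because the output frames keep the padded time dimension T, downstream code typically uses the returned lengths to ignore padded positions. The sketch below shows one possible way to do that with mean pooling. `masked_mean` is a hypothetical helper, not part of torchaudio, and the `frames` tensor is a hand-built stand-in for the first return value so that the arithmetic is easy to follow. It assumes every length is at least 1.

```python
import torch


def masked_mean(frames: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Average over valid frames only.

    Hypothetical helper for illustration; not part of torchaudio.
    frames: (B, T, input_dim), lengths: (B,) with every entry >= 1.
    """
    num_frames = frames.size(1)
    # valid[b, t] is True when frame t of batch element b is a real (unpadded) frame.
    positions = torch.arange(num_frames, device=lengths.device).unsqueeze(0)  # (1, T)
    valid = positions < lengths.unsqueeze(1)                                  # (B, T)
    valid = valid.unsqueeze(-1).to(frames.dtype)                              # (B, T, 1)
    # Zero out padded frames, then divide by the number of valid frames.
    return (frames * valid).sum(dim=1) / lengths.unsqueeze(1).to(frames.dtype)


# Stand-in for the (output frames, output lengths) pair returned by forward.
lengths = torch.tensor([4, 2])   # (B,)
frames = torch.ones(2, 4, 3)     # (B, T, input_dim)
frames[1, 2:] = 100.0            # padded positions of the second element hold junk values

pooled = masked_mean(frames, lengths)  # (B, input_dim); junk values do not leak in
print(pooled)
```

A plain `frames.mean(dim=1)` would include the padded positions and skew the result for shorter sequences.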