torchaudio.models.hubert_pretrain_model
- torchaudio.models.hubert_pretrain_model(extractor_mode: str, extractor_conv_layer_config: Optional[List[Tuple[int, int, int]]], extractor_conv_bias: bool, encoder_embed_dim: int, encoder_projection_dropout: float, encoder_pos_conv_kernel: int, encoder_pos_conv_groups: int, encoder_num_layers: int, encoder_num_heads: int, encoder_attention_dropout: float, encoder_ff_interm_features: int, encoder_ff_interm_dropout: float, encoder_dropout: float, encoder_layer_norm_first: bool, encoder_layer_drop: float, mask_prob: float, mask_selection: str, mask_other: float, mask_length: int, no_mask_overlap: bool, mask_min_space: int, mask_channel_prob: float, mask_channel_selection: str, mask_channel_other: float, mask_channel_length: int, no_mask_channel_overlap: bool, mask_channel_min_space: int, skip_masked: bool, skip_nomask: bool, num_classes: int, final_dim: int, feature_grad_mult: Optional[float]) → HuBERTPretrainModel
Builds a custom HuBERTPretrainModel for training from scratch.
Note
The “feature extractor” below corresponds to ConvFeatureExtractionModel in the original fairseq implementation. This is referred to as the “(convolutional) feature encoder” in the wav2vec 2.0 [Baevski et al., 2020] paper.
The “encoder” below corresponds to TransformerEncoder, which is referred to as the “Transformer” in the paper.
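As a rough illustration of what the feature extractor does to the time axis, the sketch below computes the total stride, the receptive field, and the number of output frames for the default extractor_conv_layer_config listed under Parameters. It assumes unpadded 1-D convolutions with dilation 1 and a 16 kHz input. Both are assumptions made for illustration rather than statements from this page, and the helper names count_output_frames and receptive_field are hypothetical.

```python
from math import prod

# Default (output_channel, kernel_size, stride) triples from the parameter list.
DEFAULT_CONV_LAYERS = [
    (512, 10, 5),
    (512, 3, 2),
    (512, 3, 2),
    (512, 3, 2),
    (512, 3, 2),
    (512, 2, 2),
    (512, 2, 2),
]


def count_output_frames(num_samples, conv_layers=DEFAULT_CONV_LAYERS):
    """Frames remaining after all layers (assumes no padding, dilation 1)."""
    length = num_samples
    for _, kernel_size, stride in conv_layers:
        length = (length - kernel_size) // stride + 1
    return length


def receptive_field(conv_layers=DEFAULT_CONV_LAYERS):
    """Number of input samples that influence a single output frame."""
    field, jump = 1, 1
    for _, kernel_size, stride in conv_layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


total_stride = prod(stride for _, _, stride in DEFAULT_CONV_LAYERS)

print(total_stride)                # 320 samples between consecutive frames
print(receptive_field())           # 400 samples seen by each frame
print(count_output_frames(16000))  # 49 frames for 16000 input samples
```

Under these assumptions, one second of 16 kHz audio yields 49 frames, i.e. roughly one frame every 20 ms, each covering 25 ms of input.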
- Parameters:
extractor_mode (str) –
Operation mode of the feature extractor. Valid values are "group_norm" or "layer_norm". If "group_norm", then a single normalization is applied in the first convolution block. Otherwise, all the convolution blocks will have layer normalization.
This option corresponds to extractor_mode from fairseq.
extractor_conv_layer_config (list of integer tuples or None) –
Configuration of the convolution layers in the feature extractor, given as a list of convolution configurations, i.e. [(output_channel, kernel_size, stride), ...].
If None is provided, then the following default value is used.
[(512, 10, 5), (512, 3, 2), (512, 3, 2), (512, 3, 2), (512, 3, 2), (512, 2, 2), (512, 2, 2)]
This option corresponds to conv_feature_layers from fairseq.
extractor_conv_bias (bool) –
Whether to include a bias term in each convolution operation.
This option corresponds to conv_bias from fairseq.
encoder_embed_dim (int) –
The dimension of the embedding in the encoder.
This option corresponds to encoder_embed_dim from fairseq.
encoder_projection_dropout (float) –
The dropout probability applied after the input feature is projected to encoder_embed_dim.
This option corresponds to dropout_input from fairseq.
encoder_pos_conv_kernel (int) –
The kernel size of the convolutional positional embeddings.
This option corresponds to conv_pos from fairseq.
encoder_pos_conv_groups (int) –
The number of groups of the convolutional positional embeddings.
This option corresponds to conv_pos_groups from fairseq.
encoder_num_layers (int) –
The number of self attention layers in the transformer block.
This option corresponds to encoder_layers from fairseq.
encoder_num_heads (int) –
The number of heads in the self attention layers.
This option corresponds to encoder_attention_heads from fairseq.
encoder_attention_dropout (float) –
The dropout probability applied after softmax in the self-attention layer.
This option corresponds to attention_dropout from fairseq.
encoder_ff_interm_features (int) –
The dimension of the hidden features in the feed forward layer.
This option corresponds to encoder_ffn_embed_dim from fairseq.
encoder_ff_interm_dropout (float) –
The dropout probability applied in the feed forward layer.
This option corresponds to activation_dropout from fairseq.
encoder_dropout (float) –
The dropout probability applied at the end of the feed forward layer.
This option corresponds to dropout from fairseq.
encoder_layer_norm_first (bool) –
Controls the order of layer norm in the transformer layer and in each encoder layer. If True, in the transformer layer, layer norm is applied before features are fed to the encoder layers, and in each encoder layer, two layer norms are applied before and after self attention. If False, in the transformer layer, layer norm is applied after features are fed to the encoder layers, and in each encoder layer, two layer norms are applied after self attention, before and after feed forward.
This option corresponds to layer_norm_first from fairseq.
encoder_layer_drop (float) –
Probability to drop each encoder layer during training.
This option corresponds to layerdrop from fairseq.
mask_prob (float) –
Probability for each token to be chosen as the start of a span to be masked. This value is multiplied by the number of timesteps divided by the length of the mask span, so that approximately this fraction of all elements is masked. However, due to overlaps, the actual number will be smaller (unless no_mask_overlap is True).
This option corresponds to mask_prob from fairseq.
mask_selection (str) –
How to choose the mask length. Options: [static, uniform, normal, poisson].
This option corresponds to mask_selection from fairseq.
mask_other (float) –
Secondary mask argument (used for more complex distributions).
This option corresponds to mask_other from fairseq.
mask_length (int) –
The length of the mask.
This option corresponds to mask_length from fairseq.
no_mask_overlap (bool) –
Whether to prevent masks from overlapping. If True, masked spans do not overlap.
This option corresponds to no_mask_overlap from fairseq.
mask_min_space (int) –
Minimum space between spans (if no overlap is enabled).
This option corresponds to mask_min_space from fairseq.
mask_channel_prob (float) –
The probability of replacing a feature with 0.
This option corresponds to mask_channel_prob from fairseq.
mask_channel_selection (str) –
How to choose the mask length for channel masking. Options: [static, uniform, normal, poisson].
This option corresponds to mask_channel_selection from fairseq.
mask_channel_other (float) –
Secondary mask argument for channel masking (used for more complex distributions).
This option corresponds to mask_channel_other from fairseq.
mask_channel_length (int) –
The length of the mask for channel masking.
This option corresponds to mask_channel_length from fairseq.
no_mask_channel_overlap (bool) –
Whether to prevent channel masks from overlapping. If True, masked channel spans do not overlap.
This option corresponds to no_mask_channel_overlap from fairseq.
mask_channel_min_space (int) –
Minimum space between spans for channel masking (if no overlap is enabled).
This option corresponds to mask_channel_min_space from fairseq.
skip_masked (bool) –
If True, skip computing losses over masked frames.
This option corresponds to skip_masked from fairseq.
skip_nomask (bool) –
If True, skip computing losses over unmasked frames.
This option corresponds to skip_nomask from fairseq.
num_classes (int) – The number of classes in the labels.
final_dim (int) –
Project final representations and targets to final_dim.
This option corresponds to final_dim from fairseq.
feature_grad_mult (float or None) –
The factor by which to scale the gradients of the convolutional feature extraction layers. The scale factor does not affect the forward pass.
This option corresponds to feature_grad_mult from fairseq.
- Returns:
The resulting model.
- Return type:
HuBERTPretrainModel
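Because the function takes 32 arguments and this page lists no defaults for them, it can help to collect them in a dictionary and pass them by keyword. The sketch below does that. Every value is an illustrative assumption loosely modeled on a Base-sized HuBERT setup, not a recommendation from this page, and num_classes=100 is an arbitrary placeholder for the number of target labels. The import is guarded so the snippet also runs where torchaudio is not installed.

```python
# Illustrative values only; adjust to your own data and compute budget.
config = dict(
    extractor_mode="group_norm",
    extractor_conv_layer_config=None,  # None selects the default conv stack
    extractor_conv_bias=False,
    encoder_embed_dim=768,
    encoder_projection_dropout=0.1,
    encoder_pos_conv_kernel=128,
    encoder_pos_conv_groups=16,
    encoder_num_layers=12,
    encoder_num_heads=12,
    encoder_attention_dropout=0.1,
    encoder_ff_interm_features=3072,
    encoder_ff_interm_dropout=0.0,
    encoder_dropout=0.1,
    encoder_layer_norm_first=False,
    encoder_layer_drop=0.05,
    mask_prob=0.8,
    mask_selection="static",
    mask_other=0.0,
    mask_length=10,
    no_mask_overlap=False,
    mask_min_space=1,
    mask_channel_prob=0.0,
    mask_channel_selection="static",
    mask_channel_other=0.0,
    mask_channel_length=10,
    no_mask_channel_overlap=False,
    mask_channel_min_space=1,
    skip_masked=False,
    skip_nomask=False,
    num_classes=100,  # placeholder: number of distinct target labels
    final_dim=256,
    feature_grad_mult=0.1,
)

try:
    import torchaudio

    build = getattr(torchaudio.models, "hubert_pretrain_model", None)
except (ImportError, OSError):
    build = None  # torchaudio unavailable; only the config is constructed

# Build the model only when the factory function is available.
model = build(**config) if build is not None else None
print(type(model).__name__)
```

Passing the arguments by keyword keeps the call readable and makes it harder to swap two of the many float or bool parameters by accident.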