Wav2Vec2Model

class torchaudio.models.Wav2Vec2Model(feature_extractor: Module, encoder: Module, aux: Optional[Module] = None)

Acoustic model used in wav2vec 2.0 [Baevski et al., 2020].

Note

To build the model, please use one of the factory functions.

Parameters:
  • feature_extractor (torch.nn.Module) – Feature extractor that extracts feature vectors from a raw audio Tensor.

  • encoder (torch.nn.Module) – Encoder that converts the audio features into a sequence of probability distributions (in negative log-likelihood) over labels.

  • aux (torch.nn.Module or None, optional) – Auxiliary module. If provided, the output from encoder is passed to this module.

Tutorials using Wav2Vec2Model:
Speech Recognition with Wav2Vec2

ASR Inference with CTC Decoder

Forced Alignment with Wav2Vec2


Methods

forward

Wav2Vec2Model.forward(waveforms: Tensor, lengths: Optional[Tensor] = None) → Tuple[Tensor, Optional[Tensor]]

Compute the sequence of probability distributions over labels.

Parameters:
  • waveforms (Tensor) – Audio tensor of shape (batch, frames).

  • lengths (Tensor or None, optional) – Valid length of each audio in the batch. Shape: (batch, ). When waveforms contains audio of different durations, providing lengths lets the model compute the corresponding valid output lengths and apply the proper mask in the transformer attention layers. If None, all audio in waveforms is assumed to have valid length. Default: None.

Returns:

Tensor

The sequences of probability distributions (in logit) over labels. Shape: (batch, frames, num labels).

Tensor or None

If the lengths argument was provided, a Tensor of shape (batch, ) is returned. It indicates the valid length along the time axis of the output Tensor.

Return type:

(Tensor, Optional[Tensor])

extract_features

Wav2Vec2Model.extract_features(waveforms: Tensor, lengths: Optional[Tensor] = None, num_layers: Optional[int] = None) → Tuple[List[Tensor], Optional[Tensor]]

Extract feature vectors from raw waveforms.

This returns the list of outputs from the intermediate layers of the transformer block in the encoder.

Parameters:
  • waveforms (Tensor) – Audio tensor of shape (batch, frames).

  • lengths (Tensor or None, optional) – Valid length of each audio in the batch. Shape: (batch, ). When waveforms contains audio of different durations, providing lengths lets the model compute the corresponding valid output lengths and apply the proper mask in the transformer attention layers. If None, the entire audio waveform length is assumed to be valid.

  • num_layers (int or None, optional) – If given, limits the number of intermediate layers to go through. Providing 1 stops the computation after one intermediate layer. If not given, the outputs from all the intermediate layers are returned.

Returns:

List of Tensors

Features from requested layers. Each Tensor is of shape (batch, time frame, feature dimension).

Tensor or None

If the lengths argument was provided, a Tensor of shape (batch, ) is returned. It indicates the valid length along the time axis of each feature Tensor.

Return type:

(List[Tensor], Optional[Tensor])

Factory Functions

wav2vec2_model

Builds a custom Wav2Vec2Model.

wav2vec2_base

Builds "base" Wav2Vec2Model from wav2vec 2.0 [Baevski et al., 2020].

wav2vec2_large

Builds "large" Wav2Vec2Model from wav2vec 2.0 [Baevski et al., 2020].

wav2vec2_large_lv60k

Builds "large lv-60k" Wav2Vec2Model from wav2vec 2.0 [Baevski et al., 2020].

wav2vec2_xlsr_300m

Builds XLS-R model [Babu et al., 2021] with 300 million parameters.

wav2vec2_xlsr_1b

Builds XLS-R model [Babu et al., 2021] with 1 billion parameters.

wav2vec2_xlsr_2b

Builds XLS-R model [Babu et al., 2021] with 2 billion parameters.

hubert_base

Builds "base" HuBERT from HuBERT [Hsu et al., 2021].

hubert_large

Builds "large" HuBERT from HuBERT [Hsu et al., 2021].

hubert_xlarge

Builds "extra large" HuBERT from HuBERT [Hsu et al., 2021].

wavlm_model

Builds a custom WavLM model [Chen et al., 2022].

wavlm_base

Builds "base" WavLM model [Chen et al., 2022].

wavlm_large

Builds "large" WavLM model [Chen et al., 2022].

Prototype Factory Functions

emformer_hubert_model

Build a custom Emformer HuBERT model.

emformer_hubert_base

Build Emformer HuBERT Model with 20 Emformer layers.

conformer_wav2vec2_model

Build a custom Conformer Wav2Vec2Model.

conformer_wav2vec2_base

Build Conformer Wav2Vec2 Model with "small" architecture from Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [Srivastava et al., 2022].

Utility Functions

import_fairseq_model

Builds Wav2Vec2Model from the corresponding model object of fairseq.

import_huggingface_model

Builds Wav2Vec2Model from the corresponding model object of Transformers.
