Wav2Vec2Model
- class torchaudio.models.Wav2Vec2Model(feature_extractor: Module, encoder: Module, aux: Optional[Module] = None)
Acoustic model used in wav2vec 2.0 [Baevski et al., 2020].
Note
To build the model, please use one of the factory functions.
See also
- torchaudio.pipelines.Wav2Vec2Bundle: Pretrained models (without fine-tuning).
- torchaudio.pipelines.Wav2Vec2ASRBundle: ASR pipelines with pretrained models.
- Parameters:
feature_extractor (torch.nn.Module) – Feature extractor that extracts feature vectors from raw audio Tensor.
encoder (torch.nn.Module) – Encoder that converts the audio features into a sequence of probability distributions (in negative log-likelihood) over labels.
aux (torch.nn.Module or None, optional) – Auxiliary module. If provided, the output from encoder is passed to this module.
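A minimal construction sketch following the note above: pretrained weights come from a pipeline bundle, while the factory functions build the bare architecture. The bundle name WAV2VEC2_ASR_BASE_960H and the aux_num_out value are illustrative choices, not requirements of this class.

```python
import torchaudio

# Pretrained model through a pipeline bundle (see the "See also" entries above).
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

# Untrained model of the same architecture through a factory function;
# aux_num_out attaches a linear readout sized for the bundle's label set.
scratch = torchaudio.models.wav2vec2_base(aux_num_out=len(bundle.get_labels()))
```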
Tutorials using Wav2Vec2Model:
- Speech Recognition with Wav2Vec2
- ASR Inference with CTC Decoder
- Forced Alignment with Wav2Vec2
Methods
forward
- Wav2Vec2Model.forward(waveforms: Tensor, lengths: Optional[Tensor] = None) → Tuple[Tensor, Optional[Tensor]]
Compute the sequence of probability distributions over labels.
- Parameters:
waveforms (Tensor) – Audio tensor of shape (batch, frames).
lengths (Tensor or None, optional) – Indicates the valid length of each audio in the batch. Shape: (batch, ). When waveforms contains audios with different durations, providing lengths lets the model compute the corresponding valid output lengths and apply the proper mask in the transformer attention layers. If None, all the audio in waveforms is assumed to have valid length. Default: None.
- Returns:
- Tensor
The sequence of probability distributions (in logits) over labels. Shape: (batch, frames, num labels).
- Tensor or None
If the lengths argument was provided, a Tensor of shape (batch, ) is returned. It indicates the valid length along the time axis of the output Tensor.
- Return type:
(Tensor, Optional[Tensor])
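A usage sketch for forward(), assuming a pretrained model from torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H; the zero-filled waveforms stand in for real 16 kHz audio and the durations are arbitrary.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

# Two mono clips of different durations, zero-padded into one (batch, frames) Tensor.
waveforms = torch.zeros(2, 16000)
lengths = torch.tensor([16000, 8000])

with torch.inference_mode():
    emissions, out_lengths = model(waveforms, lengths)

# emissions: (batch, frames, num labels) logits; out_lengths: valid output frames per clip.
print(emissions.shape, out_lengths)
```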
extract_features
- Wav2Vec2Model.extract_features(waveforms: Tensor, lengths: Optional[Tensor] = None, num_layers: Optional[int] = None) → Tuple[List[Tensor], Optional[Tensor]]
Extract feature vectors from raw waveforms.
This returns the list of outputs from the intermediate layers of the transformer block in the encoder.
- Parameters:
waveforms (Tensor) – Audio tensor of shape (batch, frames).
lengths (Tensor or None, optional) – Indicates the valid length of each audio in the batch. Shape: (batch, ). When waveforms contains audios with different durations, providing lengths lets the model compute the corresponding valid output lengths and apply the proper mask in the transformer attention layers. If None, the entire audio waveform length is assumed to be valid.
num_layers (int or None, optional) – If given, limits the number of intermediate layers to go through. Providing 1 stops the computation after going through one intermediate layer. If not given, the outputs from all the intermediate layers are returned.
- Returns:
- List of Tensors
Features from requested layers. Each Tensor is of shape: (batch, time frame, feature dimension)
- Tensor or None
If the lengths argument was provided, a Tensor of shape (batch, ) is returned. It indicates the valid length along the time axis of each feature Tensor.
- Return type:
(List[Tensor], Optional[Tensor])
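A sketch of extract_features(), assuming a pretraining-only bundle (torchaudio.pipelines.WAV2VEC2_BASE); the inputs are placeholders and num_layers=4 is an arbitrary choice.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveforms = torch.zeros(2, 16000)
lengths = torch.tensor([16000, 12000])

with torch.inference_mode():
    # Stop the computation after the first four transformer layers.
    features, out_lengths = model.extract_features(waveforms, lengths, num_layers=4)

print(len(features))       # 4 Tensors, one per traversed layer
print(features[0].shape)   # (batch, time frame, feature dimension)
```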
Factory Functions
- wav2vec2_model: Builds custom Wav2Vec2Model.
- wav2vec2_base: Builds "base" Wav2Vec2Model.
- wav2vec2_large: Builds "large" Wav2Vec2Model.
- wav2vec2_large_lv60k: Builds "large lv-60k" Wav2Vec2Model.
- hubert_base: Builds "base" HuBERT model.
- hubert_large: Builds "large" HuBERT model.
- hubert_xlarge: Builds "extra large" HuBERT model.
Utility Functions
- import_fairseq_model: Builds Wav2Vec2Model from the corresponding model object of fairseq.
- import_huggingface_model: Builds Wav2Vec2Model from the corresponding model object of Hugging Face Transformers.
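A conversion sketch for the utility functions; it assumes the transformers package is installed and uses the facebook/wav2vec2-base-960h checkpoint purely as an example.

```python
from torchaudio.models.wav2vec2.utils import import_huggingface_model
from transformers import Wav2Vec2ForCTC

# Load a Hugging Face Transformers checkpoint and convert it into a
# torchaudio Wav2Vec2Model with equivalent weights.
original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original).eval()
```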