torchaudio.pipelines

The torchaudio.pipelines module packages pre-trained models with support functions and meta-data into simple APIs tailored to perform specific tasks.

When using pre-trained models to perform a task, in addition to instantiating the model with pre-trained weights, the client code also needs to build pipelines for feature extractions and post processing in the same way they were done during the training. This requires to carrying over information used during the training, such as the type of transforms and the their parameters (for example, sampling rate the number of FFT bins).

To make this information tied to a pre-trained model and easily accessible, torchaudio.pipelines module uses the concept of a Bundle class, which defines a set of APIs to instantiate pipelines, and the interface of the pipelines.

The following figure illustrates this.

https://download.pytorch.org/torchaudio/doc-assets/pipelines-intro.png

A pre-trained model and associated pipelines are expressed as an instance of Bundle. Different instances of same Bundle share the interface, but their implementations are not constrained to be of same types. For example, SourceSeparationBundle defines the interface for performing source separation, but its instance CONVTASNET_BASE_LIBRI2MIX instantiates a model of ConvTasNet while HDEMUCS_HIGH_MUSDB instantiates a model of HDemucs. Still, because they share the same interface, the usage is the same.

Note

Under the hood, the implementations of Bundle use components from other torchaudio modules, such as torchaudio.models and torchaudio.transforms, or even third party libraries like SentencPiece and DeepPhonemizer. But this implementation detail is abstracted away from library users.

RNN-T Streaming/Non-Streaming ASR

Interface

RNNTBundle defines ASR pipelines and consists of three steps: feature extraction, inference, and de-tokenization.

https://download.pytorch.org/torchaudio/doc-assets/pipelines-rnntbundle.png

`RNNTBundle`	Dataclass that bundles components for performing automatic speech recognition (ASR, speech-to-text) inference with an RNN-T model.
`RNNTBundle.FeatureExtractor`	Interface of the feature extraction part of RNN-T pipeline
`RNNTBundle.TokenProcessor`	Interface of the token processor part of RNN-T pipeline

Tutorials using RNNTBundle

Device ASR with Emformer RNN-T

Online ASR with Emformer RNN-T

Pretrained Models

EMFORMER_RNNT_BASE_LIBRISPEECH

ASR pipeline based on Emformer-RNNT, pretrained on LibriSpeech dataset [Panayotov et al., 2015], capable of performing both streaming and non-streaming inference.

wav2vec 2.0 / HuBERT / WavLM - SSL

Interface

Wav2Vec2Bundle instantiates models that generate acoustic features that can be used for downstream inference and fine-tuning.

Wav2Vec2Bundle

Data class that bundles associated information to use pretrained Wav2Vec2Model.

Pretrained Models

`WAV2VEC2_BASE`	Wav2vec 2.0 model ("base" architecture), pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned.
`WAV2VEC2_LARGE`	Wav2vec 2.0 model ("large" architecture), pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned.
`WAV2VEC2_LARGE_LV60K`	Wav2vec 2.0 model ("large-lv60k" architecture), pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [Kahn et al., 2020], not fine-tuned.
`WAV2VEC2_XLSR53`	Wav2vec 2.0 model ("base" architecture), pre-trained on 56,000 hours of unlabeled audio from multiple datasets ( Multilingual LibriSpeech [Pratap et al., 2020], CommonVoice [Ardila et al., 2020] and BABEL [Gales et al., 2014]), not fine-tuned.
`WAV2VEC2_XLSR_300M`	XLS-R model with 300 million parameters, pre-trained on 436,000 hours of unlabeled audio from multiple datasets ( Multilingual LibriSpeech [Pratap et al., 2020], CommonVoice [Ardila et al., 2020], VoxLingua107 [Valk and Alumäe, 2021], BABEL [Gales et al., 2014], and VoxPopuli [Wang et al., 2021]) in 128 languages, not fine-tuned.
`WAV2VEC2_XLSR_1B`	XLS-R model with 1 billion parameters, pre-trained on 436,000 hours of unlabeled audio from multiple datasets ( Multilingual LibriSpeech [Pratap et al., 2020], CommonVoice [Ardila et al., 2020], VoxLingua107 [Valk and Alumäe, 2021], BABEL [Gales et al., 2014], and VoxPopuli [Wang et al., 2021]) in 128 languages, not fine-tuned.
`WAV2VEC2_XLSR_2B`	XLS-R model with 2 billion parameters, pre-trained on 436,000 hours of unlabeled audio from multiple datasets ( Multilingual LibriSpeech [Pratap et al., 2020], CommonVoice [Ardila et al., 2020], VoxLingua107 [Valk and Alumäe, 2021], BABEL [Gales et al., 2014], and VoxPopuli [Wang et al., 2021]) in 128 languages, not fine-tuned.
`HUBERT_BASE`	HuBERT model ("base" architecture), pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), not fine-tuned.
`HUBERT_LARGE`	HuBERT model ("large" architecture), pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [Kahn et al., 2020], not fine-tuned.
`HUBERT_XLARGE`	HuBERT model ("extra large" architecture), pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [Kahn et al., 2020], not fine-tuned.
`WAVLM_BASE`	WavLM Base model ("base" architecture), pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [Panayotov et al., 2015], not fine-tuned.
`WAVLM_BASE_PLUS`	WavLM Base+ model ("base" architecture), pre-trained on 60,000 hours of Libri-Light dataset [Kahn et al., 2020], 10,000 hours of GigaSpeech [Chen et al., 2021], and 24,000 hours of VoxPopuli [Wang et al., 2021], not fine-tuned.
`WAVLM_LARGE`	WavLM Large model ("large" architecture), pre-trained on 60,000 hours of Libri-Light dataset [Kahn et al., 2020], 10,000 hours of GigaSpeech [Chen et al., 2021], and 24,000 hours of VoxPopuli [Wang et al., 2021], not fine-tuned.

wav2vec 2.0 / HuBERT - Fine-tuned ASR

Interface

Wav2Vec2ASRBundle instantiates models that generate probability distribution over pre-defined labels, that can be used for ASR.

Wav2Vec2ASRBundle

Data class that bundles associated information to use pretrained Wav2Vec2Model.

Tutorials using Wav2Vec2ASRBundle

Speech Recognition with Wav2Vec2

ASR Inference with CTC Decoder

Forced Alignment with Wav2Vec2

Pretrained Models

`WAV2VEC2_ASR_BASE_10M`	Wav2vec 2.0 model ("base" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on 10 minutes of transcribed audio from Libri-Light dataset [Kahn et al., 2020] ("train-10min" subset).
`WAV2VEC2_ASR_BASE_100H`	Wav2vec 2.0 model ("base" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on 100 hours of transcribed audio from "train-clean-100" subset.
`WAV2VEC2_ASR_BASE_960H`	Wav2vec 2.0 model ("base" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on the same audio with the corresponding transcripts.
`WAV2VEC2_ASR_LARGE_10M`	Wav2vec 2.0 model ("large" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on 10 minutes of transcribed audio from Libri-Light dataset [Kahn et al., 2020] ("train-10min" subset).
`WAV2VEC2_ASR_LARGE_100H`	Wav2vec 2.0 model ("large" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on 100 hours of transcribed audio from the same dataset ("train-clean-100" subset).
`WAV2VEC2_ASR_LARGE_960H`	Wav2vec 2.0 model ("large" architecture with an extra linear module), pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500"), and fine-tuned for ASR on the same audio with the corresponding transcripts.
`WAV2VEC2_ASR_LARGE_LV60K_10M`	Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module), pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [Kahn et al., 2020], and fine-tuned for ASR on 10 minutes of transcribed audio from the same dataset ("train-10min" subset).
`WAV2VEC2_ASR_LARGE_LV60K_100H`	Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module), pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [Kahn et al., 2020], and fine-tuned for ASR on 100 hours of transcribed audio from LibriSpeech dataset [Panayotov et al., 2015] ("train-clean-100" subset).
`WAV2VEC2_ASR_LARGE_LV60K_960H`	Wav2vec 2.0 model ("large-lv60k" architecture with an extra linear module), pre-trained on 60,000 hours of unlabeled audio from Libri-Light [Kahn et al., 2020] dataset, and fine-tuned for ASR on 960 hours of transcribed audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500").
`VOXPOPULI_ASR_BASE_10K_DE`	wav2vec 2.0 model ("base" architecture), pre-trained on 10k hours of unlabeled audio from VoxPopuli dataset [Wang et al., 2021] ("10k" subset, consisting of 23 languages), and fine-tuned for ASR on 282 hours of transcribed audio from "de" subset.
`VOXPOPULI_ASR_BASE_10K_EN`	wav2vec 2.0 model ("base" architecture), pre-trained on 10k hours of unlabeled audio from VoxPopuli dataset [Wang et al., 2021] ("10k" subset, consisting of 23 languages), and fine-tuned for ASR on 543 hours of transcribed audio from "en" subset.
`VOXPOPULI_ASR_BASE_10K_ES`	wav2vec 2.0 model ("base" architecture), pre-trained on 10k hours of unlabeled audio from VoxPopuli dataset [Wang et al., 2021] ("10k" subset, consisting of 23 languages), and fine-tuned for ASR on 166 hours of transcribed audio from "es" subset.
`VOXPOPULI_ASR_BASE_10K_FR`	wav2vec 2.0 model ("base" architecture), pre-trained on 10k hours of unlabeled audio from VoxPopuli dataset [Wang et al., 2021] ("10k" subset, consisting of 23 languages), and fine-tuned for ASR on 211 hours of transcribed audio from "fr" subset.
`VOXPOPULI_ASR_BASE_10K_IT`	wav2vec 2.0 model ("base" architecture), pre-trained on 10k hours of unlabeled audio from VoxPopuli dataset [Wang et al., 2021] ("10k" subset, consisting of 23 languages), and fine-tuned for ASR on 91 hours of transcribed audio from "it" subset.
`HUBERT_ASR_LARGE`	HuBERT model ("large" architecture), pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [Kahn et al., 2020], and fine-tuned for ASR on 960 hours of transcribed audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500").
`HUBERT_ASR_XLARGE`	HuBERT model ("extra large" architecture), pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [Kahn et al., 2020], and fine-tuned for ASR on 960 hours of transcribed audio from LibriSpeech dataset [Panayotov et al., 2015] (the combination of "train-clean-100", "train-clean-360", and "train-other-500").

Tacotron2 Text-To-Speech

Tacotron2TTSBundle defines text-to-speech pipelines and consists of three steps: tokenization, spectrogram generation and vocoder. The spectrogram generation is based on Tacotron2 model.

https://download.pytorch.org/torchaudio/doc-assets/pipelines-tacotron2bundle.png

TextProcessor can be rule-based tokenization in the case of characters, or it can be a neural-netowrk-based G2P model that generates sequence of phonemes from input text.

Similarly Vocoder can be an algorithm without learning parameters, like Griffin-Lim, or a neural-network-based model like Waveglow.

Interface

`Tacotron2TTSBundle`	Data class that bundles associated information to use pretrained Tacotron2 and vocoder.
`Tacotron2TTSBundle.TextProcessor`	Interface of the text processing part of Tacotron2TTS pipeline
`Tacotron2TTSBundle.Vocoder`	Interface of the vocoder part of Tacotron2TTS pipeline

Tutorials using Tacotron2TTSBundle

Text-to-Speech with Tacotron2

Pretrained Models

`TACOTRON2_WAVERNN_PHONE_LJSPEECH`	Phoneme-based TTS pipeline with `Tacotron2` trained on LJSpeech [Ito and Johnson, 2017] for 1,500 epochs, and `WaveRNN` vocoder trained on 8 bits depth waveform of LJSpeech [Ito and Johnson, 2017] for 10,000 epochs.
`TACOTRON2_WAVERNN_CHAR_LJSPEECH`	Character-based TTS pipeline with `Tacotron2` trained on LJSpeech [Ito and Johnson, 2017] for 1,500 epochs and `WaveRNN` vocoder trained on 8 bits depth waveform of LJSpeech [Ito and Johnson, 2017] for 10,000 epochs.
`TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH`	Phoneme-based TTS pipeline with `Tacotron2` trained on LJSpeech [Ito and Johnson, 2017] for 1,500 epochs and `GriffinLim` as vocoder.
`TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH`	Character-based TTS pipeline with `Tacotron2` trained on LJSpeech [Ito and Johnson, 2017] for 1,500 epochs, and `GriffinLim` as vocoder.

Source Separation

Interface

SourceSeparationBundle instantiates source separation models which take single channel audio and generates multi-channel audio.

SourceSeparationBundle

Dataclass that bundles components for performing source separation.

Tutorials using SourceSeparationBundle

Music Source Separation with Hybrid Demucs

Pretrained Models

`CONVTASNET_BASE_LIBRI2MIX`	Pre-trained Source Separation pipeline with ConvTasNet [Luo and Mesgarani, 2019] trained on Libri2Mix dataset [Cosentino et al., 2020].
`HDEMUCS_HIGH_MUSDB_PLUS`	Pre-trained music source separation pipeline with Hybrid Demucs [Défossez, 2021] trained on both training and test sets of MUSDB-HQ [Rafii et al., 2019] and an additional 150 extra songs from an internal database that was specifically produced for Meta.
`HDEMUCS_HIGH_MUSDB`	Pre-trained music source separation pipeline with Hybrid Demucs [Défossez, 2021] trained on the training set of MUSDB-HQ [Rafii et al., 2019].

torchaudio.pipelines

RNN-T Streaming/Non-Streaming ASR

Interface

Pretrained Models

wav2vec 2.0 / HuBERT / WavLM - SSL

Interface

Pretrained Models

wav2vec 2.0 / HuBERT - Fine-tuned ASR

Interface

Pretrained Models

Tacotron2 Text-To-Speech

Interface

Pretrained Models

Source Separation

Interface

Pretrained Models

Docs

Tutorials

Resources