
torchaudio.pipelines

The pipelines subpackage contains APIs to access models with pretrained weights, along with the information and helper functions associated with those weights.
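All bundles follow the same pattern: pick a bundle object, query its metadata (such as sample_rate), and call its get_* methods to instantiate pretrained components. The following is a minimal sketch of that pattern; the choice of WAV2VEC2_ASR_BASE_960H is only an example.

>>> import torchaudio
>>>
>>> # Pick a bundle. Each bundle knows how to build its model and
>>> # where to fetch the pretrained weights.
>>> bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
>>>
>>> # Bundle metadata is available without downloading anything.
>>> bundle.sample_rate
16000.0
>>>
>>> # Instantiating the model downloads and caches the weights on first use.
>>> model = bundle.get_model()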

wav2vec 2.0 / HuBERT - Representation Learning

class torchaudio.pipelines.Wav2Vec2Bundle[source]

Data class that bundles the information needed to use a pretrained Wav2Vec2Model.

This class provides interfaces for instantiating the pretrained model along with the information necessary to retrieve pretrained weights and additional data to be used with the model.

The torchaudio library instantiates objects of this class, each of which represents a different pretrained model. Client code should access pretrained models via these instances.

Please see below for the usage and the available values.

Example - Feature Extraction
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_BASE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 360M/360M [00:06<00:00, 60.6MB/s]
>>>
>>> # Resample audio to the expected sampling rate
>>> waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
>>>
>>> # Extract acoustic features
>>> features, _ = model.extract_features(waveform)
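In this example, features is expected to be a list of tensors, one per transformer layer, which is how Wav2Vec2Model.extract_features exposes intermediate representations. A quick way to inspect them, continuing the example above:

>>> # `features` is a list of tensors, one per transformer layer,
>>> # each of shape (batch, time frame, feature dimension).
>>> for i, feats in enumerate(features):
...     print(i, feats.shape)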
get_model(self, *, dl_kwargs=None) → torchaudio.models.Wav2Vec2Model [source]

Construct the model and load the pretrained weight.

The weight file is downloaded from the internet and cached with torch.hub.load_state_dict_from_url()

Parameters

dl_kwargs (dictionary of keyword arguments) – Passed to torch.hub.load_state_dict_from_url().
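Any keyword accepted by torch.hub.load_state_dict_from_url() can be forwarded through dl_kwargs. For example (a sketch; the cache directory is arbitrary), to disable the progress bar and redirect the download cache:

>>> model = bundle.get_model(
...     dl_kwargs={"progress": False, "model_dir": "./pretrained"},
... )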

property sample_rate

Sample rate of the audio that the model is trained on.

Type

float
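A common guard before feeding audio to the model is to compare the source rate against this property and resample when they differ. A minimal sketch, assuming waveform and sample_rate come from torchaudio.load() and "speech.wav" is a placeholder path:

>>> waveform, sample_rate = torchaudio.load("speech.wav")
>>> if sample_rate != bundle.sample_rate:
...     waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)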

WAV2VEC2_BASE

torchaudio.pipelines.WAV2VEC2_BASE

wav2vec 2.0 model with “Base” configuration.

Pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”). Not fine-tuned.

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2Bundle() for the usage.

WAV2VEC2_LARGE

torchaudio.pipelines.WAV2VEC2_LARGE

wav2vec 2.0 model with “Large” configuration.

Pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”). Not fine-tuned.

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2Bundle() for the usage.

WAV2VEC2_LARGE_LV60K

torchaudio.pipelines.WAV2VEC2_LARGE_LV60K

wav2vec 2.0 model with “Large LV-60k” configuration.

Pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [3]. Not fine-tuned.

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2Bundle() for the usage.

WAV2VEC2_XLSR53

torchaudio.pipelines.WAV2VEC2_XLSR53

wav2vec 2.0 model with “Large” configuration.

Trained on 56,000 hours of unlabeled audio from multiple datasets (Multilingual LibriSpeech [4], CommonVoice [5] and BABEL [6]). Not fine-tuned.

Originally published by the authors of Unsupervised Cross-lingual Representation Learning for Speech Recognition [7] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2Bundle() for the usage.

HUBERT_BASE

torchaudio.pipelines.HUBERT_BASE

HuBERT model with “Base” configuration.

Pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”). Not fine-tuned.

Originally published by the authors of HuBERT [8] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2Bundle() for the usage.

HUBERT_LARGE

torchaudio.pipelines.HUBERT_LARGE

HuBERT model with “Large” configuration.

Pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [3]. Not fine-tuned.

Originally published by the authors of HuBERT [8] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2Bundle() for the usage.

HUBERT_XLARGE

torchaudio.pipelines.HUBERT_XLARGE

HuBERT model with “Extra Large” configuration.

Pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [3]. Not fine-tuned.

Originally published by the authors of HuBERT [8] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2Bundle() for the usage.

wav2vec 2.0 / HuBERT - Fine-tuned ASR

Wav2Vec2ASRBundle

class torchaudio.pipelines.Wav2Vec2ASRBundle[source]

Data class that bundles the information needed to use a pretrained Wav2Vec2Model.

This class provides interfaces for instantiating the pretrained model along with the information necessary to retrieve pretrained weights and additional data to be used with the model.

The torchaudio library instantiates objects of this class, each of which represents a different pretrained model. Client code should access pretrained models via these instances.

Please see below for the usage and the available values.

Example - ASR
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>>
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Resample audio to the expected sampling rate
>>> waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
>>>
>>> # Infer the label probability distribution
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to decoder
>>> # `ctc_decode` is for illustration purpose only
>>> transcripts = ctc_decode(emissions, labels)
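The bundles do not include a decoder; ctc_decode above is only a placeholder. A minimal greedy (best-path) CTC decoder could look like the sketch below, which assumes that index 0 ('<s>') acts as the CTC blank for these fine-tuned bundles and that '|' marks word boundaries, as in the label set shown above.

>>> import torch
>>>
>>> def greedy_ctc_decode(emission, labels, blank=0):
...     # Best-path decoding: argmax per frame, merge repeats, drop blanks.
...     # Assumes index 0 ('<s>') acts as the CTC blank token.
...     indices = torch.argmax(emission, dim=-1)      # (frame,)
...     indices = torch.unique_consecutive(indices)   # collapse repeated frames
...     chars = [labels[int(i)] for i in indices if int(i) != blank]
...     return "".join(chars).replace("|", " ")
...
>>> # `emissions` has shape (batch, frame, num_labels); decode the first sample.
>>> transcript = greedy_ctc_decode(emissions[0], labels)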
get_model(self, *, dl_kwargs=None) → torchaudio.models.Wav2Vec2Model

Construct the model and load the pretrained weight.

The weight file is downloaded from the internet and cached with torch.hub.load_state_dict_from_url()

Parameters

dl_kwargs (dictionary of keyword arguments) – Passed to torch.hub.load_state_dict_from_url().

get_labels(*, bos: str = '<s>', pad: str = '<pad>', eos: str = '</s>', unk: str = '<unk>') → Tuple[str] [source]

The output class labels (only applicable to fine-tuned bundles)

The first four tokens are BOS, padding, EOS and UNK tokens and they can be customized.

Parameters
  • bos (str, optional) – Beginning of sentence token. (default: '<s>')

  • pad (str, optional) – Padding token. (default: '<pad>')

  • eos (str, optional) – End of sentence token. (default: '</s>')

  • unk (str, optional) – Token for unknown class. (default: '<unk>')

Returns

For models fine-tuned on ASR, returns the tuple of strings representing the output class labels.

Return type

Tuple[str]

Example
>>> import torchaudio
>>> torchaudio.pipelines.HUBERT_ASR_LARGE.get_labels()
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
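When preparing reference transcripts or building a custom decoder on top of these labels, a simple index lookup is often enough. A minimal sketch; the transcript string "HELLO|WORLD" is only an illustration:

>>> labels = torchaudio.pipelines.HUBERT_ASR_LARGE.get_labels()
>>> label_to_index = {label: i for i, label in enumerate(labels)}
>>>
>>> # Map a transcript to label indices; '|' marks word boundaries.
>>> indices = [label_to_index[c] for c in "HELLO|WORLD"]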
property sample_rate

Sample rate of the audio that the model is trained on.

Type

float

WAV2VEC2_ASR_BASE_10M

torchaudio.pipelines.WAV2VEC2_ASR_BASE_10M

Build “base” wav2vec2 model with an extra linear module.

Pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”), and fine-tuned for ASR on 10 minutes of transcribed audio from Libri-Light dataset [3] (“train-10min” subset).

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

WAV2VEC2_ASR_BASE_100H

torchaudio.pipelines.WAV2VEC2_ASR_BASE_100H

Build “base” wav2vec2 model with an extra linear module.

Pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”), and fine-tuned for ASR on 100 hours of transcribed audio from “train-clean-100” subset.

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

WAV2VEC2_ASR_BASE_960H

torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H

Build “base” wav2vec2 model with an extra linear module.

Pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”), and fine-tuned for ASR on the same audio with the corresponding transcripts.

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

WAV2VEC2_ASR_LARGE_10M

torchaudio.pipelines.WAV2VEC2_ASR_LARGE_10M

Build “large” wav2vec2 model with an extra linear module.

Pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”), and fine-tuned for ASR on 10 minutes of transcribed audio from Libri-Light dataset [3] (“train-10min” subset).

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

WAV2VEC2_ASR_LARGE_100H

torchaudio.pipelines.WAV2VEC2_ASR_LARGE_100H

Build “large” wav2vec2 model with an extra linear module.

Pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”), and fine-tuned for ASR on 100 hours of transcribed audio from the same dataset (“train-clean-100” subset).

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

WAV2VEC2_ASR_LARGE_960H

torchaudio.pipelines.WAV2VEC2_ASR_LARGE_960H

Build “large” wav2vec2 model with an extra linear module.

Pre-trained on 960 hours of unlabeled audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”), and fine-tuned for ASR on the same audio with the corresponding transcripts.

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

WAV2VEC2_ASR_LARGE_LV60K_10M

torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_10M

Build “large-lv60k” wav2vec2 model with an extra linear module.

Pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [3], and fine-tuned for ASR on 10 minutes of transcribed audio from the same dataset (“train-10min” subset).

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

WAV2VEC2_ASR_LARGE_LV60K_100H

torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_100H

Build “large-lv60k” wav2vec2 model with an extra linear module.

Pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [3], and fine-tuned for ASR on 100 hours of transcribed audio from LibriSpeech dataset [1] (“train-clean-100” subset).

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

WAV2VEC2_ASR_LARGE_LV60K_960H

torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H

Build “large-lv60k” wav2vec2 model with an extra linear module.

Pre-trained on 60,000 hours of unlabeled audio from Libri-Light [3] dataset, and fine-tuned for ASR on 960 hours of transcribed audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”).

Originally published by the authors of wav2vec 2.0 [2] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

HUBERT_ASR_LARGE

torchaudio.pipelines.HUBERT_ASR_LARGE

HuBERT model with “Large” configuration.

Pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [3], and fine-tuned for ASR on 960 hours of transcribed audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”).

Originally published by the authors of HuBERT [8] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

HUBERT_ASR_XLARGE

torchaudio.pipelines.HUBERT_ASR_XLARGE

HuBERT model with “Extra Large” configuration.

Pre-trained on 60,000 hours of unlabeled audio from Libri-Light dataset [3], and fine-tuned for ASR on 960 hours of transcribed audio from LibriSpeech dataset [1] (the combination of “train-clean-100”, “train-clean-360”, and “train-other-500”).

Originally published by the authors of HuBERT [8] under MIT License and redistributed with the same license. [License, Source]

Please refer to torchaudio.pipelines.Wav2Vec2ASRBundle() for the usage.

Tacotron2 Text-To-Speech

Tacotron2TTSBundle

class torchaudio.pipelines.Tacotron2TTSBundle[source]

Data class that bundles the information needed to use a pretrained Tacotron2 model and vocoder.

This class provides interfaces for instantiating the pretrained model along with the information necessary to retrieve pretrained weights and additional data to be used with the model.

The torchaudio library instantiates objects of this class, each of which represents a different pretrained model. Client code should access pretrained models via these instances.

Please see below for the usage and the available values.

Example - Character-based TTS pipeline with Tacotron2 and WaveRNN
>>> import torchaudio
>>>
>>> text = "Hello, T T S !"
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build processor, Tacotron2 and WaveRNN model
>>> processor = bundle.get_text_processor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> torchaudio.save('hello-tts.wav', waveforms[0], vocoder.sample_rate)
Example - Phoneme-based TTS pipeline with Tacotron2 and WaveRNN
>>>
>>> # Note:
>>> #     This bundle uses pre-trained DeepPhonemizer as
>>> #     the text pre-processor.
>>> #     Please install deep-phonemizer.
>>> #     See https://github.com/as-ideas/DeepPhonemizer
>>> #     The pretrained weight is automatically downloaded.
>>>
>>> import torchaudio
>>>
>>> text = "Hello, TTS!"
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
>>>
>>> # Build processor, Tacotron2 and WaveRNN model
>>> processor = bundle.get_text_processor()
Downloading:
100%|███████████████████████████████| 63.6M/63.6M [00:04<00:00, 15.3MB/s]
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> torchaudio.save('hello-tts.wav', waveforms[0], vocoder.sample_rate)
abstract get_text_processor(self, *, dl_kwargs=None) → torchaudio.pipelines.Tacotron2TTSBundle.TextProcessor [source]

Create a text processor

For character-based pipelines, this processor splits the input text by character. For phoneme-based pipelines, this processor converts the input text (graphemes) to phonemes.

If a pre-trained weight file is necessary, torch.hub.download_url_to_file() is used to download it.

Parameters

dl_kwargs (dictionary of keyword arguments) – Passed to torch.hub.download_url_to_file().

Returns

A callable which takes a string or a list of strings as input and returns a Tensor of encoded texts and a Tensor of valid lengths. The object also has a tokens property, which can be used to recover the tokenized form.

Return type

Tacotron2TTSBundle.TextProcessor

Example - Character-based
>>> text = [
>>>     "Hello World!",
>>>     "Text-to-speech!",
>>> ]
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>> processor = bundle.get_text_processor()
>>> input, lengths = processor(text)
>>>
>>> print(input)
tensor([[19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15,  2,  0,  0,  0],
        [31, 16, 35, 31,  1, 31, 26,  1, 30, 27, 16, 16, 14, 19,  2]],
       dtype=torch.int32)
>>>
>>> print(lengths)
tensor([12, 15], dtype=torch.int32)
>>>
>>> print([processor.tokens[i] for i in input[0, :lengths[0]]])
['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']
>>> print([processor.tokens[i] for i in input[1, :lengths[1]]])
['t', 'e', 'x', 't', '-', 't', 'o', '-', 's', 'p', 'e', 'e', 'c', 'h', '!']
Example - Phoneme-based
>>> text = [
>>>     "Hello world!",
>>>     "Text-to-speech!",
>>> ]
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
>>> processor = bundle.get_text_processor()
Downloading:
100%|███████████████████████████████| 63.6M/63.6M [00:04<00:00, 15.3MB/s]
>>> input, lengths = processor(text)
>>>
>>> print(input)
tensor([[54, 20, 65, 69, 11, 92, 44, 65, 38,  2,  0,  0,  0,  0],
        [81, 40, 64, 79, 81,  1, 81, 20,  1, 79, 77, 59, 37,  2]],
       dtype=torch.int32)
>>>
>>> print(lengths)
tensor([10, 14], dtype=torch.int32)
>>>
>>> print([processor.tokens[i] for i in input[0]])
['HH', 'AH', 'L', 'OW', ' ', 'W', 'ER', 'L', 'D', '!', '_', '_', '_', '_']
>>> print([processor.tokens[i] for i in input[1]])
['T', 'EH', 'K', 'S', 'T', '-', 'T', 'AH', '-', 'S', 'P', 'IY', 'CH', '!']
abstract get_tacotron2(self, *, dl_kwargs=None) → torchaudio.models.Tacotron2 [source]

Create a Tacotron2 model with pre-trained weight.

Parameters

dl_kwargs (dictionary of keyword arguments) – Passed to torch.hub.load_state_dict_from_url().

Returns

The resulting model.

Return type

Tacotron2

abstract get_vocoder(self, *, dl_kwargs=None) → torchaudio.pipelines.Tacotron2TTSBundle.Vocoder [source]

Create a vocoder module, based on either WaveRNN or GriffinLim.

If a pre-trained weight file is necessary, torch.hub.load_state_dict_from_url() is used to download it.

Parameters

dl_kwargs (dictionary of keyword arguments) – Passed to torch.hub.load_state_dict_from_url().

Returns

A vocoder module, which takes a spectrogram Tensor and an optional length Tensor, then returns the resulting waveform Tensor and an optional length Tensor.

Return type

Callable[[Tensor, Optional[Tensor]], Tuple[Tensor, Optional[Tensor]]]


Tacotron2TTSBundle - TextProcessor

class Tacotron2TTSBundle.TextProcessor[source]

Interface of the text-processing part of the Tacotron2 TTS pipeline

See torchaudio.pipelines.Tacotron2TTSBundle.get_text_processor() for the usage.

abstract __call__(texts: Union[str, List[str]]) → Tuple[torch.Tensor, torch.Tensor]

Encode the given (batch of) texts into numerical tensors

See torchaudio.pipelines.Tacotron2TTSBundle.get_text_processor() for the usage.

Parameters

texts (str or list of str) – The input texts.

Returns

Tensor:

The encoded texts. Shape: (batch, max length)

Tensor:

The valid length of each sample in the batch. Shape: (batch, ).

Return type

(Tensor, Tensor)

abstract property tokens

The tokens that each value in the processed tensor represents.

See torchaudio.pipelines.Tacotron2TTSBundle.get_text_processor() for the usage.

Type

List[str]

Tacotron2TTSBundle - Vocoder

class Tacotron2TTSBundle.Vocoder[source]

Interface of the vocoder part of the Tacotron2 TTS pipeline

See torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder() for the usage.

abstract __call__(specgrams: torch.Tensor, lengths: Optional[torch.Tensor] = None) → Tuple[torch.Tensor, Optional[torch.Tensor]]

Generate a waveform from the given input, such as a spectrogram

See torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder() for the usage.

Parameters
  • specgrams (Tensor) – The input spectrogram. Shape: (batch, frequency bins, time). The expected shape depends on the implementation.

  • lengths (Tensor, or None, optional) – The valid length of each sample in the batch. Shape: (batch, ). (Default: None)

Returns

Tensor:

The generated waveform. Shape: (batch, max length)

Tensor or None:

The valid length of each sample in the batch. Shape: (batch, ).

Return type

(Tensor, Optional[Tensor])

abstract property sample_rate

The sample rate of the resulting waveform

See torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder() for the usage.

Type

float

TACOTRON2_WAVERNN_PHONE_LJSPEECH

torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

Phoneme-based TTS pipeline with torchaudio.models.Tacotron2 and torchaudio.models.WaveRNN.

The text processor encodes the input texts based on phonemes. It uses DeepPhonemizer to convert graphemes to phonemes. The model (en_us_cmudict_forward) was trained on CMUDict.

Tacotron2 was trained on LJSpeech [9] for 1,500 epochs. You can find the training script here. The following parameters were used: win_length=1100, hop_length=275, n_fft=2048, mel_fmin=40, and mel_fmax=11025.
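For reference, these parameters roughly correspond to a mel spectrogram front end like the one sketched below. The 22,050 Hz sample rate (LJSpeech) and n_mels=80 are assumptions that are not stated above.

>>> import torchaudio
>>>
>>> # Illustrative sketch only; 22050 Hz (LJSpeech) and n_mels=80 are assumptions.
>>> mel_transform = torchaudio.transforms.MelSpectrogram(
...     sample_rate=22050,
...     n_fft=2048,
...     win_length=1100,
...     hop_length=275,
...     f_min=40.0,
...     f_max=11025.0,
...     n_mels=80,
... )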

The vocoder is based on torchaudio.models.WaveRNN. It was trained on 8-bit depth waveforms of LJSpeech [9] for 10,000 epochs. You can find the training script here.

Please refer to torchaudio.pipelines.Tacotron2TTSBundle() for the usage.

Example - “Hello world! T T S stands for Text to Speech!”

Spectrogram generated by Tacotron2

Example - “The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired,”

Spectrogram generated by Tacotron2

TACOTRON2_WAVERNN_CHAR_LJSPEECH

torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH

Character-based TTS pipeline with torchaudio.models.Tacotron2 and torchaudio.models.WaveRNN.

The text processor encodes the input texts character-by-character.

Tacotron2 was trained on LJSpeech [9] for 1,500 epochs. You can find the training script here. The following parameters were used: win_length=1100, hop_length=275, n_fft=2048, mel_fmin=40, and mel_fmax=11025.

The vocoder is based on torchaudio.models.WaveRNN. It was trained on 8-bit depth waveforms of LJSpeech [9] for 10,000 epochs. You can find the training script here.

Please refer to torchaudio.pipelines.Tacotron2TTSBundle() for the usage.

Example - “Hello world! T T S stands for Text to Speech!”

Spectrogram generated by Tacotron2

Example - “The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired,”

Spectrogram generated by Tacotron2

TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

Phoneme-based TTS pipeline with torchaudio.models.Tacotron2 and torchaudio.transforms.GriffinLim.

The text processor encodes the input texts based on phonemes. It uses DeepPhonemizer to convert graphemes to phonemes. The model (en_us_cmudict_forward) was trained on CMUDict.

Tacotron2 was trained on LJSpeech [9] for 1,500 epochs. You can find the training script here. The text processor is set to "english_phonemes".

The vocoder is based on torchaudio.transforms.GriffinLim.

Please refer to torchaudio.pipelines.Tacotron2TTSBundle() for the usage.
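Conceptually, a Griffin-Lim vocoder first maps the mel spectrogram back to a linear-frequency spectrogram and then reconstructs the phase iteratively. The sketch below illustrates the idea with torchaudio transforms; the parameter values are illustrative assumptions, and the bundle's vocoder also handles scaling details not shown here.

>>> import torch
>>> import torchaudio
>>>
>>> # Illustrative parameters only; the bundle configures these internally.
>>> n_fft, n_mels, sample_rate = 2048, 80, 22050
>>> inv_mel = torchaudio.transforms.InverseMelScale(
...     n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sample_rate)
>>> griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft)
>>>
>>> # Placeholder mel spectrogram; in practice, use the output of tacotron2.infer.
>>> specgram = torch.rand(1, n_mels, 100)
>>> linear_spec = inv_mel(specgram)      # (batch, n_fft // 2 + 1, time)
>>> waveform = griffin_lim(linear_spec)  # (batch, samples)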

Example - “Hello world! T T S stands for Text to Speech!”

Spectrogram generated by Tacotron2

Example - “The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired,”

Spectrogram generated by Tacotron2

TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH

torchaudio.pipelines.TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH

Character-based TTS pipeline with torchaudio.models.Tacotron2 and torchaudio.transforms.GriffinLim.

The text processor encodes the input texts character-by-character.

Tacotron2 was trained on LJSpeech [9] for 1,500 epochs. You can find the training script here. The default parameters were used.

The vocoder is based on torchaudio.transforms.GriffinLim.

Please refer to torchaudio.pipelines.Tacotron2TTSBundle() for the usage.

Example - “Hello world! T T S stands for Text to Speech!”

Spectrogram generated by Tacotron2

Example - “The examination and testimony of the experts enabled the Commission to conclude that five shots may have been fired,”

Spectrogram generated by Tacotron2

References

[1] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206–5210. 2015. doi:10.1109/ICASSP.2015.7178964.

[2] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: a framework for self-supervised learning of speech representations. 2020. arXiv:2006.11477.

[3] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, T. Likhomanenko, G. Synnaeve, A. Joulin, A. Mohamed, and E. Dupoux. Libri-Light: a benchmark for ASR with limited or no supervision. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7669–7673. 2020. https://github.com/facebookresearch/libri-light.

[4] Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. MLS: a large-scale multilingual dataset for speech research. Interspeech 2020, Oct 2020. doi:10.21437/Interspeech.2020-2826.

[5] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common Voice: a massively-multilingual speech corpus. 2020. arXiv:1912.06670.

[6] Mark John Francis Gales, Kate Knill, Anton Ragni, and Shakti Prasad Rath. Speech recognition and keyword spotting for low-resource languages: BABEL project research at CUED. In SLTU. 2014.

[7] Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised cross-lingual representation learning for speech recognition. 2020. arXiv:2006.13979.

[8] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: self-supervised speech representation learning by masked prediction of hidden units. 2021. arXiv:2106.07447.

[9] Keith Ito and Linda Johnson. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
