Tacotron2¶
- class torchaudio.models.Tacotron2(mask_padding: bool = False, n_mels: int = 80, n_symbol: int = 148, n_frames_per_step: int = 1, symbol_embedding_dim: int = 512, encoder_embedding_dim: int = 512, encoder_n_convolution: int = 3, encoder_kernel_size: int = 5, decoder_rnn_dim: int = 1024, decoder_max_step: int = 2000, decoder_dropout: float = 0.1, decoder_early_stopping: bool = True, attention_rnn_dim: int = 1024, attention_hidden_dim: int = 128, attention_location_n_filter: int = 32, attention_location_kernel_size: int = 31, attention_dropout: float = 0.1, prenet_dim: int = 256, postnet_n_convolution: int = 5, postnet_kernel_size: int = 5, postnet_embedding_dim: int = 512, gate_threshold: float = 0.5)[source]¶
Tacotron2 model from Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions [Shen et al., 2018] based on the implementation from Nvidia Deep Learning Examples.
See also
torchaudio.pipelines.Tacotron2TTSBundle: TTS pipeline with pretrained model.
- Parameters:
  - mask_padding (bool, optional) – Use mask padding (Default: False).
  - n_mels (int, optional) – Number of mel bins (Default: 80).
  - n_symbol (int, optional) – Number of symbols for the input text (Default: 148).
  - n_frames_per_step (int, optional) – Number of frames processed per step; only 1 is supported (Default: 1).
  - symbol_embedding_dim (int, optional) – Input embedding dimension (Default: 512).
  - encoder_n_convolution (int, optional) – Number of encoder convolutions (Default: 3).
  - encoder_kernel_size (int, optional) – Encoder kernel size (Default: 5).
  - encoder_embedding_dim (int, optional) – Encoder embedding dimension (Default: 512).
  - decoder_rnn_dim (int, optional) – Number of units in decoder LSTM (Default: 1024).
  - decoder_max_step (int, optional) – Maximum number of output mel spectrogram frames (Default: 2000).
  - decoder_dropout (float, optional) – Dropout probability for decoder LSTM (Default: 0.1).
  - decoder_early_stopping (bool, optional) – Stop decoding once all samples in the batch have finished (Default: True).
  - attention_rnn_dim (int, optional) – Number of units in attention LSTM (Default: 1024).
  - attention_hidden_dim (int, optional) – Dimension of attention hidden representation (Default: 128).
  - attention_location_n_filter (int, optional) – Number of filters for attention model (Default: 32).
  - attention_location_kernel_size (int, optional) – Kernel size for attention model (Default: 31).
  - attention_dropout (float, optional) – Dropout probability for attention LSTM (Default: 0.1).
  - prenet_dim (int, optional) – Number of ReLU units in prenet layers (Default: 256).
  - postnet_n_convolution (int, optional) – Number of postnet convolutions (Default: 5).
  - postnet_kernel_size (int, optional) – Postnet kernel size (Default: 5).
  - postnet_embedding_dim (int, optional) – Postnet embedding dimension (Default: 512).
  - gate_threshold (float, optional) – Probability threshold for stop token (Default: 0.5).
- Tutorials using Tacotron2:
  - Text-to-Speech with Tacotron2
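The following is a minimal instantiation sketch, not part of the original API reference. The keyword values simply restate the defaults listed above; in practice, n_symbol must match the vocabulary size of whatever text processor produces the input token IDs.
>>> import torch
>>> from torchaudio.models import Tacotron2
>>> # Keyword values below are the documented defaults, spelled out for illustration.
>>> tacotron2 = Tacotron2(n_symbol=148, n_mels=80, n_frames_per_step=1)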
Methods¶
forward¶
- Tacotron2.forward(tokens: Tensor, token_lengths: Tensor, mel_specgram: Tensor, mel_specgram_lengths: Tensor) → Tuple[Tensor, Tensor, Tensor, Tensor] [source]¶
Pass the input through the Tacotron2 model. This is in teacher forcing mode, which is generally used for training.
The input tokens should be padded with zeros to length max of token_lengths. The input mel_specgram should be padded with zeros to length max of mel_specgram_lengths.
- Parameters:
  - tokens (Tensor) – The input tokens to Tacotron2 with shape (n_batch, max of token_lengths).
  - token_lengths (Tensor) – The valid length of each sample in tokens with shape (n_batch, ).
  - mel_specgram (Tensor) – The target mel spectrogram with shape (n_batch, n_mels, max of mel_specgram_lengths).
  - mel_specgram_lengths (Tensor) – The length of each mel spectrogram with shape (n_batch, ).
- Returns:
  - Tensor: Mel spectrogram before Postnet with shape (n_batch, n_mels, max of mel_specgram_lengths).
  - Tensor: Mel spectrogram after Postnet with shape (n_batch, n_mels, max of mel_specgram_lengths).
  - Tensor: The output for the stop token at each time step with shape (n_batch, max of mel_specgram_lengths).
  - Tensor: Sequence of attention weights from the decoder with shape (n_batch, max of mel_specgram_lengths, max of token_lengths).
- Return type:
  Tuple[Tensor, Tensor, Tensor, Tensor]
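Below is a hedged sketch (not from the official docs) of a teacher-forcing call on a randomly initialized model with dummy tensors, meant only to illustrate the expected shapes; a real training step would use encoded text, ground-truth mel spectrograms, a loss, and an optimizer.
>>> import torch
>>> from torchaudio.models import Tacotron2
>>> tacotron2 = Tacotron2()  # default hyperparameters (n_mels=80, n_symbol=148)
>>> n_batch, max_token_len, n_mels, max_mel_len = 2, 50, 80, 300
>>> tokens = torch.randint(0, 148, (n_batch, max_token_len))
>>> tokens[1, 42:] = 0  # zero-pad past each sample's valid length
>>> token_lengths = torch.tensor([50, 42])  # decreasing order, safe for sequence packing
>>> mel_specgram = torch.rand(n_batch, n_mels, max_mel_len)
>>> mel_specgram_lengths = torch.tensor([300, 270])
>>> mel_pre, mel_post, gate_outputs, alignments = tacotron2(
...     tokens, token_lengths, mel_specgram, mel_specgram_lengths
... )
>>> mel_post.shape
torch.Size([2, 80, 300])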
infer¶
- Tacotron2.infer(tokens: Tensor, lengths: Optional[Tensor] = None) → Tuple[Tensor, Tensor, Tensor] [source]¶
Using Tacotron2 for inference. The input is a batch of encoded sentences (tokens) and their corresponding lengths (lengths). The output is the generated mel spectrograms, their corresponding lengths, and the attention weights from the decoder.
The input tokens should be padded with zeros to length max of lengths.
- Parameters:
  - tokens (Tensor) – The input tokens to Tacotron2 with shape (n_batch, max of lengths).
  - lengths (Tensor or None, optional) – The valid length of each sample in tokens with shape (n_batch, ). If None, it is assumed that all the tokens are valid. Default: None
- Returns:
  - Tensor: The predicted mel spectrogram with shape (n_batch, n_mels, max of mel_specgram_lengths).
  - Tensor: The length of the predicted mel spectrogram with shape (n_batch, ).
  - Tensor: Sequence of attention weights from the decoder with shape (n_batch, max of mel_specgram_lengths, max of lengths).
- Return type:
  Tuple[Tensor, Tensor, Tensor]
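A hedged end-to-end inference sketch, assuming the pretrained TACOTRON2_WAVERNN_CHAR_LJSPEECH bundle in torchaudio.pipelines is available in the installed torchaudio version (it downloads weights on first use). With a randomly initialized model, infer would instead decode noise for up to decoder_max_step frames.
>>> import torch
>>> from torchaudio.pipelines import TACOTRON2_WAVERNN_CHAR_LJSPEECH as bundle
>>> processor = bundle.get_text_processor()    # text -> (tokens, lengths)
>>> tacotron2 = bundle.get_tacotron2().eval()  # pretrained Tacotron2
>>> tokens, lengths = processor("Hello world!")
>>> with torch.inference_mode():
...     mel_specgram, mel_lengths, alignments = tacotron2.infer(tokens, lengths)
>>> mel_specgram.shape[:2]  # (n_batch, n_mels); the time axis depends on the input
torch.Size([1, 80])
The resulting mel spectrogram can then be turned into a waveform with the bundle's vocoder (bundle.get_vocoder()), as shown in the Text-to-Speech with Tacotron2 tutorial referenced above.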