

Decoder Class


class torchaudio.models.decoder.CTCDecoder(nbest: int, lexicon: Optional[Dict], word_dict: torchaudio._torchaudio_decoder._Dictionary, tokens_dict: torchaudio._torchaudio_decoder._Dictionary, lm: torchaudio._torchaudio_decoder._LM, decoder_options: Union[torchaudio._torchaudio_decoder._LexiconDecoderOptions, torchaudio._torchaudio_decoder._LexiconFreeDecoderOptions], blank_token: str, sil_token: str, unk_word: str)
This feature supports the following devices: CPU

CTC beam search decoder from Flashlight [1].


To build the decoder, please use the factory function ctc_decoder().

  • nbest (int) – number of best decodings to return

  • lexicon (Dict or None) – lexicon mapping of words to spellings, or None for lexicon-free decoder

  • word_dict (_Dictionary) – dictionary of words

  • tokens_dict (_Dictionary) – dictionary of tokens

  • lm (_LM) – language model

  • decoder_options (_LexiconDecoderOptions or _LexiconFreeDecoderOptions) – parameters used for beam search decoding

  • blank_token (str) – token corresponding to blank

  • sil_token (str) – token corresponding to silence

  • unk_word (str) – word corresponding to unknown

__call__(self, emissions: torch.FloatTensor, lengths: Optional[torch.Tensor] = None) → List[List[torchaudio.models.decoder.CTCHypothesis]]
  • emissions (torch.FloatTensor) – CPU tensor of shape (batch, frame, num_tokens) storing sequences of probability distributions over labels; output of the acoustic model.

  • lengths (Tensor or None, optional) – CPU tensor of shape (batch, ) storing the valid length along the time axis of each emission sequence in the batch.
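The relationship between the padded emissions and lengths can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `pad_emissions`, the zero padding value, and the use of nested lists in place of tensors are choices made for the example and are not part of torchaudio.

```python
def pad_emissions(sequences, pad_value=0.0):
    """Pad variable-length emission sequences to a common frame count.

    sequences: list of utterances, each a list of frames, each frame a list
    of num_tokens values. Returns (padded, lengths), where padded has shape
    (batch, max_frames, num_tokens) and lengths[b] is the number of valid
    (unpadded) frames of utterance b.
    """
    num_tokens = len(sequences[0][0])
    max_frames = max(len(seq) for seq in sequences)
    lengths = [len(seq) for seq in sequences]
    padded = []
    for seq in sequences:
        # Hypothetical padding scheme: append all-pad_value frames at the end.
        padding = [[pad_value] * num_tokens for _ in range(max_frames - len(seq))]
        padded.append([list(frame) for frame in seq] + padding)
    return padded, lengths


shorter = [[0.1, 0.9], [0.8, 0.2]]              # 2 frames, 2 tokens
longer = [[0.5, 0.5], [0.3, 0.7], [0.6, 0.4]]   # 3 frames, 2 tokens
emissions, lengths = pad_emissions([shorter, longer])
```

With PyTorch available, the nested lists could then be wrapped as `torch.tensor(emissions, dtype=torch.float32)` and `torch.tensor(lengths)` before being passed to the decoder, so that padded frames past `lengths[b]` are treated as invalid.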


Returns

List of sorted best hypotheses for each audio sequence in the batch.

Return type

List[List[CTCHypothesis]]

idxs_to_tokens(idxs: torch.LongTensor) → List

Map raw token IDs into the corresponding tokens.


idxs (LongTensor) – raw token IDs generated from decoder


Returns

tokens corresponding to the input IDs

Return type

List


class torchaudio.models.decoder.CTCHypothesis(tokens: torch.LongTensor, words: List[str], score: float, timesteps: torch.IntTensor)

Represents a hypothesis generated by the CTC beam search decoder CTCDecoder.

  • tokens (torch.LongTensor) – Predicted sequence of token IDs. Shape (L, ), where L is the length of the output sequence

  • words (List[str]) – List of predicted words

  • score (float) – Score corresponding to hypothesis

  • timesteps (torch.IntTensor) – Timesteps corresponding to the tokens. Shape (L, ), where L is the length of the output sequence
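The nested result structure returned by the decoder can be sketched with a stand-in type. The `Hypothesis` class and `best_transcripts` helper below are hypothetical names for illustration only; the class mirrors the documented fields but uses plain lists instead of tensors, and the sample values are made up. The sketch assumes each inner list is sorted best-first.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    """Hypothetical stand-in mirroring the documented CTCHypothesis fields."""
    tokens: List[int]      # predicted token IDs, length L
    words: List[str]       # predicted words
    score: float           # score of this hypothesis
    timesteps: List[int]   # one timestep per token, length L


def best_transcripts(results):
    # results[b] holds the hypotheses for batch item b; index 0 is taken
    # as the top hypothesis (assumes best-first ordering).
    return [" ".join(hyps[0].words) for hyps in results]


# Made-up decoder output for a batch of two utterances.
results = [
    [
        Hypothesis([1, 2], ["hello", "world"], -3.2, [0, 4]),
        Hypothesis([1], ["hello"], -5.0, [0]),
    ],
    [Hypothesis([3], ["hi"], -1.1, [2])],
]
transcripts = best_transcripts(results)
```

The same indexing pattern, `results[b][0].words`, would apply to real decoder output, where `tokens` and `timesteps` are tensors of shape (L, ).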


Factory Function


torchaudio.models.decoder.ctc_decoder(lexicon: Optional[str], tokens: Union[str, List[str]], lm: Optional[str] = None, nbest: int = 1, beam_size: int = 50, beam_size_token: Optional[int] = None, beam_threshold: float = 50, lm_weight: float = 2, word_score: float = 0, unk_score: float = -inf, sil_score: float = 0, log_add: bool = False, blank_token: str = '-', sil_token: str = '|', unk_word: str = '<unk>')

Builds a CTC beam search decoder from Flashlight [1].

  • lexicon (str or None) – lexicon file containing the possible words and corresponding spellings. Each line consists of a word and its space-separated spelling. If None, uses lexicon-free decoding.

  • tokens (str or List[str]) – file or list containing valid tokens. If a file is given, tokens that map to the same index must appear on the same line

  • lm (str or None, optional) – file containing language model, or None if not using a language model

  • nbest (int, optional) – number of best decodings to return (Default: 1)

  • beam_size (int, optional) – maximum number of hypotheses to keep after each decode step (Default: 50)

  • beam_size_token (int, optional) – maximum number of tokens to consider at each decode step. If None, it is set to the total number of tokens (Default: None)

  • beam_threshold (float, optional) – threshold for pruning hypotheses (Default: 50)

  • lm_weight (float, optional) – weight of language model (Default: 2)

  • word_score (float, optional) – word insertion score (Default: 0)

  • unk_score (float, optional) – unknown word insertion score (Default: -inf)

  • sil_score (float, optional) – silence insertion score (Default: 0)

  • log_add (bool, optional) – whether or not to use logadd when merging hypotheses (Default: False)

  • blank_token (str, optional) – token corresponding to blank (Default: "-")

  • sil_token (str, optional) – token corresponding to silence (Default: "|")

  • unk_word (str, optional) – word corresponding to unknown (Default: "<unk>")



Return type

CTCDecoder

>>> from torchaudio.models.decoder import ctc_decoder
>>> decoder = ctc_decoder(
...     lexicon="lexicon.txt",
...     tokens="tokens.txt",
...     lm="kenlm.bin",
... )
>>> results = decoder(emissions)  # List of shape (B, nbest) of CTCHypothesis
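The lexicon and tokens file formats described in the parameter list can be sketched with a small helper. This is a minimal illustration, not part of torchaudio: the helper names `spell` and `write_decoder_files`, the letter-based spelling, and the trailing silence token in each spelling are assumptions made for the example.

```python
import os
import tempfile


def spell(word, sil_token="|"):
    # Assumed letter-based spelling: one token per character, followed by
    # the silence token to mark the word boundary.
    return " ".join(list(word) + [sil_token])


def write_decoder_files(words, directory, blank_token="-", sil_token="|"):
    """Write tokens.txt (one token per line, so one token per index) and
    lexicon.txt (a word followed by its space-separated spelling)."""
    letters = sorted({ch for word in words for ch in word})
    tokens = [blank_token, sil_token] + letters
    tokens_path = os.path.join(directory, "tokens.txt")
    lexicon_path = os.path.join(directory, "lexicon.txt")
    with open(tokens_path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens) + "\n")
    with open(lexicon_path, "w", encoding="utf-8") as f:
        for word in sorted(set(words)):
            f.write(f"{word} {spell(word, sil_token)}\n")
    return lexicon_path, tokens_path


with tempfile.TemporaryDirectory() as tmp:
    lexicon_path, tokens_path = write_decoder_files(["cat", "act"], tmp)
    with open(lexicon_path, encoding="utf-8") as f:
        lexicon_lines = f.read().splitlines()
    with open(tokens_path, encoding="utf-8") as f:
        token_lines = f.read().splitlines()
```

The resulting paths could then be passed as the `lexicon` and `tokens` arguments; the token set must match the label set of the acoustic model that produces the emissions.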

Utility Function


torchaudio.models.decoder.download_pretrained_files(model: str)

Retrieves pretrained data files used for the CTC decoder.


model (str) – pretrained language model to download. Options: ["librispeech-3-gram", "librispeech-4-gram", "librispeech"]


Object with the following attributes

path corresponding to downloaded language model, or None if the model is not associated with an lm


path corresponding to downloaded lexicon file


path corresponding to downloaded tokens file
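A possible way to combine the downloaded files with the factory function is sketched below. The wrapper name `build_pretrained_decoder` is hypothetical, and the attribute names `lexicon`, `tokens`, and `lm` on the returned object are not spelled out in the text above; they are assumed here. Calling the function requires torchaudio and network access, so the imports are deferred until the call.

```python
def build_pretrained_decoder(model="librispeech-4-gram", nbest=3, beam_size=50):
    """Hypothetical convenience wrapper: download pretrained files and
    build a lexicon-based CTC decoder from them."""
    # Imported lazily so that defining this helper does not need torchaudio.
    from torchaudio.models.decoder import ctc_decoder, download_pretrained_files

    files = download_pretrained_files(model)
    return ctc_decoder(
        lexicon=files.lexicon,  # assumed attribute name
        tokens=files.tokens,    # assumed attribute name
        lm=files.lm,            # assumed attribute name; may be None
        nbest=nbest,
        beam_size=beam_size,
    )
```

A call such as `decoder = build_pretrained_decoder()` would then yield a decoder that can be applied to CPU emissions as in the factory function example.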

Tutorials using download_pretrained_files:



[1] Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, and others. Flashlight: enabling innovation in tools for machine learning. arXiv preprint arXiv:2201.12465, 2022.

