torchaudio.prototype.ctc_decoder

Decoder Class

CTCDecoder

class torchaudio.prototype.ctc_decoder.CTCDecoder[source]
This feature supports the following devices: CPU

Lexically constrained CTC beam search decoder from Flashlight [1].

Note

To build the decoder, please use factory function ctc_decoder().

Parameters
  • nbest (int) – number of best decodings to return

  • lexicon (Dict or None) – lexicon mapping of words to spellings, or None for lexicon free decoder

  • word_dict (_Dictionary) – dictionary of words

  • tokens_dict (_Dictionary) – dictionary of tokens

  • lm (_LM) – language model

  • decoder_options (_LexiconDecoderOptions or _LexiconFreeDecoderOptions) – parameters used for beam search decoding

  • blank_token (str) – token corresponding to blank

  • sil_token (str) – token corresponding to silence

  • unk_word (str) – word corresponding to unknown

__call__(self, emissions: torch.FloatTensor, lengths: Optional[torch.Tensor] = None) → List[List[torchaudio.prototype.ctc_decoder.Hypothesis]][source]
Parameters
  • emissions (torch.FloatTensor) – CPU tensor of shape (batch, frame, num_tokens) storing sequences of probability distributions over labels; output of the acoustic model.

  • lengths (Tensor or None, optional) – CPU tensor of shape (batch, ) storing the number of valid frames along the time axis for each sequence in the batch.

Returns

List of sorted best hypotheses for each audio sequence in the batch.

Return type

List[List[Hypothesis]]
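The role of lengths can be illustrated with a minimal sketch. It uses nested Python lists as a stand-in for the (batch, frame, num_tokens) tensor, and the helper name trim_to_valid_lengths is hypothetical; it is not part of torchaudio and only shows the idea that frames past each sequence's valid length are padding.

```python
from typing import List, Optional

Frame = List[float]  # one distribution over num_tokens labels


def trim_to_valid_lengths(
    emissions: List[List[Frame]], lengths: Optional[List[int]] = None
) -> List[List[Frame]]:
    """Keep only the valid (non-padded) frames of each sequence in the batch."""
    if lengths is None:
        # Assumption for this sketch: without lengths, every frame counts as valid.
        return [list(seq) for seq in emissions]
    if len(lengths) != len(emissions):
        raise ValueError("lengths must have one entry per batch item")
    return [list(seq[:n]) for seq, n in zip(emissions, lengths)]


# Batch of 2 sequences, 3 frames each, 2 tokens; the second is padded after 1 frame.
batch = [
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
    [[0.5, 0.5], [0.0, 0.0], [0.0, 0.0]],
]
trimmed = trim_to_valid_lengths(batch, lengths=[3, 1])
print([len(seq) for seq in trimmed])  # [3, 1]
```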

idxs_to_tokens(idxs: torch.LongTensor) → List[source]

Map raw token IDs into corresponding tokens

Parameters

idxs (LongTensor) – raw token IDs generated from decoder

Returns

tokens corresponding to the input IDs

Return type

List
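Conceptually, this mapping is an index lookup into the token vocabulary. The sketch below uses a plain list as a hypothetical vocabulary and a hypothetical helper ids_to_tokens; it is not the method's actual implementation, which operates on the decoder's own tokens dictionary.

```python
from typing import List, Sequence


def ids_to_tokens(idxs: Sequence[int], vocab: Sequence[str]) -> List[str]:
    """Look up each raw token ID in the vocabulary."""
    return [vocab[i] for i in idxs]


# Hypothetical vocabulary: index 0 is the blank token, index 1 the silence token.
vocab = ["-", "|", "a", "c", "t"]
print(ids_to_tokens([1, 3, 2, 4, 1], vocab))  # ['|', 'c', 'a', 't', '|']
```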

Hypothesis

class torchaudio.prototype.ctc_decoder.Hypothesis(tokens: torch.LongTensor, words: List[str], score: float, timesteps: torch.IntTensor)[source]

Represents a hypothesis generated by the CTC beam search decoder CTCDecoder.

Variables
  • tokens (torch.LongTensor) – Predicted sequence of token IDs. Shape (L, ), where L is the length of the output sequence

  • words (List[str]) – List of predicted words

  • score (float) – Score corresponding to hypothesis

  • timesteps (torch.IntTensor) – Timesteps corresponding to the tokens. Shape (L, ), where L is the length of the output sequence
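A short sketch of how these fields might be consumed. HypothesisSketch is a hypothetical stand-in with the same four fields (lists replace tensors so the snippet runs without torch), and the 0.02 seconds-per-frame value is an assumption that depends on the acoustic model's frame stride.

```python
from typing import List, NamedTuple


class HypothesisSketch(NamedTuple):
    """Stand-in for Hypothesis; plain lists replace the tensors."""
    tokens: List[int]
    words: List[str]
    score: float
    timesteps: List[int]


def transcript(hyp: HypothesisSketch) -> str:
    """Join the predicted words into a single string."""
    return " ".join(hyp.words)


def token_times_seconds(hyp: HypothesisSketch, seconds_per_frame: float) -> List[float]:
    """Convert frame indices to seconds, given an assumed frame duration."""
    return [t * seconds_per_frame for t in hyp.timesteps]


# Made-up n-best list for one utterance; the best hypothesis comes first.
nbest = [
    HypothesisSketch([1, 3, 2, 4, 1], ["cat"], -3.2, [0, 4, 7, 9, 12]),
    HypothesisSketch([1, 3, 2, 5, 1], ["cap"], -5.9, [0, 4, 7, 9, 12]),
]
best = nbest[0]
print(transcript(best))  # cat
times = token_times_seconds(best, seconds_per_frame=0.02)  # assumed 20 ms frames
```

tokens and timesteps have the same length L, so they can be zipped to pair each token ID with the frame at which it was emitted.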

Factory Function

ctc_decoder

torchaudio.prototype.ctc_decoder.ctc_decoder(lexicon: Optional[str], tokens: Union[str, List[str]], lm: Optional[str] = None, nbest: int = 1, beam_size: int = 50, beam_size_token: Optional[int] = None, beam_threshold: float = 50, lm_weight: float = 2, word_score: float = 0, unk_score: float = -inf, sil_score: float = 0, log_add: bool = False, blank_token: str = '-', sil_token: str = '|', unk_word: str = '<unk>')[source]

Builds a lexically constrained CTC beam search decoder from Flashlight [1].

Parameters
  • lexicon (str or None) – lexicon file containing the possible words and corresponding spellings. Each line consists of a word and its space separated spelling. If None, uses lexicon free decoding.

  • tokens (str or List[str]) – file or list containing valid tokens. If a file is used, tokens that map to the same index are expected to be on the same line

  • lm (str or None, optional) – file containing language model, or None if not using a language model

  • nbest (int, optional) – number of best decodings to return (Default: 1)

  • beam_size (int, optional) – max number of hypotheses to hold after each decode step (Default: 50)

  • beam_size_token (int, optional) – max number of tokens to consider at each decode step. If None, it is set to the total number of tokens (Default: None)

  • beam_threshold (float, optional) – threshold for pruning hypotheses (Default: 50)

  • lm_weight (float, optional) – weight of language model (Default: 2)

  • word_score (float, optional) – word insertion score (Default: 0)

  • unk_score (float, optional) – unknown word insertion score (Default: -inf)

  • sil_score (float, optional) – silence insertion score (Default: 0)

  • log_add (bool, optional) – whether or not to use logadd when merging hypotheses (Default: False)

  • blank_token (str, optional) – token corresponding to blank (Default: “-”)

  • sil_token (str, optional) – token corresponding to silence (Default: “|”)

  • unk_word (str, optional) – word corresponding to unknown (Default: “<unk>”)
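The lexicon file format described above (a word followed by its space-separated spelling on each line) can be read with a few lines of Python. This is a minimal sketch, not the parser the decoder uses; parse_lexicon is a hypothetical helper, and both the trailing "|" in the sample spellings and the support for several spellings per word are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_lexicon(lines: Iterable[str]) -> Dict[str, List[List[str]]]:
    """Map each word to the list of spellings found for it."""
    lexicon: Dict[str, List[List[str]]] = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, spelling = parts[0], parts[1:]
        lexicon[word].append(spelling)
    return dict(lexicon)


# Made-up lexicon content; a real file would be read with open(path).
sample_lines = [
    "cat c a t |",
    "cab c a b |",
    "",
]
lexicon = parse_lexicon(sample_lines)
print(lexicon["cat"])  # [['c', 'a', 't', '|']]
```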

Returns

decoder

Return type

CTCDecoder
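To give an intuition for beam_size and beam_threshold, the sketch below applies a generic beam search pruning step: drop hypotheses that score more than beam_threshold below the current best, then keep at most beam_size of the rest. This is an illustrative assumption about how such parameters typically act; the exact pruning rule inside Flashlight may differ, and prune_beam is a hypothetical helper.

```python
from typing import List, Tuple

Hyp = Tuple[str, float]  # (partial transcript, log-domain score)


def prune_beam(hyps: List[Hyp], beam_size: int, beam_threshold: float) -> List[Hyp]:
    """Generic pruning: threshold relative to the best score, then top-k."""
    if not hyps:
        return []
    best_score = max(score for _, score in hyps)
    kept = [(text, score) for text, score in hyps if score >= best_score - beam_threshold]
    kept.sort(key=lambda h: h[1], reverse=True)
    return kept[:beam_size]


candidates = [("cat", -1.0), ("cab", -2.5), ("cap", -60.0), ("car", -3.0)]
print(prune_beam(candidates, beam_size=2, beam_threshold=50.0))
# [('cat', -1.0), ('cab', -2.5)]
```

A smaller beam_size or beam_threshold makes decoding faster at the risk of discarding a hypothesis that would have scored best later on.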

Example
>>> from torchaudio.prototype.ctc_decoder import ctc_decoder
>>> decoder = ctc_decoder(
...     lexicon="lexicon.txt",
...     tokens="tokens.txt",
...     lm="kenlm.bin",
... )
>>> results = decoder(emissions)  # List of shape (B, nbest) of Hypotheses
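The log_add parameter controls how the scores of two hypotheses are combined when they are merged. As a sketch of the difference, assuming scores are log-domain values: with log_add the merged score is log(exp(a) + exp(b)), otherwise the larger of the two scores is kept. merge_scores is a hypothetical helper for illustration, not a torchaudio function.

```python
import math


def merge_scores(a: float, b: float, log_add: bool) -> float:
    """Combine two log-domain scores by log-sum-exp or by taking the max."""
    if not log_add:
        return max(a, b)
    hi, lo = max(a, b), min(a, b)
    if hi == -math.inf:
        return -math.inf  # both scores are -inf
    # Numerically stable log(exp(hi) + exp(lo)).
    return hi + math.log1p(math.exp(lo - hi))


print(merge_scores(-1.0, -3.0, log_add=False))  # -1.0
merged = merge_scores(-1.0, -1.0, log_add=True)  # equals -1.0 + log(2)
```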

Utility Function

download_pretrained_files

torchaudio.prototype.ctc_decoder.download_pretrained_files(model: str)[source]

Retrieves pretrained data files used for the CTC decoder.

Parameters

model (str) – pretrained language model to download. Options: [“librispeech-3-gram”, “librispeech-4-gram”, “librispeech”]

Returns

Object with the following attributes:
  • lm – path corresponding to the downloaded language model, or None if the model is not associated with an lm

  • lexicon – path corresponding to the downloaded lexicon file

  • tokens – path corresponding to the downloaded tokens file
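The three attributes line up with the lexicon, tokens, and lm parameters of ctc_decoder(). The sketch below shows that mapping with a stand-in object so it runs without torchaudio or network access; decoder_kwargs_from_files is a hypothetical helper and the paths are made up.

```python
from types import SimpleNamespace
from typing import Any, Dict


def decoder_kwargs_from_files(files: Any, **overrides: Any) -> Dict[str, Any]:
    """Build ctc_decoder keyword arguments from the downloaded file paths."""
    kwargs: Dict[str, Any] = {
        "lexicon": files.lexicon,
        "tokens": files.tokens,
        "lm": files.lm,  # may be None if the model has no language model
    }
    kwargs.update(overrides)
    return kwargs


# Stand-in for the returned object; these paths are hypothetical.
files = SimpleNamespace(
    lm="/cache/lm.bin",
    lexicon="/cache/lexicon.txt",
    tokens="/cache/tokens.txt",
)
kwargs = decoder_kwargs_from_files(files, nbest=3)

# With torchaudio installed, the real calls would look like:
#   files = download_pretrained_files("librispeech-4-gram")
#   decoder = ctc_decoder(**decoder_kwargs_from_files(files, nbest=3))
```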

References

[1]

Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, and others. Flashlight: enabling innovation in tools for machine learning. arXiv preprint arXiv:2201.12465, 2022.
