torchaudio.models.decoder¶
Decoder Class¶
CTCDecoder¶
-
class
torchaudio.models.decoder.
CTCDecoder
(nbest: int, lexicon: Optional[Dict], word_dict: torchaudio._torchaudio_decoder._Dictionary, tokens_dict: torchaudio._torchaudio_decoder._Dictionary, lm: torchaudio._torchaudio_decoder._LM, decoder_options: Union[torchaudio._torchaudio_decoder._LexiconDecoderOptions, torchaudio._torchaudio_decoder._LexiconFreeDecoderOptions], blank_token: str, sil_token: str, unk_word: str)[source]¶ -
CTC beam search decoder from Flashlight [1].
Note
To build the decoder, please use the factory function
ctc_decoder()
.- Parameters
nbest (int) – number of best decodings to return
lexicon (Dict or None) – lexicon mapping of words to spellings, or None for lexicon-free decoder
word_dict (_Dictionary) – dictionary of words
tokens_dict (_Dictionary) – dictionary of tokens
lm (_LM) – language model
decoder_options (_LexiconDecoderOptions or _LexiconFreeDecoderOptions) – parameters used for beam search decoding
blank_token (str) – token corresopnding to blank
sil_token (str) – token corresponding to silence
unk_word (str) – word corresponding to unknown
- Tutorials using
CTCDecoder
:
-
__call__
(self, emissions: torch.FloatTensor, lengths: Optional[torch.Tensor] = None) → List[List[torchaudio.models.decoder.CTCHypothesis]][source]¶ - Parameters
emissions (torch.FloatTensor) – CPU tensor of shape (batch, frame, num_tokens) storing sequences of probability distribution over labels; output of acoustic model.
lengths (Tensor or None, optional) – CPU tensor of shape (batch, ) storing the valid length of in time axis of the output Tensor in each batch.
- Returns
List of sorted best hypotheses for each audio sequence in the batch.
- Return type
List[List[CTCHypothesis]]
CTCHypothesis¶
-
class
torchaudio.models.decoder.
CTCHypothesis
(tokens: torch.LongTensor, words: List[str], score: float, timesteps: torch.IntTensor)[source]¶ Represents hypothesis generated by CTC beam search decoder
CTCDecoder()
.- Variables
tokens (torch.LongTensor) – Predicted sequence of token IDs. Shape (L, ), where L is the length of the output sequence
words (List[str]) – List of predicted words
score (float) – Score corresponding to hypothesis
timesteps (torch.IntTensor) – Timesteps corresponding to the tokens. Shape (L, ), where L is the length of the output sequence
- Tutorials using
CTCHypothesis
:
Factory Function¶
ctc_decoder¶
-
class
torchaudio.models.decoder.
ctc_decoder
(lexicon: Optional[str], tokens: Union[str, List[str]], lm: Optional[str] = None, nbest: int = 1, beam_size: int = 50, beam_size_token: Optional[int] = None, beam_threshold: float = 50, lm_weight: float = 2, word_score: float = 0, unk_score: float = - inf, sil_score: float = 0, log_add: bool = False, blank_token: str = '-', sil_token: str = '|', unk_word: str = '<unk>')[source]¶ Builds CTC beam search decoder from Flashlight [1].
- Parameters
lexicon (str or None) – lexicon file containing the possible words and corresponding spellings. Each line consists of a word and its space separated spelling. If None, uses lexicon-free decoding.
tokens (str or List[str]) – file or list containing valid tokens. If using a file, the expected format is for tokens mapping to the same index to be on the same line
lm (str or None, optional) – file containing language model, or None if not using a language model
nbest (int, optional) – number of best decodings to return (Default: 1)
beam_size (int, optional) – max number of hypos to hold after each decode step (Default: 50)
beam_size_token (int, optional) – max number of tokens to consider at each decode step. If None, it is set to the total number of tokens (Default: None)
beam_threshold (float, optional) – threshold for pruning hypothesis (Default: 50)
lm_weight (float, optional) – weight of language model (Default: 2)
word_score (float, optional) – word insertion score (Default: 0)
unk_score (float, optional) – unknown word insertion score (Default: -inf)
sil_score (float, optional) – silence insertion score (Default: 0)
log_add (bool, optional) – whether or not to use logadd when merging hypotheses (Default: False)
blank_token (str, optional) – token corresponding to blank (Default: “-“)
sil_token (str, optional) – token corresponding to silence (Default: “|”)
unk_word (str, optional) – word corresponding to unknown (Default: “<unk>”)
- Returns
decoder
- Return type
- Example
>>> decoder = ctc_decoder( >>> lexicon="lexicon.txt", >>> tokens="tokens.txt", >>> lm="kenlm.bin", >>> ) >>> results = decoder(emissions) # List of shape (B, nbest) of Hypotheses
- Tutorials using
ctc_decoder
:
Utility Function¶
download_pretrained_files¶
-
class
torchaudio.models.decoder.
download_pretrained_files
(model: str)[source]¶ Retrieves pretrained data files used for CTC decoder.
- Parameters
model (str) – pretrained language model to download. Options: [“librispeech-3-gram”, “librispeech-4-gram”, “librispeech”]
- Returns
- Object with the following attributes
- lm:
path corresponding to downloaded language model, or None if the model is not associated with an lm
- lexicon:
path corresponding to downloaded lexicon file
- tokens:
path corresponding to downloaded tokens file
- Tutorials using
download_pretrained_files
: