RNNTBeamSearch
- class torchaudio.models.RNNTBeamSearch(model: RNNT, blank: int, temperature: float = 1.0, hypo_sort_key: Optional[Callable[[Tuple[List[int], Tensor, List[List[Tensor]], float]], float]] = None, step_max_tokens: int = 100)
Beam search decoder for RNN-T model.
See also
torchaudio.pipelines.RNNTBundle: ASR pipeline with pretrained model.
- Parameters:
model (RNNT) – RNN-T model to use.
blank (int) – index of blank token in vocabulary.
temperature (float, optional) – temperature to apply to joint network output. Larger values yield more uniform samples. (Default: 1.0)
hypo_sort_key (Callable[[Hypothesis], float] or None, optional) – callable that computes a score for a given hypothesis to rank hypotheses by. If None, defaults to a callable that returns the hypothesis score normalized by token sequence length. (Default: None)
step_max_tokens (int, optional) – maximum number of tokens to emit per input time step. (Default: 100)
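The temperature and hypo_sort_key parameters can be illustrated with a small standalone sketch. It uses only the standard library and makes two assumptions that this page does not spell out: the Hypothesis tuple is laid out as (token ids, tensor, nested state, score), as the type signature suggests, and temperature divides the joint network logits before a log-softmax, which is the conventional meaning. The helper names are hypothetical, and the exact normalization used by the built-in default key may differ from the one shown.

```python
import math
from typing import Any, List, Sequence, Tuple

# Assumed layout, read off the type signature: (token ids, tensor, nested state, score).
Hypothesis = Tuple[List[int], Any, Any, float]


def length_normalized_score(hypo: Hypothesis) -> float:
    """Hypothetical sort key: score divided by the number of tokens."""
    tokens, _, _, score = hypo
    # Guard against an empty token list to avoid dividing by zero.
    return score / max(len(tokens), 1)


def log_softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Conventional temperature scaling: divide logits, then apply log-softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]
```

Under these assumptions, a key of this shape could be passed as hypo_sort_key=length_normalized_score when constructing the decoder. Raising the temperature shrinks the gap between log-probabilities, which is what "more uniform" means in the parameter description.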
- Tutorials using RNNTBeamSearch:
Online ASR with Emformer RNN-T
Device AV-ASR with Emformer RNN-T
Methods
forward
- RNNTBeamSearch.forward(input: Tensor, length: Tensor, beam_width: int) → List[Tuple[List[int], Tensor, List[List[Tensor]], float]]
Performs beam search for the given input sequence.
T: number of frames; D: feature dimension of each frame.
- Parameters:
input (torch.Tensor) – sequence of input frames, with shape (T, D) or (1, T, D).
length (torch.Tensor) – number of valid frames in input sequence, with shape () or (1,).
beam_width (int) – beam size to use during search.
- Returns:
top-beam_width hypotheses found by beam search.
- Return type:
List[Hypothesis]
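As a rough sketch of how the result of forward might be consumed, the helper below runs a full-utterance search and ranks the returned hypotheses. The function names are hypothetical, the decoder is any object that behaves like RNNTBeamSearch, and the code assumes that the first tuple element holds token ids, the last holds the score, and a higher score is better. Whether forward already returns hypotheses in ranked order is not stated here, so the sketch sorts explicitly.

```python
from typing import Any, List, Sequence, Tuple

# Assumed layout, read off the signature: (token ids, tensor, nested state, score).
Hypothesis = Tuple[List[int], Any, Any, float]


def rank_hypotheses(hypotheses: Sequence[Hypothesis]) -> List[Tuple[List[int], float]]:
    """Return (token ids, score) pairs, highest score first (assumed to be best)."""
    ranked = sorted(hypotheses, key=lambda hypo: hypo[3], reverse=True)
    return [(hypo[0], hypo[3]) for hypo in ranked]


def decode_utterance(
    decoder: Any, features: Any, length: Any, beam_width: int = 10
) -> List[Tuple[List[int], float]]:
    """Run full-utterance beam search and rank the returned hypotheses.

    `features` is expected to have shape (T, D) or (1, T, D) and
    `length` shape () or (1,), as documented for forward.
    """
    hypotheses = decoder.forward(features, length, beam_width)
    return rank_hypotheses(hypotheses)
```

The token ids in the top-ranked pair would then be mapped back to text with whatever tokenizer was used to train the model.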
infer
- RNNTBeamSearch.infer(input: Tensor, length: Tensor, beam_width: int, state: Optional[List[List[Tensor]]] = None, hypothesis: Optional[List[Tuple[List[int], Tensor, List[List[Tensor]], float]]] = None) → Tuple[List[Tuple[List[int], Tensor, List[List[Tensor]], float]], List[List[Tensor]]]
Performs beam search for the given input sequence in streaming mode.
T: number of frames; D: feature dimension of each frame.
- Parameters:
input (torch.Tensor) – sequence of input frames, with shape (T, D) or (1, T, D).
length (torch.Tensor) – number of valid frames in input sequence, with shape () or (1,).
beam_width (int) – beam size to use during search.
state (List[List[torch.Tensor]] or None, optional) – list of lists of tensors representing transcription network internal state generated in preceding invocation. (Default: None)
hypothesis (List[Hypothesis] or None) – hypotheses from preceding invocation to seed search with. (Default: None)
- Returns:
- List[Hypothesis]
top-beam_width hypotheses found by beam search.
- List[List[torch.Tensor]]
list of lists of tensors representing transcription network internal state generated in current invocation.
- Return type:
(List[Hypothesis], List[List[torch.Tensor]])
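The streaming contract described above (pass the state and hypotheses returned by one call into the next) can be sketched as a small generator. The name stream_decode and the shape of the segments iterable are hypothetical; the decoder is any object exposing infer with the signature documented here. Splitting audio into segments and extracting features are outside the scope of this sketch.

```python
from typing import Any, Iterable, Iterator, List, Optional, Tuple

# Assumed layout, read off the signature: (token ids, tensor, nested state, score).
Hypothesis = Tuple[List[int], Any, Any, float]


def stream_decode(
    decoder: Any,
    segments: Iterable[Tuple[Any, Any]],
    beam_width: int,
) -> Iterator[List[Hypothesis]]:
    """Yield the current hypotheses after each (features, length) segment."""
    state: Optional[Any] = None  # no transcription network state before the first call
    hypotheses: Optional[List[Hypothesis]] = None  # nothing to seed the first search with
    for features, length in segments:
        # Feed the outputs of the previous call back in so the search continues
        # where it left off instead of restarting on every segment.
        hypotheses, state = decoder.infer(
            features, length, beam_width, state=state, hypothesis=hypotheses
        )
        yield hypotheses
```

Each yielded list reflects everything decoded so far, so a caller can display partial transcripts while audio is still arriving.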