torchaudio.functional.compute_kaldi_pitch
- torchaudio.functional.compute_kaldi_pitch(waveform: Tensor, sample_rate: float, frame_length: float = 25.0, frame_shift: float = 10.0, min_f0: float = 50, max_f0: float = 400, soft_min_f0: float = 10.0, penalty_factor: float = 0.1, lowpass_cutoff: float = 1000, resample_frequency: float = 4000, delta_pitch: float = 0.005, nccf_ballast: float = 7000, lowpass_filter_width: int = 1, upsample_filter_width: int = 5, max_frames_latency: int = 0, frames_per_chunk: int = 0, simulate_first_pass_online: bool = False, recompute_frame: int = 500, snip_edges: bool = True) -> Tensor
Extract pitch based on the method described in A pitch extraction algorithm tuned for automatic speech recognition [Ghahremani et al., 2014].
This function computes the equivalent of compute-kaldi-pitch-feats from Kaldi.
- Parameters:
waveform (Tensor) – The input waveform of shape (…, time).
sample_rate (float) – Sample rate of waveform.
frame_length (float, optional) – Frame length in milliseconds. (default: 25.0)
frame_shift (float, optional) – Frame shift in milliseconds. (default: 10.0)
min_f0 (float, optional) – Minimum F0 to search for (Hz) (default: 50.0)
max_f0 (float, optional) – Maximum F0 to search for (Hz) (default: 400.0)
soft_min_f0 (float, optional) – Minimum F0, applied in a soft way; must not exceed min_f0. (default: 10.0)
penalty_factor (float, optional) – Cost factor for F0 change. (default: 0.1)
lowpass_cutoff (float, optional) – Cutoff frequency for the low-pass filter (Hz). (default: 1000)
resample_frequency (float, optional) – Frequency that we down-sample the signal to (Hz). Must be more than twice lowpass_cutoff. (default: 4000)
delta_pitch (float, optional) – Smallest relative change in pitch that our algorithm measures. (default: 0.005)
nccf_ballast (float, optional) – Increasing this factor reduces NCCF for quiet frames (default: 7000)
lowpass_filter_width (int, optional) – Integer that determines filter width of lowpass filter, more gives sharper filter. (default: 1)
upsample_filter_width (int, optional) – Integer that determines filter width when upsampling NCCF. (default: 5)
max_frames_latency (int, optional) – Maximum number of frames of latency that we allow pitch tracking to introduce into the feature processing (affects output only if frames_per_chunk > 0 and simulate_first_pass_online=True). (default: 0)
frames_per_chunk (int, optional) – The number of frames used for energy normalization. (default: 0)
simulate_first_pass_online (bool, optional) – If true, the function will output features that correspond to what an online decoder would see in the first pass of decoding, not the final version of the features, which is the default. Relevant if frames_per_chunk > 0. (default: False)
recompute_frame (int, optional) – Only relevant for compatibility with online pitch extraction. A non-critical parameter; the frame at which we recompute some of the forward pointers, after revising our estimate of the signal energy. Relevant if frames_per_chunk > 0. (default: 500)
snip_edges (bool, optional) – If this is set to false, the incomplete frames near the ending edge won’t be snipped, so that the number of frames is the file size divided by the frame-shift. This makes different types of features give the same number of frames. (default: True)
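Two of the parameters above carry explicit constraints: soft_min_f0 must not exceed min_f0, and resample_frequency must be more than twice lowpass_cutoff. The following is a minimal sketch of a pre-flight check for those constraints. The helper name check_pitch_options is hypothetical and not part of torchaudio, and the min_f0 < max_f0 check is an added assumption that the parameter list does not state.

```python
def check_pitch_options(min_f0=50.0, max_f0=400.0, soft_min_f0=10.0,
                        lowpass_cutoff=1000.0, resample_frequency=4000.0):
    """Return a list of violated constraints (empty if the options look consistent).

    Hypothetical helper for illustration; it is not part of torchaudio.
    """
    problems = []
    # Documented: soft_min_f0 must not exceed min_f0.
    if soft_min_f0 > min_f0:
        problems.append("soft_min_f0 must not exceed min_f0")
    # Documented: resample_frequency must be more than twice lowpass_cutoff.
    if resample_frequency <= 2 * lowpass_cutoff:
        problems.append("resample_frequency must be more than twice lowpass_cutoff")
    # Assumption (not stated in the parameter list): the F0 search range is non-empty.
    if min_f0 >= max_f0:
        problems.append("min_f0 should be below max_f0")
    return problems
```

With the default values the list is empty; lowering resample_frequency to 2000 while keeping lowpass_cutoff at 1000 reports the second constraint.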
- Returns:
Pitch feature. Shape: (batch, frames, 2) where the last dimension corresponds to pitch and NCCF.
- Return type:
Tensor
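The size of the frames dimension depends on frame_length, frame_shift and snip_edges. The sketch below estimates it under the usual Kaldi framing convention, which is an assumption here: whole windows only when snip_edges=True, and roughly the number of samples divided by the frame shift when snip_edges=False. The helper names are hypothetical, and the actual function may differ by a frame because it works on the resampled signal.

```python
def ms_to_samples(duration_ms, sample_rate):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(sample_rate * duration_ms / 1000.0))


def estimate_num_frames(num_samples, sample_rate, frame_length=25.0,
                        frame_shift=10.0, snip_edges=True):
    """Estimate the number of output frames (illustrative, not the library's code)."""
    window = ms_to_samples(frame_length, sample_rate)
    shift = ms_to_samples(frame_shift, sample_rate)
    if snip_edges:
        # Only frames that fit entirely inside the signal are kept.
        if num_samples < window:
            return 0
        return 1 + (num_samples - window) // shift
    # Incomplete frames at the end are kept: samples / shift, rounded to nearest.
    return (num_samples + shift // 2) // shift
```

For one second of 16 kHz audio with the default frame settings, this estimate gives 98 frames with snip_edges=True and 100 frames with snip_edges=False.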
- Tutorials using compute_kaldi_pitch: Audio Feature Extractions
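The second output channel is the NCCF (normalized cross-correlation function), and nccf_ballast is described above as reducing the NCCF for quiet frames. The sketch below illustrates that effect in plain Python. It is a simplified illustration, not the Kaldi implementation: the ballast is passed as a raw additive constant in the denominator (an assumption; Kaldi scales this term internally), and the function name nccf is hypothetical.

```python
import math


def nccf(frame, lag, window, ballast=0.0):
    """Normalized cross-correlation of frame[0:window] with frame[lag:lag+window].

    Simplified illustration: `ballast` is added to the energy product under the
    square root, so it dominates when the frame energy is small.
    """
    a = frame[:window]
    b = frame[lag:lag + window]
    if len(a) < window or len(b) < window:
        raise ValueError("frame too short for this lag and window")
    dot = sum(x * y for x, y in zip(a, b))
    energy = sum(x * x for x in a) * sum(y * y for y in b)
    denom = math.sqrt(energy + ballast)
    return dot / denom if denom > 0 else 0.0


# A sine with a period of 40 samples, at full and at very low amplitude.
loud = [math.sin(2 * math.pi * n / 40) for n in range(200)]
quiet = [0.001 * x for x in loud]
```

Without ballast, both signals give an NCCF close to 1 at a lag of one period (40 samples). With a non-zero ballast, the value for the loud signal barely changes while the value for the quiet signal drops toward 0, which is the behavior the nccf_ballast description refers to.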