
torchaudio.functional

Functions to perform common audio operations.

istft

torchaudio.functional.istft(stft_matrix: torch.Tensor, n_fft: int, hop_length: Optional[int] = None, win_length: Optional[int] = None, window: Optional[torch.Tensor] = None, center: bool = True, pad_mode: Optional[str] = None, normalized: bool = False, onesided: bool = True, length: Optional[int] = None) → torch.Tensor[source]

Inverse short-time Fourier transform. This is expected to be the inverse of torch.stft. It has the same parameters (plus an additional optional parameter, length) and should return the least squares estimate of the original signal. The algorithm checks the NOLA (nonzero overlap-add) condition.

The window and center parameters must be chosen so that the envelope created by the summation of all the windows is never zero at any point in time. Specifically, \(\sum_{t=-\infty}^{\infty} w^2[n-t\times hop\_length] \neq 0\).
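
The envelope condition can be checked numerically for a given window and hop. The following is a minimal sketch in plain Python, not the check that istft performs internally; the helper name `nola_satisfied` and the tolerance `tol` are assumptions made for illustration.

```python
import math


def nola_satisfied(window, hop_length, tol=1e-11):
    """Return True if the squared window, overlap-added at every hop, is never ~0.

    Hypothetical helper: sums w^2[n - t * hop_length] over all shifts t for
    each position n within one hop period, which covers every distinct value
    of the steady-state envelope.
    """
    win_length = len(window)
    for n in range(hop_length):
        # Positions n, n + hop, n + 2 * hop, ... contribute to the same envelope sample.
        envelope = sum(window[k] ** 2 for k in range(n, win_length, hop_length))
        if envelope <= tol:
            return False
    return True


# Periodic Hann window of length 8.
hann = [0.5 - 0.5 * math.cos(2 * math.pi * k / 8) for k in range(8)]
ok_overlap = nola_satisfied(hann, hop_length=2)    # 75% overlap
bad_overlap = nola_satisfied(hann, hop_length=16)  # hop longer than the window leaves gaps
```

With the Hann window above, a hop of 2 keeps the envelope strictly positive, while a hop larger than the window leaves samples that no window covers.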

Since stft discards elements at the end of the signal if they do not fit in a frame, the istft may return a shorter signal than the original signal (can occur if center is False since the signal isn’t padded).

If center is True, then there will be padding, e.g. ‘constant’, ‘reflect’, etc. Left padding can be trimmed off exactly because it can be calculated, but right padding cannot be calculated without additional information.

Example: Suppose the last window is: [17, 18, 0, 0, 0] vs [18, 0, 0, 0, 0]

The n_frame, hop_length, win_length are all the same which prevents the calculation of right padding. These additional values could be zeros or a reflection of the signal so providing length could be useful. If length is None then padding will be aggressively removed (some loss of signal).

[1] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. ASSP, vol.32, no.2, pp.236-243, Apr. 1984.

Parameters
  • stft_matrix (Tensor) – Output of stft where each row of a channel is a frequency and each column is a window. It has a size of (…, fft_size, n_frame, 2)

  • n_fft (int) – Size of Fourier transform

  • hop_length (int or None, optional) – The distance between neighboring sliding window frames. (Default: win_length // 4)

  • win_length (int or None, optional) – The size of window frame and STFT filter. (Default: n_fft)

  • window (Tensor or None, optional) – The optional window function. (Default: torch.ones(win_length))

  • center (bool, optional) – Whether input was padded on both sides so that the \(t\)-th frame is centered at time \(t \times \text{hop\_length}\). (Default: True)

  • pad_mode – This argument is ignored and will be removed.

  • normalized (bool, optional) – Whether the STFT was normalized. (Default: False)

  • onesided (bool, optional) – Whether the STFT is onesided. (Default: True)

  • length (int or None, optional) – The amount to trim the signal by (i.e. the original signal length). (Default: whole signal)

Returns

Least squares estimation of the original signal of size (…, signal_length)

Return type

Tensor

spectrogram

torchaudio.functional.spectrogram(waveform: torch.Tensor, pad: int, window: torch.Tensor, n_fft: int, hop_length: int, win_length: int, power: Optional[float], normalized: bool) → torch.Tensor[source]

Create a spectrogram or a batch of spectrograms from a raw audio signal. The spectrogram can be either magnitude-only or complex.

Parameters
  • waveform (Tensor) – Tensor of audio of dimension (…, time)

  • pad (int) – Two sided padding of signal

  • window (Tensor) – Window tensor that is applied/multiplied to each frame/window

  • n_fft (int) – Size of FFT

  • hop_length (int) – Length of hop between STFT windows

  • win_length (int) – Window size

  • power (float or None) – Exponent for the magnitude spectrogram, (must be > 0) e.g., 1 for energy, 2 for power, etc. If None, then the complex spectrum is returned instead.

  • normalized (bool) – Whether to normalize by magnitude after stft

Returns

Dimension (…, freq, time), freq is n_fft // 2 + 1 and n_fft is the number of Fourier bins, and time is the number of window hops (n_frame).

Return type

Tensor
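
As a rough sanity check on the returned dimensions, the sketch below computes the expected (freq, time) sizes. It assumes the underlying STFT is centered, so that the frame count is 1 + padded_length // hop_length; that centering behaviour is not stated in this section, and `spectrogram_shape` is a hypothetical helper, not part of torchaudio.

```python
def spectrogram_shape(num_samples, pad, n_fft, hop_length):
    """Expected (freq, time) sizes of a one-sided spectrogram.

    Assumption: the STFT is centered (each frame is centered on a multiple of
    hop_length), so the frame count depends only on the padded length and hop.
    """
    padded_length = num_samples + 2 * pad  # `pad` is applied on both sides
    freq = n_fft // 2 + 1                  # one-sided frequency bins
    n_frame = 1 + padded_length // hop_length
    return freq, n_frame


shape = spectrogram_shape(num_samples=16000, pad=0, n_fft=400, hop_length=200)
```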

amplitude_to_DB

torchaudio.functional.amplitude_to_DB(x: torch.Tensor, multiplier: float, amin: float, db_multiplier: float, top_db: Optional[float] = None) → torch.Tensor[source]

Turn a tensor from the power/amplitude scale to the decibel scale.

This output depends on the maximum value in the input tensor, and so may return different values for an audio clip split into snippets vs. a full clip.

Parameters
  • x (Tensor) – Input tensor before being converted to decibel scale

  • multiplier (float) – Use 10. for power and 20. for amplitude

  • amin (float) – Number to clamp x

  • db_multiplier (float) – Log10(max(reference value, amin))

  • top_db (float or None, optional) – Minimum negative cut-off in decibels. A reasonable number is 80. (Default: None)

Returns

Output tensor in decibel scale

Return type

Tensor
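
The dependence on the maximum value comes from top_db, which is applied relative to the loudest entry. The following plain-Python sketch illustrates the conversion on a flat list; it is an illustration under that reading of the parameters, and `to_decibels` is a hypothetical name rather than the library function.

```python
import math


def to_decibels(values, multiplier, amin, db_multiplier, top_db=None):
    """Convert a flat list of power/amplitude values to decibels (sketch)."""
    # Clamp from below so log10 never sees zero, then subtract the reference level.
    db = [multiplier * math.log10(max(v, amin)) - multiplier * db_multiplier
          for v in values]
    if top_db is not None:
        # Values more than top_db below the loudest entry are raised to that floor.
        floor = max(db) - top_db
        db = [max(d, floor) for d in db]
    return db


power = [1.0, 0.1, 1e-12]
# Reference value 1.0, so db_multiplier = log10(max(1.0, amin)) = 0.
db = to_decibels(power, multiplier=10.0, amin=1e-10, db_multiplier=0.0, top_db=80.0)
```

Here the smallest value is first clamped to amin and then lifted to 80 dB below the maximum, so removing the loudest entry from the input would change the result for the quietest one.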

create_fb_matrix

torchaudio.functional.create_fb_matrix(n_freqs: int, f_min: float, f_max: float, n_mels: int, sample_rate: int, norm: Optional[str] = None) → torch.Tensor[source]

Create a frequency bin conversion matrix.

Parameters
  • n_freqs (int) – Number of frequencies to highlight/apply

  • f_min (float) – Minimum frequency (Hz)

  • f_max (float) – Maximum frequency (Hz)

  • n_mels (int) – Number of mel filterbanks

  • sample_rate (int) – Sample rate of the audio waveform

  • norm (Optional[str]) – If ‘slaney’, divide the triangular mel weights by the width of the mel band (area normalization). (Default: None)

Returns

Triangular filter banks (fb matrix) of size (n_freqs, n_mels), i.e. the number of frequencies to highlight/apply by the number of filterbanks. Each column is a filterbank, so given a matrix A of size (…, n_freqs), the applied result would be A * create_fb_matrix(A.size(-1), ...).

Return type

Tensor

create_dct

torchaudio.functional.create_dct(n_mfcc: int, n_mels: int, norm: Optional[str]) → torch.Tensor[source]

Create a DCT transformation matrix with shape (n_mels, n_mfcc), normalized depending on norm.

Parameters
  • n_mfcc (int) – Number of mfc coefficients to retain

  • n_mels (int) – Number of mel filterbanks

  • norm (str or None) – Norm to use (either ‘ortho’ or None)

Returns

The transformation matrix, to be right-multiplied to row-wise data of size (n_mels, n_mfcc).

Return type

Tensor
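
For intuition, a matrix of this shape can be built by hand. The sketch below assumes the DCT-II convention with the scaling commonly used for MFCCs; the section itself does not state the formula, so treat both the formula and the helper name `dct_matrix` as assumptions.

```python
import math


def dct_matrix(n_mfcc, n_mels, norm=None):
    """Build a DCT-II matrix of shape (n_mels, n_mfcc) as nested lists (sketch)."""
    rows = []
    for n in range(n_mels):
        row = []
        for k in range(n_mfcc):
            value = math.cos(math.pi / n_mels * (n + 0.5) * k)
            if norm == "ortho":
                # Orthonormal scaling; the k = 0 basis gets an extra 1/sqrt(2).
                value *= math.sqrt(2.0 / n_mels)
                if k == 0:
                    value *= 1.0 / math.sqrt(2.0)
            else:
                value *= 2.0
            row.append(value)
        rows.append(row)
    return rows


dct = dct_matrix(n_mfcc=4, n_mels=4, norm="ortho")
# With norm="ortho" and n_mfcc == n_mels the columns are orthonormal,
# so this Gram matrix is (numerically) the identity.
gram = [[sum(dct[n][i] * dct[n][j] for n in range(4)) for j in range(4)]
        for i in range(4)]
```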

mu_law_encoding

torchaudio.functional.mu_law_encoding(x: torch.Tensor, quantization_channels: int) → torch.Tensor[source]

Encode signal based on mu-law companding. For more info see the Wikipedia Entry

This algorithm assumes the signal has been scaled to between -1 and 1 and returns a signal encoded with values from 0 to quantization_channels - 1.

Parameters
  • x (Tensor) – Input tensor

  • quantization_channels (int) – Number of channels

Returns

Input after mu-law encoding

Return type

Tensor

mu_law_decoding

torchaudio.functional.mu_law_decoding(x_mu: torch.Tensor, quantization_channels: int) → torch.Tensor[source]

Decode mu-law encoded signal. For more info see the Wikipedia Entry

This expects an input with values between 0 and quantization_channels - 1 and returns a signal scaled between -1 and 1.

Parameters
  • x_mu (Tensor) – Input tensor

  • quantization_channels (int) – Number of channels

Returns

Input after mu-law decoding

Return type

Tensor
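
The encode/decode pair can be illustrated on single samples. This is a minimal sketch of standard mu-law companding in plain Python, assuming mu = quantization_channels - 1; the function names are hypothetical and operate on floats rather than tensors.

```python
import math


def mu_law_encode(x, quantization_channels=256):
    """Map a sample in [-1, 1] to an integer in [0, quantization_channels - 1]."""
    mu = quantization_channels - 1
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest level.
    return int((compressed + 1) / 2 * mu + 0.5)


def mu_law_decode(x_mu, quantization_channels=256):
    """Map an integer in [0, quantization_channels - 1] back to [-1, 1]."""
    mu = quantization_channels - 1
    y = x_mu / mu * 2 - 1
    return math.copysign((math.exp(abs(y) * math.log1p(mu)) - 1) / mu, y)


samples = [-1.0, -0.25, 0.0, 0.25, 1.0]
codes = [mu_law_encode(s) for s in samples]
restored = [mu_law_decode(c) for c in codes]
```

The round trip is lossy: decoded values land close to the originals, with finer resolution near zero than near full scale.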

complex_norm

torchaudio.functional.complex_norm(complex_tensor: torch.Tensor, power: float = 1.0) → torch.Tensor[source]

Compute the norm of complex tensor input.

Parameters
  • complex_tensor (Tensor) – Tensor shape of (…, complex=2)

  • power (float) – Power of the norm. (Default: 1.0).

Returns

Power of the normed input tensor. Shape of (…, )

Return type

Tensor

angle

torchaudio.functional.angle(complex_tensor: torch.Tensor) → torch.Tensor[source]

Compute the angle of complex tensor input.

Parameters

complex_tensor (Tensor) – Tensor shape of (…, complex=2)

Returns

Angle of a complex tensor. Shape of (…, )

Return type

Tensor

magphase

torchaudio.functional.magphase(complex_tensor: torch.Tensor, power: float = 1.0) → Tuple[torch.Tensor, torch.Tensor][source]

Separate a complex-valued spectrogram with shape (…, 2) into its magnitude and phase.

Parameters
  • complex_tensor (Tensor) – Tensor shape of (…, complex=2)

  • power (float) – Power of the norm. (Default: 1.0)

Returns

The magnitude and phase of the complex tensor

Return type

(Tensor, Tensor)
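
The relationship between complex_norm, angle and magphase can be sketched on plain (real, imag) pairs. The helper below is illustrative only and assumes the magnitude is the Euclidean norm raised to power and the phase is the four-quadrant arctangent; `split_magphase` is a hypothetical name.

```python
import math


def split_magphase(pairs, power=1.0):
    """Split (real, imag) pairs into magnitudes raised to `power` and phase angles."""
    magnitudes = [math.hypot(re, im) ** power for re, im in pairs]
    phases = [math.atan2(im, re) for re, im in pairs]
    return magnitudes, phases


# Three complex values stored as (real, imag): 3+4j, 1j and -2.
mag, phase = split_magphase([(3.0, 4.0), (0.0, 1.0), (-2.0, 0.0)])
power_mag, _ = split_magphase([(3.0, 4.0)], power=2.0)
```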

phase_vocoder

torchaudio.functional.phase_vocoder(complex_specgrams: torch.Tensor, rate: float, phase_advance: torch.Tensor) → torch.Tensor[source]

Given a STFT tensor, speed up in time without modifying pitch by a factor of rate.

Parameters
  • complex_specgrams (Tensor) – Dimension of (…, freq, time, complex=2)

  • rate (float) – Speed-up factor

  • phase_advance (Tensor) – Expected phase advance in each bin. Dimension of (freq, 1)

Returns

Stretched complex spectrograms with dimension of (…, freq, ceil(time/rate), complex=2)

Return type

Tensor

Example
>>> import math
>>> import torch
>>> from torchaudio.functional import phase_vocoder
>>> freq, hop_length = 1025, 512
>>> # (channel, freq, time, complex=2)
>>> complex_specgrams = torch.randn(2, freq, 300, 2)
>>> rate = 1.3 # Speed up by 30%
>>> phase_advance = torch.linspace(
...    0, math.pi * hop_length, freq)[..., None]
>>> x = phase_vocoder(complex_specgrams, rate, phase_advance)
>>> x.shape # with 231 == ceil(300 / 1.3)
torch.Size([2, 1025, 231, 2])

lfilter

torchaudio.functional.lfilter(waveform: torch.Tensor, a_coeffs: torch.Tensor, b_coeffs: torch.Tensor, clamp: bool = True) → torch.Tensor[source]

Apply an IIR filter by evaluating the difference equation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (..., time). Must be normalized to -1 to 1.

  • a_coeffs (Tensor) – denominator coefficients of the difference equation, of dimension (n_order + 1). Lower-delay coefficients come first, e.g. [a0, a1, a2, ...]. Must be the same size as b_coeffs (pad with 0’s as necessary).

  • b_coeffs (Tensor) – numerator coefficients of the difference equation, of dimension (n_order + 1). Lower-delay coefficients come first, e.g. [b0, b1, b2, ...]. Must be the same size as a_coeffs (pad with 0’s as necessary).

  • clamp (bool, optional) – If True, clamp the output signal to be in the range [-1, 1] (Default: True)

Returns

Waveform with dimension of (..., time).

Return type

Tensor
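
The difference equation can be written out directly for a single channel. The sketch below is a plain-Python illustration that assumes zero initial conditions and clamps only the final output; `iir_filter` is a hypothetical name, not the library implementation.

```python
def iir_filter(waveform, a_coeffs, b_coeffs, clamp=True):
    """Filter a list of samples with y[n] = (sum b_k x[n-k] - sum a_k y[n-k]) / a_0."""
    if len(a_coeffs) != len(b_coeffs):
        raise ValueError("a_coeffs and b_coeffs must have the same length")
    output = []
    for n in range(len(waveform)):
        acc = 0.0
        for k, b in enumerate(b_coeffs):
            if n - k >= 0:
                acc += b * waveform[n - k]   # feed-forward (numerator) terms
        for k, a in enumerate(a_coeffs):
            if k >= 1 and n - k >= 0:
                acc -= a * output[n - k]     # feedback (denominator) terms
        output.append(acc / a_coeffs[0])
    if clamp:
        output = [min(1.0, max(-1.0, y)) for y in output]
    return output


# One-pole smoother y[n] = 0.5 * x[n] + 0.5 * y[n-1] applied to a unit step.
# b_coeffs is padded with a zero so both coefficient lists have the same size.
step_response = iir_filter([1.0, 1.0, 1.0, 1.0], a_coeffs=[1.0, -0.5], b_coeffs=[0.5, 0.0])
```

The step response rises towards 1 without overshooting, so the clamp has no effect in this example.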

biquad

torchaudio.functional.biquad(waveform: torch.Tensor, b0: float, b1: float, b2: float, a0: float, a1: float, a2: float) → torch.Tensor[source]

Perform a biquad filter of input tensor. Initial conditions set to 0. https://en.wikipedia.org/wiki/Digital_biquad_filter

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • b0 (float) – numerator coefficient of current input, x[n]

  • b1 (float) – numerator coefficient of input one time step ago x[n-1]

  • b2 (float) – numerator coefficient of input two time steps ago x[n-2]

  • a0 (float) – denominator coefficient of current output y[n], typically 1

  • a1 (float) – denominator coefficient of output one time step ago y[n-1]

  • a2 (float) – denominator coefficient of output two time steps ago y[n-2]

Returns

Waveform with dimension of (…, time)

Return type

Tensor

lowpass_biquad

torchaudio.functional.lowpass_biquad(waveform: torch.Tensor, sample_rate: int, cutoff_freq: float, Q: float = 0.707) → torch.Tensor[source]

Design biquad lowpass filter and perform filtering. Similar to SoX implementation.

Parameters
  • waveform (torch.Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • cutoff_freq (float) – filter cutoff frequency

  • Q (float, optional) – https://en.wikipedia.org/wiki/Q_factor (Default: 0.707)

Returns

Waveform of dimension of (…, time)

Return type

Tensor
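
To make the design step concrete, the sketch below derives low-pass biquad coefficients from sample_rate, cutoff_freq and Q. It assumes the Audio EQ Cookbook formulas (the W3C page referenced elsewhere on this page); whether the library uses exactly these expressions is not stated here, and `lowpass_coeffs` is a hypothetical helper.

```python
import math


def lowpass_coeffs(sample_rate, cutoff_freq, Q=0.707):
    """Return (b0, b1, b2, a0, a1, a2) for a cookbook-style biquad low-pass."""
    w0 = 2 * math.pi * cutoff_freq / sample_rate
    alpha = math.sin(w0) / (2 * Q)
    cos_w0 = math.cos(w0)
    b0 = (1 - cos_w0) / 2
    b1 = 1 - cos_w0
    b2 = b0
    a0 = 1 + alpha
    a1 = -2 * cos_w0
    a2 = 1 - alpha
    return b0, b1, b2, a0, a1, a2


b0, b1, b2, a0, a1, a2 = lowpass_coeffs(sample_rate=44100, cutoff_freq=1000.0)
# Gain at 0 Hz (z = 1) and at the Nyquist frequency (z = -1).
dc_gain = (b0 + b1 + b2) / (a0 + a1 + a2)
nyquist_gain = (b0 - b1 + b2) / (a0 - a1 + a2)
```

A low-pass design should pass DC unchanged and fully reject the Nyquist frequency, which is what the two gains above evaluate. The six values are in the order expected by the biquad function.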

highpass_biquad

torchaudio.functional.highpass_biquad(waveform: torch.Tensor, sample_rate: int, cutoff_freq: float, Q: float = 0.707) → torch.Tensor[source]

Design biquad highpass filter and perform filtering. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • cutoff_freq (float) – filter cutoff frequency

  • Q (float, optional) – https://en.wikipedia.org/wiki/Q_factor (Default: 0.707)

Returns

Waveform dimension of (…, time)

Return type

Tensor

allpass_biquad

torchaudio.functional.allpass_biquad(waveform: torch.Tensor, sample_rate: int, central_freq: float, Q: float = 0.707) → torch.Tensor[source]

Design two-pole all-pass filter. Similar to SoX implementation.

Parameters
  • waveform (torch.Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • central_freq (float) – central frequency (in Hz)

  • Q (float, optional) – https://en.wikipedia.org/wiki/Q_factor (Default: 0.707)

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

https://www.w3.org/2011/audio/audio-eq-cookbook.html#APF

equalizer_biquad

torchaudio.functional.equalizer_biquad(waveform: torch.Tensor, sample_rate: int, center_freq: float, gain: float, Q: float = 0.707) → torch.Tensor[source]

Design biquad peaking equalizer filter and perform filtering. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • center_freq (float) – filter’s central frequency

  • gain (float) – desired gain at the boost (or attenuation) in dB

  • Q (float, optional) – https://en.wikipedia.org/wiki/Q_factor (Default: 0.707)

Returns

Waveform of dimension of (…, time)

Return type

Tensor

bandpass_biquad

torchaudio.functional.bandpass_biquad(waveform: torch.Tensor, sample_rate: int, central_freq: float, Q: float = 0.707, const_skirt_gain: bool = False) → torch.Tensor[source]

Design two-pole band-pass filter. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • central_freq (float) – central frequency (in Hz)

  • Q (float, optional) – https://en.wikipedia.org/wiki/Q_factor (Default: 0.707)

  • const_skirt_gain (bool, optional) – If True, uses a constant skirt gain (peak gain = Q). If False, uses a constant 0dB peak gain. (Default: False)

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

https://www.w3.org/2011/audio/audio-eq-cookbook.html#APF

bandreject_biquad

torchaudio.functional.bandreject_biquad(waveform: torch.Tensor, sample_rate: int, central_freq: float, Q: float = 0.707) → torch.Tensor[source]

Design two-pole band-reject filter. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • central_freq (float) – central frequency (in Hz)

  • Q (float, optional) – https://en.wikipedia.org/wiki/Q_factor (Default: 0.707)

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

https://www.w3.org/2011/audio/audio-eq-cookbook.html#APF

band_biquad

torchaudio.functional.band_biquad(waveform: torch.Tensor, sample_rate: int, central_freq: float, Q: float = 0.707, noise: bool = False) → torch.Tensor[source]

Design two-pole band filter. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • central_freq (float) – central frequency (in Hz)

  • Q (float, optional) – https://en.wikipedia.org/wiki/Q_factor (Default: 0.707).

  • noise (bool, optional) – If True, uses the alternate mode for un-pitched audio (e.g. percussion). If False, uses mode oriented to pitched audio, i.e. voice, singing, or instrumental music (Default: False).

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

https://www.w3.org/2011/audio/audio-eq-cookbook.html#APF

treble_biquad

torchaudio.functional.treble_biquad(waveform: torch.Tensor, sample_rate: int, gain: float, central_freq: float = 3000, Q: float = 0.707) → torch.Tensor[source]

Design a treble tone-control effect. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • gain (float) – desired gain at the boost (or attenuation) in dB.

  • central_freq (float, optional) – central frequency (in Hz). (Default: 3000)

  • Q (float, optional) – https://en.wikipedia.org/wiki/Q_factor (Default: 0.707).

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

https://www.w3.org/2011/audio/audio-eq-cookbook.html#APF

bass_biquad

torchaudio.functional.bass_biquad(waveform: torch.Tensor, sample_rate: int, gain: float, central_freq: float = 100, Q: float = 0.707) → torch.Tensor[source]

Design a bass tone-control effect. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • gain (float) – desired gain at the boost (or attenuation) in dB.

  • central_freq (float, optional) – central frequency (in Hz). (Default: 100)

  • Q (float, optional) – https://en.wikipedia.org/wiki/Q_factor (Default: 0.707).

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

https://www.w3.org/2011/audio/audio-eq-cookbook.html#APF

deemph_biquad

torchaudio.functional.deemph_biquad(waveform: torch.Tensor, sample_rate: int) → torch.Tensor[source]

Apply ISO 908 CD de-emphasis (shelving) IIR filter. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform. Allowed sample rates in Hz: 44100 or 48000

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

https://www.w3.org/2011/audio/audio-eq-cookbook.html#APF

riaa_biquad

torchaudio.functional.riaa_biquad(waveform: torch.Tensor, sample_rate: int) → torch.Tensor[source]

Apply RIAA vinyl playback equalisation. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz). Allowed sample rates in Hz: 44100, 48000, 88200, 96000

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

https://www.w3.org/2011/audio/audio-eq-cookbook.html#APF

contrast

torchaudio.functional.contrast(waveform: torch.Tensor, enhancement_amount: float = 75.0) → torch.Tensor[source]

Apply contrast effect. Similar to SoX implementation. Comparable to compression, this effect modifies an audio signal to make it sound louder.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • enhancement_amount (float) – controls the amount of the enhancement. Allowed range of values for enhancement_amount: 0-100. Note that enhancement_amount = 0 still gives a significant contrast enhancement

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html
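
The note that enhancement_amount = 0 still enhances contrast can be seen from the shape of the transfer curve. The sketch below assumes the SoX-style curve sin(x·π/2 + c·sin(2πx)) with c = enhancement_amount / 750; this formula is taken from SoX rather than from this page, and `contrast_curve` is a hypothetical helper.

```python
import math


def contrast_curve(samples, enhancement_amount=75.0):
    """Apply a SoX-style contrast curve to samples in [-1, 1] (sketch)."""
    if not 0 <= enhancement_amount <= 100:
        raise ValueError("enhancement_amount must be in the range 0-100")
    strength = enhancement_amount / 750.0
    output = []
    for x in samples:
        angle = x * math.pi / 2
        output.append(math.sin(angle + strength * math.sin(angle * 4)))
    return output


enhanced = contrast_curve([0.0, 0.1, 0.5, 1.0])
# Even with no extra enhancement, the sine mapping lifts quiet samples.
baseline = contrast_curve([0.1], enhancement_amount=0.0)
```

Under this assumption a quiet sample such as 0.1 comes out larger in both cases, while 0 and full scale stay where they are.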

dcshift

torchaudio.functional.dcshift(waveform: torch.Tensor, shift: float, limiter_gain: Optional[float] = None) → torch.Tensor[source]

Apply a DC shift to the audio. Similar to SoX implementation. This can be useful to remove a DC offset (caused perhaps by a hardware problem in the recording chain) from the audio.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • shift (float) – indicates the amount to shift the audio. Allowed range of values for shift: -2.0 to +2.0

  • limiter_gain (float) – Used only on peaks to prevent clipping. It should have a value much less than 1 (e.g. 0.05 or 0.02)

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html
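
Removing a DC offset amounts to estimating the offset and shifting by its negative. The sketch below models only the case without a limiter, assuming the shifted signal is simply clipped to [-1, 1]; the limiter behaviour is not modelled, and `dc_shift_no_limiter` is a hypothetical helper.

```python
def dc_shift_no_limiter(samples, shift):
    """Add a constant offset and clip to [-1, 1]; no limiter is modelled here."""
    if not -2.0 <= shift <= 2.0:
        raise ValueError("shift must be in the range -2.0 to +2.0")
    return [min(1.0, max(-1.0, x + shift)) for x in samples]


recording = [0.375, 0.125, 0.375, 0.125]   # oscillates around an offset of 0.25
offset = sum(recording) / len(recording)   # estimate the DC component
centered = dc_shift_no_limiter(recording, shift=-offset)
```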

overdrive

torchaudio.functional.overdrive(waveform: torch.Tensor, gain: float = 20, colour: float = 20) → torch.Tensor[source]

Apply an overdrive effect to the audio. Similar to SoX implementation. This effect applies a nonlinear distortion to the audio signal.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • gain (float) – desired gain at the boost (or attenuation) in dB. Allowed range of values: 0 to 100

  • colour (float) – controls the amount of even harmonic content in the over-driven output. Allowed range of values: 0 to 100

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

phaser

torchaudio.functional.phaser(waveform: torch.Tensor, sample_rate: int, gain_in: float = 0.4, gain_out: float = 0.74, delay_ms: float = 3.0, decay: float = 0.4, mod_speed: float = 0.5, sinusoidal: bool = True) → torch.Tensor[source]

Apply a phasing effect to the audio. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, time)

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • gain_in (float) – desired input gain at the boost (or attenuation) in dB. Allowed range of values: 0 to 1

  • gain_out (float) – desired output gain at the boost (or attenuation) in dB. Allowed range of values: 0 to 1e9

  • delay_ms (float) – desired delay in milliseconds. Allowed range of values: 0 to 5.0

  • decay (float) – desired decay relative to gain-in. Allowed range of values: 0 to 0.99

  • mod_speed (float) – modulation speed in Hz. Allowed range of values: 0.1 to 2

  • sinusoidal (bool) – If True, uses sinusoidal modulation (preferable for multiple instruments). If False, uses triangular modulation (gives single instruments a sharper phasing effect). (Default: True)

Returns

Waveform of dimension of (…, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

Scott Lehman, Effects Explained, http://harmony-central.com/Effects/effects-explained.html

flanger

torchaudio.functional.flanger(waveform: torch.Tensor, sample_rate: int, delay: float = 0.0, depth: float = 2.0, regen: float = 0.0, width: float = 71.0, speed: float = 0.5, phase: float = 25.0, modulation: str = 'sinusoidal', interpolation: str = 'linear') → torch.Tensor[source]

Apply a flanger effect to the audio. Similar to SoX implementation.

Parameters
  • waveform (Tensor) – audio waveform of dimension of (…, channel, time). Max 4 channels allowed

  • sample_rate (int) – sampling rate of the waveform, e.g. 44100 (Hz)

  • delay (float) – desired delay in milliseconds (ms). Allowed range of values: 0 to 30

  • depth (float) – desired delay depth in milliseconds (ms). Allowed range of values: 0 to 10

  • regen (float) – desired regen (feedback gain) in dB. Allowed range of values: -95 to 95

  • width (float) – desired width (delay gain) in dB. Allowed range of values: 0 to 100

  • speed (float) – modulation speed in Hz. Allowed range of values: 0.1 to 10

  • phase (float) – percentage phase-shift for multi-channel. Allowed range of values: 0 to 100

  • modulation (str) – Use either “sinusoidal” or “triangular” modulation. (Default: sinusoidal)

  • interpolation (str) – Use either “linear” or “quadratic” for delay-line interpolation. (Default: linear)

Returns

Waveform of dimension of (…, channel, time)

Return type

Tensor

References

http://sox.sourceforge.net/sox.html

Scott Lehman, Effects Explained, https://web.archive.org/web/20051125072557/http://www.harmony-central.com/Effects/effects-explained.html

mask_along_axis

torchaudio.functional.mask_along_axis(specgram: torch.Tensor, mask_param: int, mask_value: float, axis: int) → torch.Tensor[source]

Apply a mask along axis. Mask will be applied from indices [v_0, v_0 + v), where v is sampled from uniform(0, mask_param), and v_0 from uniform(0, max_v - v). All examples will have the same mask interval.

Parameters
  • specgram (Tensor) – Real spectrogram (channel, freq, time)

  • mask_param (int) – Number of columns to be masked will be uniformly sampled from [0, mask_param]

  • mask_value (float) – Value to assign to the masked columns

  • axis (int) – Axis to apply masking on (1 -> frequency, 2 -> time)

Returns

Masked spectrogram of dimensions (channel, freq, time)

Return type

Tensor

mask_along_axis_iid

torchaudio.functional.mask_along_axis_iid(specgrams: torch.Tensor, mask_param: int, mask_value: float, axis: int) → torch.Tensor[source]

Apply a mask along axis. Mask will be applied from indices [v_0, v_0 + v), where v is sampled from uniform(0, mask_param), and v_0 from uniform(0, max_v - v).

Parameters
  • specgrams (Tensor) – Real spectrograms (batch, channel, freq, time)

  • mask_param (int) – Number of columns to be masked will be uniformly sampled from [0, mask_param]

  • mask_value (float) – Value to assign to the masked columns

  • axis (int) – Axis to apply masking on (2 -> frequency, 3 -> time)

Returns

Masked spectrograms of dimensions (batch, channel, freq, time)

Return type

Tensor
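
The sampling of the mask interval described for mask_along_axis and mask_along_axis_iid can be sketched on a single (freq, time) grid. This is a simplified plain-Python illustration: the channel and batch dimensions are dropped, so axis 0 means frequency and axis 1 means time here, and `mask_band` and its `rng` argument are hypothetical.

```python
import random


def mask_band(specgram, mask_param, mask_value, axis, rng=random):
    """Mask a random band of a (freq, time) nested list; axis 0 = freq, 1 = time."""
    n_freq, n_time = len(specgram), len(specgram[0])
    max_v = n_freq if axis == 0 else n_time
    v = rng.uniform(0, mask_param)          # width of the mask
    v_0 = rng.uniform(0, max_v - v)         # start of the mask
    start = int(v_0)
    end = start + int(v)                    # masked indices are [start, end)
    masked = [row[:] for row in specgram]   # copy so the input is left untouched
    for f in range(n_freq):
        for t in range(n_time):
            index = f if axis == 0 else t
            if start <= index < end:
                masked[f][t] = mask_value
    return masked


rng = random.Random(0)
spec = [[1.0] * 10 for _ in range(6)]
masked = mask_band(spec, mask_param=4, mask_value=0.0, axis=1, rng=rng)
masked_columns = [t for t in range(10) if masked[0][t] == 0.0]
```

In the iid variant, each example in the batch would draw its own v and v_0 instead of sharing one interval.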

compute_deltas

torchaudio.functional.compute_deltas(specgram: torch.Tensor, win_length: int = 5, mode: str = 'replicate') → torch.Tensor[source]

Compute delta coefficients of a tensor, usually a spectrogram:

\[d_t = \frac{\sum_{n=1}^{\text{N}} n (c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{\text{N}} n^2} \]

where \(d_t\) is the delta at time \(t\), \(c_t\) is the spectrogram coefficient at time \(t\), and \(N\) is (win_length-1)//2.

Parameters
  • specgram (Tensor) – Tensor of audio of dimension (…, freq, time)

  • win_length (int, optional) – The window length used for computing delta (Default: 5)

  • mode (str, optional) – Mode parameter passed to padding (Default: "replicate")

Returns

Tensor of deltas of dimension (…, freq, time)

Return type

Tensor
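
The delta formula can be evaluated by hand on one coefficient track. The sketch below is a plain-Python illustration that assumes "replicate" padding means reusing the edge value for out-of-range indices; `delta_track` is a hypothetical helper working on a list over time.

```python
def delta_track(coeffs, win_length=5):
    """Delta of one coefficient track (a list over time) with edge replication."""
    n_max = (win_length - 1) // 2
    denom = 2 * sum(n * n for n in range(1, n_max + 1))
    last = len(coeffs) - 1

    def c(t):
        # "replicate" padding: indices past either end reuse the edge value.
        return coeffs[min(max(t, 0), last)]

    return [
        sum(n * (c(t + n) - c(t - n)) for n in range(1, n_max + 1)) / denom
        for t in range(len(coeffs))
    ]


ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
deltas = delta_track(ramp)
```

For a ramp with slope 1 the interior deltas equal the slope, while the values near the edges are smaller because of the replicated padding.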

Example
>>> specgram = torch.randn(1, 40, 1000)
>>> delta = compute_deltas(specgram)
>>> delta2 = compute_deltas(delta)

detect_pitch_frequency

torchaudio.functional.detect_pitch_frequency(waveform: torch.Tensor, sample_rate: int, frame_time: float = 0.01, win_length: int = 30, freq_low: int = 85, freq_high: int = 3400) → torch.Tensor[source]

Detect pitch frequency.

It is implemented using normalized cross-correlation function and median smoothing.

Parameters
  • waveform (Tensor) – Tensor of audio of dimension (…, freq, time)

  • sample_rate (int) – The sample rate of the waveform (Hz)

  • frame_time (float, optional) – Duration of a frame (Default: 10 ** (-2)).

  • win_length (int, optional) – The window length for median smoothing (in number of frames) (Default: 30).

  • freq_low (int, optional) – Lowest frequency that can be detected (Hz) (Default: 85).

  • freq_high (int, optional) – Highest frequency that can be detected (Hz) (Default: 3400).

Returns

Tensor of freq of dimension (…, frame)

Return type

Tensor

sliding_window_cmn

torchaudio.functional.sliding_window_cmn(waveform: torch.Tensor, cmn_window: int = 600, min_cmn_window: int = 100, center: bool = False, norm_vars: bool = False) → torch.Tensor[source]

Apply sliding-window cepstral mean (and optionally variance) normalization per utterance.

Parameters
  • waveform (Tensor) – Tensor of audio of dimension (…, freq, time)

  • cmn_window (int, optional) – Window in frames for running average CMN computation (Default: 600)

  • min_cmn_window (int, optional) – Minimum CMN window used at start of decoding (adds latency only at start). Only applicable if center is False; ignored if center is True (Default: 100)

  • center (bool, optional) – If True, use a window centered on the current frame (to the extent possible, modulo end effects). If False, the window is to the left. (Default: False)

  • norm_vars (bool, optional) – If True, normalize variance to one. (Default: False)

Returns

Tensor of freq of dimension (…, frame)

Return type

Tensor

vad

torchaudio.functional.vad(waveform: torch.Tensor, sample_rate: int, trigger_level: float = 7.0, trigger_time: float = 0.25, search_time: float = 1.0, allowed_gap: float = 0.25, pre_trigger_time: float = 0.0, boot_time: float = 0.35, noise_up_time: float = 0.1, noise_down_time: float = 0.01, noise_reduction_amount: float = 1.35, measure_freq: float = 20.0, measure_duration: Optional[float] = None, measure_smooth_time: float = 0.4, hp_filter_freq: float = 50.0, lp_filter_freq: float = 6000.0, hp_lifter_freq: float = 150.0, lp_lifter_freq: float = 2000.0) → torch.Tensor[source]

Voice Activity Detector. Similar to SoX implementation. Attempts to trim silence and quiet background sounds from the ends of recordings of speech. The algorithm currently uses a simple cepstral power measurement to detect voice, so may be fooled by other things, especially music.

The effect can trim only from the front of the audio, so in order to trim from the back, the reverse effect must also be used.

Parameters
  • waveform (Tensor) – Tensor of audio of dimension (…, time)

  • sample_rate (int) – Sample rate of audio signal.

  • trigger_level (float, optional) – The measurement level used to trigger activity detection. This may need to be changed depending on the noise level, signal level, and other characteristics of the input audio. (Default: 7.0)

  • trigger_time (float, optional) – The time constant (in seconds) used to help ignore short bursts of sound. (Default: 0.25)

  • search_time (float, optional) – The amount of audio (in seconds) to search for quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 1.0)

  • allowed_gap (float, optional) – The allowed gap (in seconds) between quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 0.25)

  • pre_trigger_time (float, optional) – The amount of audio (in seconds) to preserve before the trigger point and any found quieter/shorter bursts. (Default: 0.0)

  • boot_time (float, optional) – The algorithm (internally) uses adaptive noise estimation/reduction in order to detect the start of the wanted audio. This option sets the time for the initial noise estimate. (Default: 0.35)

  • noise_up_time (float, optional) – Time constant used by the adaptive noise estimator for when the noise level is increasing. (Default: 0.1)

  • noise_down_time (float, optional) – Time constant used by the adaptive noise estimator for when the noise level is decreasing. (Default: 0.01)

  • noise_reduction_amount (float, optional) – Amount of noise reduction to use in the detection algorithm (e.g. 0, 0.5, …). (Default: 1.35)

  • measure_freq (float, optional) – Frequency of the algorithm’s processing/measurements. (Default: 20.0)

  • measure_duration (float, optional) – Measurement duration. (Default: Twice the measurement period; i.e. with overlap.)

  • measure_smooth_time (float, optional) – Time constant used to smooth spectral measurements. (Default: 0.4)

  • hp_filter_freq (float, optional) – “Brick-wall” frequency of high-pass filter applied at the input to the detector algorithm. (Default: 50.0)

  • lp_filter_freq (float, optional) – “Brick-wall” frequency of low-pass filter applied at the input to the detector algorithm. (Default: 6000.0)

  • hp_lifter_freq (float, optional) – “Brick-wall” frequency of high-pass lifter used in the detector algorithm. (Default: 150.0)

  • lp_lifter_freq (float, optional) – “Brick-wall” frequency of low-pass lifter used in the detector algorithm. (Default: 2000.0)

Returns

Tensor of audio of dimension (…, time).

Return type

Tensor

References

http://sox.sourceforge.net/sox.html
