torchaudio.transforms

Transforms are common audio transforms. They can be chained together using torch.nn.Sequential.

Spectrogram

class torchaudio.transforms.Spectrogram(n_fft: int = 400, win_length: Optional[int] = None, hop_length: Optional[int] = None, pad: int = 0, window_fn: Callable[[...], torch.Tensor] = torch.hann_window, power: Optional[float] = 2.0, normalized: bool = False, wkwargs: Optional[dict] = None)[source]

Create a spectrogram from an audio signal.

Parameters
  • n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)

  • win_length (int or None, optional) – Window size. (Default: n_fft)

  • hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)

  • pad (int, optional) – Two sided padding of signal. (Default: 0)

  • window_fn (Callable[.., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)

  • power (float or None, optional) – Exponent for the magnitude spectrogram, (must be > 0) e.g., 1 for energy, 2 for power, etc. If None, then the complex spectrum is returned instead. (Default: 2)

  • normalized (bool, optional) – Whether to normalize by magnitude after stft. (Default: False)

  • wkwargs (dict or None, optional) – Arguments for window function. (Default: None)

forward(waveform: torch.Tensor) → torch.Tensor[source]
Parameters

waveform (Tensor) – Tensor of audio of dimension (…, time).

Returns

Spectrogram of dimension (…, freq, time), where freq is n_fft // 2 + 1 (n_fft is the size of the FFT) and time is the number of window hops (n_frame).

Return type

Tensor
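
The output dimensions described above can be estimated ahead of time. The following is a minimal sketch, not the library's code: the helper name spectrogram_shape is hypothetical, and it assumes the STFT uses centered frames (one frame per hop plus one), which may not hold for every version or configuration.

```python
def spectrogram_shape(num_samples, n_fft=400, win_length=None, hop_length=None, pad=0):
    """Return (freq, n_frame) for a 1-D signal of num_samples samples.

    Hypothetical helper; assumes centered STFT frames.
    """
    # Defaults follow the parameter list: win_length -> n_fft, hop_length -> win_length // 2.
    win_length = n_fft if win_length is None else win_length
    hop_length = win_length // 2 if hop_length is None else hop_length

    # Number of frequency bins, as stated in the Returns description.
    freq = n_fft // 2 + 1

    # Assumption: `pad` samples are added on both sides, then frames are centered,
    # which gives one frame per hop plus one.
    n_frame = (num_samples + 2 * pad) // hop_length + 1
    return freq, n_frame


# One second of 16 kHz audio with the default parameters.
print(spectrogram_shape(16000))  # (201, 81)
```

With the defaults (n_fft = 400, hop_length = 200), one second of 16 kHz audio gives 201 frequency bins and 81 frames under these assumptions.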

GriffinLim

class torchaudio.transforms.GriffinLim(n_fft: int = 400, n_iter: int = 32, win_length: Optional[int] = None, hop_length: Optional[int] = None, window_fn: Callable[[...], torch.Tensor] = torch.hann_window, power: float = 2.0, normalized: bool = False, wkwargs: Optional[dict] = None, momentum: float = 0.99, length: Optional[int] = None, rand_init: bool = True)[source]

Compute waveform from a linear scale magnitude spectrogram using the Griffin-Lim transformation.

Implementation ported from librosa [1], [2], [3].

Parameters
  • n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)

  • n_iter (int, optional) – Number of iterations for the phase recovery process. (Default: 32)

  • win_length (int or None, optional) – Window size. (Default: n_fft)

  • hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)

  • window_fn (Callable[.., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)

  • power (float, optional) – Exponent for the magnitude spectrogram, (must be > 0) e.g., 1 for energy, 2 for power, etc. (Default: 2)

  • normalized (bool, optional) – Whether to normalize by magnitude after stft. (Default: False)

  • wkwargs (dict or None, optional) – Arguments for window function. (Default: None)

  • momentum (float, optional) – The momentum parameter for fast Griffin-Lim. Setting this to 0 recovers the original Griffin-Lim method. Values near 1 can lead to faster convergence, but above 1 may not converge. (Default: 0.99)

  • length (int, optional) – Array length of the expected output. (Default: None)

  • rand_init (bool, optional) – Initializes phase randomly if True and to zero otherwise. (Default: True)

References

[1] McFee, Brian, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. “librosa: Audio and music signal analysis in python.” In Proceedings of the 14th python in science conference, pp. 18-25. 2015.

[2] Perraudin, N., Balazs, P., & Søndergaard, P. L. “A fast Griffin-Lim algorithm,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 1-4), Oct. 2013.

[3] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. ASSP, vol.32, no.2, pp.236–243, Apr. 1984.

forward(specgram: torch.Tensor) → torch.Tensor[source]
Parameters

specgram (Tensor) – A magnitude-only STFT spectrogram of dimension (…, freq, frames) where freq is n_fft // 2 + 1.

Returns

waveform of (…, time), where time equals the length parameter if given.

Return type

Tensor

AmplitudeToDB

class torchaudio.transforms.AmplitudeToDB(stype: str = 'power', top_db: Optional[float] = None)[source]

Turn a tensor from the power/amplitude scale to the decibel scale.

This output depends on the maximum value in the input tensor, and so may return different values for an audio clip split into snippets vs. a full clip.

Parameters
  • stype (str, optional) – Scale of the input tensor ('power' or 'magnitude'). The power is the elementwise square of the magnitude. (Default: 'power')

  • top_db (float, optional) – minimum negative cut-off in decibels. A reasonable number is 80. (Default: None)

forward(x: torch.Tensor) → torch.Tensor[source]

Numerically stable implementation from Librosa.

https://librosa.org/doc/latest/generated/librosa.amplitude_to_db.html

Parameters

x (Tensor) – Input tensor before being converted to decibel scale.

Returns

Output tensor in decibel scale.

Return type

Tensor
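
The conversion and the role of top_db can be illustrated with a small pure-Python sketch. It is not the library's implementation: the function name amplitude_to_db, the amin floor of 1e-10, and the reference value of 1.0 are assumptions made for illustration, following the general librosa-style formula.

```python
import math


def amplitude_to_db(values, stype="power", top_db=None, amin=1e-10):
    """Convert a list of non-negative values to decibels (illustrative sketch).

    Assumes a reference value of 1.0 and a small floor `amin` to avoid log(0).
    """
    # Power uses 10 * log10(x); magnitude uses 20 * log10(x).
    multiplier = 10.0 if stype == "power" else 20.0
    db = [multiplier * math.log10(max(v, amin)) for v in values]

    if top_db is not None:
        # Clamp everything to at most `top_db` below the loudest value.
        # This is why the result depends on the maximum of the input.
        floor = max(db) - top_db
        db = [max(d, floor) for d in db]
    return db


print(amplitude_to_db([1.0, 0.1, 1e-12], stype="power", top_db=80.0))
```

Because the cut-off is measured from the maximum of the input, the same quiet sample can map to different decibel values depending on what else is in the tensor.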

MelScale

class torchaudio.transforms.MelScale(n_mels: int = 128, sample_rate: int = 16000, f_min: float = 0.0, f_max: Optional[float] = None, n_stft: Optional[int] = None)[source]

Turn a normal STFT into a mel frequency STFT, using a conversion matrix. This uses triangular filter banks.

The user can control which device the filter bank (fb) is on (e.g. fb.to(spec_f.device)).

Parameters
  • n_mels (int, optional) – Number of mel filterbanks. (Default: 128)

  • sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)

  • f_min (float, optional) – Minimum frequency. (Default: 0.)

  • f_max (float or None, optional) – Maximum frequency. (Default: sample_rate // 2)

  • n_stft (int, optional) – Number of bins in STFT. Calculated from first input if None is given. See n_fft in Spectrogram. (Default: None)

forward(specgram: torch.Tensor) → torch.Tensor[source]
Parameters

specgram (Tensor) – A spectrogram STFT of dimension (…, freq, time).

Returns

Mel frequency spectrogram of size (…, n_mels, time).

Return type

Tensor
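
To give a feel for how triangular mel filters are placed, the sketch below computes the band edge frequencies for a given n_mels. It assumes the common HTK-style mel formula (2595 * log10(1 + f / 700)); the helper names are hypothetical, and the library may use a different mel variant depending on version.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style Hz -> mel conversion (an assumption, not confirmed by this page)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels=128, sample_rate=16000, f_min=0.0, f_max=None):
    """Return n_mels + 2 frequencies (Hz), evenly spaced on the mel scale.

    Triangular filter i would rise from edge i to edge i + 1 and fall to edge i + 2.
    """
    f_max = float(sample_rate // 2) if f_max is None else f_max
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_mels + 2)]


edges = mel_band_edges(n_mels=4)
print([round(e, 1) for e in edges])
```

The edges are evenly spaced in mel but grow further apart in Hz, so higher filters cover wider frequency ranges.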

InverseMelScale

class torchaudio.transforms.InverseMelScale(n_stft: int, n_mels: int = 128, sample_rate: int = 16000, f_min: float = 0.0, f_max: Optional[float] = None, max_iter: int = 100000, tolerance_loss: float = 1e-05, tolerance_change: float = 1e-08, sgdargs: Optional[dict] = None)[source]

Solve for a normal STFT from a mel frequency STFT, using a conversion matrix. This uses triangular filter banks.

It minimizes the Euclidean norm between the input mel-spectrogram and the product of the estimated spectrogram and the filter banks, using SGD.

Parameters
  • n_stft (int) – Number of bins in STFT. See n_fft in Spectrogram.

  • n_mels (int, optional) – Number of mel filterbanks. (Default: 128)

  • sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)

  • f_min (float, optional) – Minimum frequency. (Default: 0.)

  • f_max (float or None, optional) – Maximum frequency. (Default: sample_rate // 2)

  • max_iter (int, optional) – Maximum number of optimization iterations. (Default: 100000)

  • tolerance_loss (float, optional) – Value of loss to stop optimization at. (Default: 1e-5)

  • tolerance_change (float, optional) – Difference in losses to stop optimization at. (Default: 1e-8)

  • sgdargs (dict or None, optional) – Arguments for the SGD optimizer. (Default: None)

forward(melspec: torch.Tensor) → torch.Tensor[source]
Parameters

melspec (Tensor) – A Mel frequency spectrogram of dimension (…, n_mels, time)

Returns

Linear scale spectrogram of size (…, freq, time)

Return type

Tensor

MelSpectrogram

class torchaudio.transforms.MelSpectrogram(sample_rate: int = 16000, n_fft: int = 400, win_length: Optional[int] = None, hop_length: Optional[int] = None, f_min: float = 0.0, f_max: Optional[float] = None, pad: int = 0, n_mels: int = 128, window_fn: Callable[[...], torch.Tensor] = torch.hann_window, power: Optional[float] = 2.0, normalized: bool = False, wkwargs: Optional[dict] = None)[source]

Create MelSpectrogram for a raw audio signal. This is a composition of Spectrogram and MelScale.

Parameters
  • sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)

  • win_length (int or None, optional) – Window size. (Default: n_fft)

  • hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)

  • n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)

  • f_min (float, optional) – Minimum frequency. (Default: 0.)

  • f_max (float or None, optional) – Maximum frequency. (Default: None)

  • pad (int, optional) – Two sided padding of signal. (Default: 0)

  • n_mels (int, optional) – Number of mel filterbanks. (Default: 128)

  • window_fn (Callable[.., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)

  • wkwargs (Dict[.., ..] or None, optional) – Arguments for window function. (Default: None)

Example
>>> import torchaudio
>>> from torchaudio import transforms
>>> waveform, sample_rate = torchaudio.load('test.wav')  # normalized float waveform by default
>>> mel_specgram = transforms.MelSpectrogram(sample_rate)(waveform)  # (channel, n_mels, time)
forward(waveform: torch.Tensor) → torch.Tensor[source]
Parameters

waveform (Tensor) – Tensor of audio of dimension (…, time).

Returns

Mel frequency spectrogram of size (…, n_mels, time).

Return type

Tensor

MFCC

class torchaudio.transforms.MFCC(sample_rate: int = 16000, n_mfcc: int = 40, dct_type: int = 2, norm: str = 'ortho', log_mels: bool = False, melkwargs: Optional[dict] = None)[source]

Create the Mel-frequency cepstrum coefficients from an audio signal.

By default, this calculates the MFCC on the DB-scaled Mel spectrogram. This is not the textbook implementation, but is implemented here to give consistency with librosa.

This output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.

Parameters
  • sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)

  • n_mfcc (int, optional) – Number of MFC coefficients to retain. (Default: 40)

  • dct_type (int, optional) – type of DCT (discrete cosine transform) to use. (Default: 2)

  • norm (str, optional) – norm to use. (Default: 'ortho')

  • log_mels (bool, optional) – whether to use log-mel spectrograms instead of db-scaled. (Default: False)

  • melkwargs (dict or None, optional) – arguments for MelSpectrogram. (Default: None)

forward(waveform: torch.Tensor) → torch.Tensor[source]
Parameters

waveform (Tensor) – Tensor of audio of dimension (…, time).

Returns

specgram_mel_db of size (…, n_mfcc, time).

Return type

Tensor
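
The dct_type and norm parameters refer to the discrete cosine transform applied across the mel bands of each frame. Below is a minimal pure-Python sketch of a type-II DCT with 'ortho' scaling; the function names are hypothetical and the scaling used for norm other than 'ortho' is an assumption.

```python
import math


def dct_matrix(n_mfcc, n_mels, norm="ortho"):
    """Build an (n_mfcc x n_mels) type-II DCT matrix as nested lists."""
    rows = []
    for k in range(n_mfcc):
        row = [math.cos(math.pi / n_mels * (n + 0.5) * k) for n in range(n_mels)]
        if norm == "ortho":
            # Orthonormal scaling: the first row is scaled differently from the rest.
            scale = math.sqrt(1.0 / n_mels) if k == 0 else math.sqrt(2.0 / n_mels)
        else:
            # Assumption: unnormalized DCT-II uses a factor of 2.
            scale = 2.0
        rows.append([scale * v for v in row])
    return rows


def apply_dct(frame, matrix):
    """Project one frame of (log or dB) mel energies onto the DCT basis."""
    return [sum(w * x for w, x in zip(row, frame)) for row in matrix]


# A flat mel frame has energy only in coefficient 0.
coeffs = apply_dct([1.0] * 8, dct_matrix(n_mfcc=4, n_mels=8))
print([round(c, 6) for c in coeffs])
```

Keeping only the first n_mfcc rows is what reduces the n_mels mel bands to n_mfcc coefficients per frame.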

MuLawEncoding

class torchaudio.transforms.MuLawEncoding(quantization_channels: int = 256)[source]

Encode signal based on mu-law companding. For more info see the Wikipedia entry on the mu-law algorithm.

This algorithm assumes the signal has been scaled to between -1 and 1 and returns a signal encoded with values from 0 to quantization_channels - 1.

Parameters

quantization_channels (int, optional) – Number of channels. (Default: 256)

forward(x: torch.Tensor) → torch.Tensor[source]
Parameters

x (Tensor) – A signal to be encoded.

Returns

An encoded signal.

Return type

x_mu (Tensor)
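
The mapping from the range [-1, 1] to integer codes can be sketched for a single sample as follows. This is an illustration of standard mu-law companding, not the library's code; the function name mu_law_encode and the rounding step are assumptions.

```python
import math


def mu_law_encode(x, quantization_channels=256):
    """Encode one sample in [-1, 1] to an integer in [0, quantization_channels - 1]."""
    mu = quantization_channels - 1
    # Logarithmic compression: small amplitudes get more of the code range.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest integer code.
    return int((compressed + 1.0) / 2.0 * mu + 0.5)


for sample in (-1.0, -0.01, 0.0, 0.01, 1.0):
    print(sample, mu_law_encode(sample))
```

Samples near zero are spread over many more codes than samples near full scale, which is the point of companding.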

MuLawDecoding

class torchaudio.transforms.MuLawDecoding(quantization_channels: int = 256)[source]

Decode mu-law encoded signal. For more info see the Wikipedia entry on the mu-law algorithm.

This expects an input with values between 0 and quantization_channels - 1 and returns a signal scaled between -1 and 1.

Parameters

quantization_channels (int, optional) – Number of channels. (Default: 256)

forward(x_mu: torch.Tensor) → torch.Tensor[source]
Parameters

x_mu (Tensor) – A mu-law encoded signal which needs to be decoded.

Returns

The signal decoded.

Return type

Tensor
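
The inverse mapping, from integer codes back to the range [-1, 1], can be sketched for a single code as follows. As with any such sketch, this illustrates standard mu-law expansion rather than the library's code, and the function name mu_law_decode is hypothetical.

```python
import math


def mu_law_decode(x_mu, quantization_channels=256):
    """Decode one integer code in [0, quantization_channels - 1] to a value in [-1, 1]."""
    mu = quantization_channels - 1
    # Map the code from [0, mu] back to [-1, 1].
    x = (x_mu / mu) * 2.0 - 1.0
    # Exponential expansion undoes the logarithmic compression.
    return math.copysign(math.expm1(abs(x) * math.log1p(mu)) / mu, x)


for code in (0, 64, 128, 192, 255):
    print(code, round(mu_law_decode(code), 5))
```

Decoding does not restore the original sample exactly, since quantization to a finite number of codes discards information.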

Resample

class torchaudio.transforms.Resample(orig_freq: int = 16000, new_freq: int = 16000, resampling_method: str = 'sinc_interpolation')[source]

Resample a signal from one frequency to another. A resampling method can be given.

Parameters
  • orig_freq (float, optional) – The original frequency of the signal. (Default: 16000)

  • new_freq (float, optional) – The desired frequency. (Default: 16000)

  • resampling_method (str, optional) – The resampling method. (Default: 'sinc_interpolation')

forward(waveform: torch.Tensor) → torch.Tensor[source]
Parameters

waveform (Tensor) – Tensor of audio of dimension (…, time).

Returns

Output signal of dimension (…, time).

Return type

Tensor

ComplexNorm

class torchaudio.transforms.ComplexNorm(power: float = 1.0)[source]

Compute the norm of complex tensor input.

Parameters

power (float, optional) – Power of the norm. (Default: 1.0)

forward(complex_tensor: torch.Tensor) → torch.Tensor[source]
Parameters

complex_tensor (Tensor) – Tensor shape of (…, complex=2).

Returns

Norm of the input tensor, with shape (…, ).

Return type

Tensor
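
For a tensor whose last dimension holds (real, imaginary) pairs, the norm reduces that dimension. A minimal pure-Python sketch on a list of pairs, with a hypothetical function name:

```python
def complex_norm(pairs, power=1.0):
    """Return (re^2 + im^2) ** (power / 2) for each (re, im) pair."""
    return [(re * re + im * im) ** (0.5 * power) for re, im in pairs]


print(complex_norm([(3.0, 4.0), (0.0, 2.0)]))             # magnitude
print(complex_norm([(3.0, 4.0), (0.0, 2.0)], power=2.0))  # power
```

With power = 1.0 this gives the magnitude, and with power = 2.0 the squared magnitude.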

ComputeDeltas

class torchaudio.transforms.ComputeDeltas(win_length: int = 5, mode: str = 'replicate')[source]

Compute delta coefficients of a tensor, usually a spectrogram.

See torchaudio.functional.compute_deltas for more details.

Parameters
  • win_length (int) – The window length used for computing delta. (Default: 5)

  • mode (str) – Mode parameter passed to padding. (Default: 'replicate')

forward(specgram: torch.Tensor) → torch.Tensor[source]
Parameters

specgram (Tensor) – Tensor of audio of dimension (…, freq, time).

Returns

Tensor of deltas of dimension (…, freq, time).

Return type

Tensor
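
Delta coefficients are commonly computed as a weighted difference over neighbouring frames, d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2) with N = (win_length - 1) // 2. The sketch below applies that formula along one time series with edge replication; it assumes this standard formula and a win_length of at least 3, and the function name is hypothetical.

```python
def compute_deltas(sequence, win_length=5):
    """Delta coefficients of a 1-D list along time, with 'replicate' edge padding."""
    n = (win_length - 1) // 2
    denom = 2 * sum(i * i for i in range(1, n + 1))
    last = len(sequence) - 1

    def at(t):
        # Replicate padding: indices outside the sequence reuse the edge value.
        return sequence[min(max(t, 0), last)]

    return [
        sum(i * (at(t + i) - at(t - i)) for i in range(1, n + 1)) / denom
        for t in range(len(sequence))
    ]


# A linear ramp has slope 1 in the interior; edges are damped by the padding.
print(compute_deltas([0.0, 1.0, 2.0, 3.0, 4.0, 5.0]))
```

For a spectrogram, the same computation would be applied independently to each frequency bin along the time axis.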

TimeStretch

class torchaudio.transforms.TimeStretch(hop_length: Optional[int] = None, n_freq: int = 201, fixed_rate: Optional[float] = None)[source]

Stretch stft in time without modifying pitch for a given rate.

Parameters
  • hop_length (int or None, optional) – Length of hop between STFT windows. (Default: n_fft // 2, where n_fft = (n_freq - 1) * 2)

  • n_freq (int, optional) – number of filter banks from stft. (Default: 201)

  • fixed_rate (float or None, optional) – rate to speed up or slow down by. If None is provided, rate must be passed to the forward method. (Default: None)

forward(complex_specgrams: torch.Tensor, overriding_rate: Optional[float] = None) → torch.Tensor[source]
Parameters
  • complex_specgrams (Tensor) – complex spectrogram (…, freq, time, complex=2).

  • overriding_rate (float or None, optional) – speed up to apply to this batch. If no rate is passed, use self.fixed_rate. (Default: None)

Returns

Stretched complex spectrogram of dimension (…, freq, ceil(time/rate), complex=2).

Return type

Tensor

Fade

class torchaudio.transforms.Fade(fade_in_len: int = 0, fade_out_len: int = 0, fade_shape: str = 'linear')[source]

Add a fade in and/or fade out to a waveform.

Parameters
  • fade_in_len (int, optional) – Length of fade-in (time frames). (Default: 0)

  • fade_out_len (int, optional) – Length of fade-out (time frames). (Default: 0)

  • fade_shape (str, optional) – Shape of fade. Must be one of: “quarter_sine”, “half_sine”, “linear”, “logarithmic”, “exponential”. (Default: "linear")

forward(waveform: torch.Tensor) → torch.Tensor[source]
Parameters

waveform (Tensor) – Tensor of audio of dimension (…, time).

Returns

Tensor of audio of dimension (…, time).

Return type

Tensor
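
The effect of fade_in_len and fade_out_len can be sketched for the "linear" shape only. This is an illustration on a plain list, not the library's code: the helper names are hypothetical, and it assumes both fade lengths fit within the waveform.

```python
def linear_ramp(length):
    """Return `length` evenly spaced gains from 0.0 to 1.0."""
    if length <= 0:
        return []
    if length == 1:
        return [0.0]
    return [i / (length - 1) for i in range(length)]


def apply_linear_fade(waveform, fade_in_len=0, fade_out_len=0):
    """Scale the first and last samples of a 1-D list by linear gains."""
    out = list(waveform)
    # Fade in: gains rise from 0 to 1 over the first fade_in_len samples.
    for i, gain in enumerate(linear_ramp(fade_in_len)):
        out[i] *= gain
    # Fade out: gains fall from 1 to 0 over the last fade_out_len samples.
    for i, gain in enumerate(linear_ramp(fade_out_len)):
        out[len(out) - fade_out_len + i] *= 1.0 - gain
    return out


print(apply_linear_fade([1.0] * 6, fade_in_len=3, fade_out_len=3))
# [0.0, 0.5, 1.0, 1.0, 0.5, 0.0]
```

The other shapes would replace the linear ramp with a different gain curve over the same number of samples.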

FrequencyMasking

class torchaudio.transforms.FrequencyMasking(freq_mask_param: int, iid_masks: bool = False)[source]

Apply masking to a spectrogram in the frequency domain.

Parameters
  • freq_mask_param (int) – maximum possible length of the mask. Indices uniformly sampled from [0, freq_mask_param).

  • iid_masks (bool, optional) – whether to apply different masks to each example/channel in the batch. (Default: False) This option is applicable only when the input tensor is 4D.

forward(specgram: torch.Tensor, mask_value: float = 0.0) → torch.Tensor
Parameters
  • specgram (Tensor) – Tensor of dimension (…, freq, time).

  • mask_value (float) – Value to assign to the masked columns.

Returns

Masked spectrogram of dimensions (…, freq, time).

Return type

Tensor
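
The masking procedure can be sketched on a nested list indexed as specgram[freq][time]. This is a hedged illustration of the idea (a random band of consecutive frequency bins is overwritten), not the library's code; the function name, the rng argument, and the exact sampling of the start position are assumptions, and freq_mask_param is assumed not to exceed the number of bins.

```python
import random


def mask_frequency(specgram, freq_mask_param, mask_value=0.0, rng=None):
    """Overwrite a random band of consecutive frequency bins with mask_value."""
    rng = rng or random.Random()
    n_freq = len(specgram)

    # Mask width is sampled uniformly from [0, freq_mask_param).
    width = rng.random() * freq_mask_param
    # Start is sampled so that the band stays inside the spectrogram.
    start = int(rng.random() * (n_freq - width))
    end = start + int(width)

    return [
        [mask_value] * len(row) if start <= f < end else list(row)
        for f, row in enumerate(specgram)
    ]


spec = [[1.0] * 4 for _ in range(10)]  # 10 frequency bins, 4 time frames
masked = mask_frequency(spec, freq_mask_param=3, rng=random.Random(0))
print([row[0] for row in masked])
```

Masking in the time domain follows the same pattern with the roles of the two axes swapped.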

TimeMasking

class torchaudio.transforms.TimeMasking(time_mask_param: int, iid_masks: bool = False)[source]

Apply masking to a spectrogram in the time domain.

Parameters
  • time_mask_param (int) – maximum possible length of the mask. Indices uniformly sampled from [0, time_mask_param).

  • iid_masks (bool, optional) – whether to apply different masks to each example/channel in the batch. (Default: False) This option is applicable only when the input tensor is 4D.

forward(specgram: torch.Tensor, mask_value: float = 0.0) → torch.Tensor
Parameters
  • specgram (Tensor) – Tensor of dimension (…, freq, time).

  • mask_value (float) – Value to assign to the masked columns.

Returns

Masked spectrogram of dimensions (…, freq, time).

Return type

Tensor

Vol

class torchaudio.transforms.Vol(gain: float, gain_type: str = 'amplitude')[source]

Add a volume to a waveform.

Parameters
  • gain (float) – Interpreted according to the given gain_type:

      • If gain_type = amplitude, gain is a positive amplitude ratio.
      • If gain_type = power, gain is a power (voltage squared).
      • If gain_type = db, gain is in decibels.

  • gain_type (str, optional) – Type of gain. One of: amplitude, power, db. (Default: amplitude)

forward(waveform: torch.Tensor) → torch.Tensor[source]
Parameters

waveform (Tensor) – Tensor of audio of dimension (…, time).

Returns

Tensor of audio of dimension (…, time).

Return type

Tensor
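
The three gain types can all be reduced to an amplitude ratio. The sketch below shows one way to do that on a plain list of samples; it is an illustration rather than the library's code, the function name apply_gain is hypothetical, and the final clamp to [-1, 1] is an assumption about how out-of-range samples are handled.

```python
import math


def apply_gain(waveform, gain, gain_type="amplitude"):
    """Scale a 1-D list of samples by a gain given as amplitude, power, or dB."""
    if gain_type == "amplitude":
        ratio = gain
    elif gain_type == "db":
        # Decibels to amplitude ratio.
        ratio = 10.0 ** (gain / 20.0)
    elif gain_type == "power":
        # Power ratio -> decibels -> amplitude ratio (equivalent to sqrt(gain)).
        ratio = 10.0 ** (10.0 * math.log10(gain) / 20.0)
    else:
        raise ValueError("gain_type must be 'amplitude', 'power', or 'db'")

    # Assumption: samples are clamped to the valid range after scaling.
    return [min(1.0, max(-1.0, s * ratio)) for s in waveform]


print(apply_gain([0.25, -0.5], 2.0, "amplitude"))
print(apply_gain([0.25, -0.5], 4.0, "power"))
print(apply_gain([0.25, -0.5], 6.0, "db"))
```

A power gain of 4 and an amplitude gain of 2 produce the same scaling, since power is proportional to amplitude squared.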

SlidingWindowCmn

class torchaudio.transforms.SlidingWindowCmn(cmn_window: int = 600, min_cmn_window: int = 100, center: bool = False, norm_vars: bool = False)[source]

Apply sliding-window cepstral mean (and optionally variance) normalization per utterance.

Parameters
  • cmn_window (int, optional) – Window in frames for running average CMN computation. (Default: 600)

  • min_cmn_window (int, optional) – Minimum CMN window used at start of decoding (adds latency only at start). Only applicable if center is False; ignored if center is True. (Default: 100)

  • center (bool, optional) – If True, use a window centered on the current frame (to the extent possible, modulo end effects). If False, the window is to the left. (Default: False)

  • norm_vars (bool, optional) – If True, normalize variance to one. (Default: False)

forward(waveform: torch.Tensor) → torch.Tensor[source]
Parameters

waveform (Tensor) – Tensor of audio of dimension (…, time).

Returns

Tensor of audio of dimension (…, time).

Return type

Tensor

Vad

class torchaudio.transforms.Vad(sample_rate: int, trigger_level: float = 7.0, trigger_time: float = 0.25, search_time: float = 1.0, allowed_gap: float = 0.25, pre_trigger_time: float = 0.0, boot_time: float = 0.35, noise_up_time: float = 0.1, noise_down_time: float = 0.01, noise_reduction_amount: float = 1.35, measure_freq: float = 20.0, measure_duration: Optional[float] = None, measure_smooth_time: float = 0.4, hp_filter_freq: float = 50.0, lp_filter_freq: float = 6000.0, hp_lifter_freq: float = 150.0, lp_lifter_freq: float = 2000.0)[source]

Voice Activity Detector. Similar to SoX implementation. Attempts to trim silence and quiet background sounds from the ends of recordings of speech. The algorithm currently uses a simple cepstral power measurement to detect voice, so may be fooled by other things, especially music.

The effect can trim only from the front of the audio, so in order to trim from the back, the reverse effect must also be used.

Parameters
  • sample_rate (int) – Sample rate of audio signal.

  • trigger_level (float, optional) – The measurement level used to trigger activity detection. This may need to be changed depending on the noise level, signal level, and other characteristics of the input audio. (Default: 7.0)

  • trigger_time (float, optional) – The time constant (in seconds) used to help ignore short bursts of sound. (Default: 0.25)

  • search_time (float, optional) – The amount of audio (in seconds) to search for quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 1.0)

  • allowed_gap (float, optional) – The allowed gap (in seconds) between quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 0.25)

  • pre_trigger_time (float, optional) – The amount of audio (in seconds) to preserve before the trigger point and any found quieter/shorter bursts. (Default: 0.0)

  • boot_time (float, optional) – The algorithm (internally) uses adaptive noise estimation/reduction in order to detect the start of the wanted audio. This option sets the time for the initial noise estimate. (Default: 0.35)

  • noise_up_time (float, optional) – Time constant used by the adaptive noise estimator for when the noise level is increasing. (Default: 0.1)

  • noise_down_time (float, optional) – Time constant used by the adaptive noise estimator for when the noise level is decreasing. (Default: 0.01)

  • noise_reduction_amount (float, optional) – Amount of noise reduction to use in the detection algorithm (e.g. 0, 0.5, …). (Default: 1.35)

  • measure_freq (float, optional) – Frequency of the algorithm's processing/measurements. (Default: 20.0)

  • measure_duration (float or None, optional) – Measurement duration. (Default: twice the measurement period, i.e. with overlap)

  • measure_smooth_time (float, optional) – Time constant used to smooth spectral measurements. (Default: 0.4)

  • hp_filter_freq (float, optional) – "High-pass" filter applied at the input to the detector algorithm. (Default: 50.0)

  • lp_filter_freq (float, optional) – "Low-pass" filter applied at the input to the detector algorithm. (Default: 6000.0)

  • hp_lifter_freq (float, optional) – "High-pass" lifter used in the detector algorithm. (Default: 150.0)

  • lp_lifter_freq (float, optional) – "Low-pass" lifter used in the detector algorithm. (Default: 2000.0)

References

http://sox.sourceforge.net/sox.html

forward(waveform: torch.Tensor) → torch.Tensor[source]
Parameters

waveform (Tensor) – Tensor of audio of dimension (…, time).
