torchaudio.transforms

Transforms are common audio transforms. They can be chained together using torch.nn.Sequential.
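For example, a feature-extraction pipeline can be assembled by composing transforms; a minimal sketch (the random waveform and the hyperparameters here are illustrative placeholders, not prescribed values):

    >>> import torch
    >>> import torch.nn as nn
    >>> import torchaudio.transforms as T
    >>> # Chain: waveform -> mel spectrogram -> decibel scale.
    >>> pipeline = nn.Sequential(
    ...     T.MelSpectrogram(sample_rate=16000, n_fft=400, n_mels=128),
    ...     T.AmplitudeToDB(stype='power'),
    ... )
    >>> waveform = torch.randn(1, 16000)  # placeholder for a real mono recording
    >>> features = pipeline(waveform)     # (channel, n_mels, time)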
Spectrogram

class torchaudio.transforms.Spectrogram(n_fft: int = 400, win_length: Optional[int] = None, hop_length: Optional[int] = None, pad: int = 0, window_fn: Callable[..., torch.Tensor] = torch.hann_window, power: Optional[float] = 2.0, normalized: bool = False, wkwargs: Optional[dict] = None, center: bool = True, pad_mode: str = 'reflect', onesided: bool = True, return_complex: bool = False)

Create a spectrogram from an audio signal.

- Parameters
  - n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
  - win_length (int or None, optional) – Window size. (Default: n_fft)
  - hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
  - pad (int, optional) – Two-sided padding of the signal. (Default: 0)
  - window_fn (Callable[..., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
  - power (float or None, optional) – Exponent for the magnitude spectrogram (must be > 0), e.g., 1 for energy, 2 for power, etc. If None, the complex spectrum is returned instead. (Default: 2)
  - normalized (bool, optional) – Whether to normalize by magnitude after STFT. (Default: False)
  - wkwargs (dict or None, optional) – Arguments for the window function. (Default: None)
  - center (bool, optional) – Whether to pad waveform on both sides so that the \(t\)-th frame is centered at time \(t \times \text{hop\_length}\). (Default: True)
  - pad_mode (str, optional) – Controls the padding method used when center is True. (Default: "reflect")
  - onesided (bool, optional) – Controls whether to return half of the results to avoid redundancy. (Default: True)
  - return_complex (bool, optional) – Indicates whether the resulting complex-valued Tensor should be represented with a native complex dtype, such as torch.cfloat or torch.cdouble, or a real dtype mimicking complex values with an extra dimension for the real and imaginary parts. This argument is only effective when power=None. See also torch.view_as_real.

forward(waveform: torch.Tensor) → torch.Tensor

- Parameters
  - waveform (Tensor) – Tensor of audio of dimension (…, time).
- Returns
  Dimension (…, freq, time), where freq is n_fft // 2 + 1 (n_fft is the number of Fourier bins) and time is the number of window hops (n_frame).
- Return type
  Tensor
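A minimal usage sketch (the random waveform stands in for real audio loaded with torchaudio.load):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.randn(1, 16000)  # 1 second of mono audio at 16 kHz (placeholder)
    >>> # Power spectrogram with the default Hann window.
    >>> spectrogram = T.Spectrogram(n_fft=400, hop_length=200, power=2.0)
    >>> spec = spectrogram(waveform)  # (1, 201, 81): (channel, n_fft // 2 + 1, n_frame)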
GriffinLim

class torchaudio.transforms.GriffinLim(n_fft: int = 400, n_iter: int = 32, win_length: Optional[int] = None, hop_length: Optional[int] = None, window_fn: Callable[..., torch.Tensor] = torch.hann_window, power: float = 2.0, wkwargs: Optional[dict] = None, momentum: float = 0.99, length: Optional[int] = None, rand_init: bool = True)

Compute a waveform from a linear-scale magnitude spectrogram using the Griffin-Lim transformation.

Implementation ported from [1], [2] and [3].

- Parameters
  - n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
  - n_iter (int, optional) – Number of iterations for the phase recovery process. (Default: 32)
  - win_length (int or None, optional) – Window size. (Default: n_fft)
  - hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
  - window_fn (Callable[..., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
  - power (float, optional) – Exponent for the magnitude spectrogram (must be > 0), e.g., 1 for energy, 2 for power, etc. (Default: 2)
  - wkwargs (dict or None, optional) – Arguments for the window function. (Default: None)
  - momentum (float, optional) – The momentum parameter for fast Griffin-Lim. Setting this to 0 recovers the original Griffin-Lim method. Values near 1 can lead to faster convergence, but above 1 may not converge. (Default: 0.99)
  - length (int, optional) – Array length of the expected output. (Default: None)
  - rand_init (bool, optional) – Initializes phase randomly if True, and to zero otherwise. (Default: True)
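Griffin-Lim needs a magnitude spectrogram produced with matching STFT settings; a sketch of the round trip (placeholder waveform, illustrative settings):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.randn(1, 16000)  # placeholder mono signal
    >>> # n_fft and power must match between the two transforms.
    >>> spec = T.Spectrogram(n_fft=400, power=2.0)(waveform)
    >>> reconstructed = T.GriffinLim(n_fft=400, power=2.0, n_iter=32)(spec)  # (1, time)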
AmplitudeToDB

class torchaudio.transforms.AmplitudeToDB(stype: str = 'power', top_db: Optional[float] = None)

Turn a tensor from the power/amplitude scale to the decibel scale.

This output depends on the maximum value in the input tensor, and so may return different values for an audio clip split into snippets vs. a full clip.

- Parameters
  - stype (str, optional) – Scale of the input tensor ('power' or 'magnitude'). The power is the elementwise square of the magnitude. (Default: 'power')
  - top_db (float or None, optional) – Minimum negative cut-off in decibels. A reasonable number is 80. (Default: None)

forward(x: torch.Tensor) → torch.Tensor

Numerically stable implementation from librosa: https://librosa.org/doc/latest/generated/librosa.amplitude_to_db.html

- Parameters
  - x (Tensor) – Input tensor before being converted to decibel scale.
- Returns
  Output tensor in decibel scale.
- Return type
  Tensor
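A short sketch converting a power spectrogram to decibels with an 80 dB cut-off (placeholder waveform):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.randn(1, 16000)           # placeholder audio
    >>> spec = T.Spectrogram(power=2.0)(waveform)  # power spectrogram
    >>> # top_db clamps the output below max(output) - 80 dB.
    >>> db_spec = T.AmplitudeToDB(stype='power', top_db=80.0)(spec)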
MelScale

class torchaudio.transforms.MelScale(n_mels: int = 128, sample_rate: int = 16000, f_min: float = 0.0, f_max: Optional[float] = None, n_stft: Optional[int] = None, norm: Optional[str] = None, mel_scale: str = 'htk')

Turn a normal STFT into a mel-frequency STFT using a conversion matrix. This uses triangular filter banks.

The user can control which device the filter bank (fb) is on (e.g. fb.to(spec_f.device)).

- Parameters
  - n_mels (int, optional) – Number of mel filterbanks. (Default: 128)
  - sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)
  - f_min (float, optional) – Minimum frequency. (Default: 0.)
  - f_max (float or None, optional) – Maximum frequency. (Default: sample_rate // 2)
  - n_stft (int, optional) – Number of bins in STFT. Calculated from the first input if None is given. See n_fft in Spectrogram. (Default: None)
  - norm (str or None, optional) – If 'slaney', divide the triangular mel weights by the width of the mel band (area normalization). (Default: None)
  - mel_scale (str, optional) – Scale to use: htk or slaney. (Default: htk)
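A sketch applying MelScale to a linear spectrogram; n_stft must equal the spectrogram's n_fft // 2 + 1 bins (placeholder waveform, illustrative settings):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.randn(1, 16000)           # placeholder audio
    >>> spec = T.Spectrogram(n_fft=400)(waveform)  # (1, 201, time)
    >>> # 201 = 400 // 2 + 1 linear bins, mapped onto 128 mel bins.
    >>> mel_spec = T.MelScale(n_mels=128, sample_rate=16000, n_stft=201)(spec)  # (1, 128, time)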
InverseMelScale

class torchaudio.transforms.InverseMelScale(n_stft: int, n_mels: int = 128, sample_rate: int = 16000, f_min: float = 0.0, f_max: Optional[float] = None, max_iter: int = 100000, tolerance_loss: float = 1e-05, tolerance_change: float = 1e-08, sgdargs: Optional[dict] = None, norm: Optional[str] = None, mel_scale: str = 'htk')

Solve for a normal STFT from a mel-frequency STFT using a conversion matrix. This uses triangular filter banks.

It minimizes the Euclidean norm between the input mel spectrogram and the product of the estimated spectrogram and the filter banks, using SGD.

- Parameters
  - n_stft (int) – Number of bins in STFT. See n_fft in Spectrogram.
  - n_mels (int, optional) – Number of mel filterbanks. (Default: 128)
  - sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)
  - f_min (float, optional) – Minimum frequency. (Default: 0.)
  - f_max (float or None, optional) – Maximum frequency. (Default: sample_rate // 2)
  - max_iter (int, optional) – Maximum number of optimization iterations. (Default: 100000)
  - tolerance_loss (float, optional) – Value of loss to stop optimization at. (Default: 1e-5)
  - tolerance_change (float, optional) – Difference in losses to stop optimization at. (Default: 1e-8)
  - sgdargs (dict or None, optional) – Arguments for the SGD optimizer. (Default: None)
  - norm (str or None, optional) – If 'slaney', divide the triangular mel weights by the width of the mel band (area normalization). (Default: None)
  - mel_scale (str, optional) – Scale to use: htk or slaney. (Default: htk)
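A sketch recovering an approximate linear spectrogram from a mel spectrogram; the SGD solve makes this much slower than the forward MelScale (placeholder waveform, illustrative settings):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.randn(1, 16000)  # placeholder audio
    >>> mel_spec = T.MelSpectrogram(sample_rate=16000, n_fft=400, n_mels=128)(waveform)
    >>> # n_stft must match the linear resolution to recover: n_fft // 2 + 1.
    >>> inverse = T.InverseMelScale(n_stft=201, n_mels=128, sample_rate=16000)
    >>> spec_est = inverse(mel_spec)  # (1, 201, time), an SGD estimate, not an exact inverse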
MelSpectrogram

class torchaudio.transforms.MelSpectrogram(sample_rate: int = 16000, n_fft: int = 400, win_length: Optional[int] = None, hop_length: Optional[int] = None, f_min: float = 0.0, f_max: Optional[float] = None, pad: int = 0, n_mels: int = 128, window_fn: Callable[..., torch.Tensor] = torch.hann_window, power: Optional[float] = 2.0, normalized: bool = False, wkwargs: Optional[dict] = None, center: bool = True, pad_mode: str = 'reflect', onesided: bool = True, norm: Optional[str] = None, mel_scale: str = 'htk')

Create a mel spectrogram for a raw audio signal. This is a composition of Spectrogram and MelScale.

- Parameters
  - sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)
  - n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
  - win_length (int or None, optional) – Window size. (Default: n_fft)
  - hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
  - f_min (float, optional) – Minimum frequency. (Default: 0.)
  - f_max (float or None, optional) – Maximum frequency. (Default: None)
  - pad (int, optional) – Two-sided padding of the signal. (Default: 0)
  - n_mels (int, optional) – Number of mel filterbanks. (Default: 128)
  - window_fn (Callable[..., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
  - power (float, optional) – Exponent for the magnitude spectrogram (must be > 0), e.g., 1 for energy, 2 for power, etc. (Default: 2)
  - normalized (bool, optional) – Whether to normalize by magnitude after STFT. (Default: False)
  - wkwargs (dict or None, optional) – Arguments for the window function. (Default: None)
  - center (bool, optional) – Whether to pad waveform on both sides so that the \(t\)-th frame is centered at time \(t \times \text{hop\_length}\). (Default: True)
  - pad_mode (str, optional) – Controls the padding method used when center is True. (Default: "reflect")
  - onesided (bool, optional) – Controls whether to return half of the results to avoid redundancy. (Default: True)
  - norm (str or None, optional) – If 'slaney', divide the triangular mel weights by the width of the mel band (area normalization). (Default: None)
  - mel_scale (str, optional) – Scale to use: htk or slaney. (Default: htk)
- Example
    >>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
    >>> mel_specgram = transforms.MelSpectrogram(sample_rate)(waveform)  # (channel, n_mels, time)
MFCC

class torchaudio.transforms.MFCC(sample_rate: int = 16000, n_mfcc: int = 40, dct_type: int = 2, norm: str = 'ortho', log_mels: bool = False, melkwargs: Optional[dict] = None)

Create the mel-frequency cepstrum coefficients from an audio signal.

By default, this calculates the MFCC on the dB-scaled mel spectrogram. This is not the textbook implementation, but is implemented here for consistency with librosa.

This output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.

- Parameters
  - sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)
  - n_mfcc (int, optional) – Number of MFC coefficients to retain. (Default: 40)
  - dct_type (int, optional) – Type of DCT (discrete cosine transform) to use. (Default: 2)
  - norm (str, optional) – Norm to use. (Default: 'ortho')
  - log_mels (bool, optional) – Whether to use log-mel spectrograms instead of dB-scaled. (Default: False)
  - melkwargs (dict or None, optional) – Arguments for MelSpectrogram. (Default: None)
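A sketch configuring the underlying MelSpectrogram through melkwargs (placeholder waveform, illustrative settings):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.randn(1, 16000)  # placeholder audio
    >>> # melkwargs is forwarded to the internal MelSpectrogram.
    >>> mfcc = T.MFCC(sample_rate=16000, n_mfcc=40,
    ...               melkwargs={'n_fft': 400, 'hop_length': 200, 'n_mels': 128})
    >>> coeffs = mfcc(waveform)  # (channel, n_mfcc, time)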
MuLawEncoding

class torchaudio.transforms.MuLawEncoding(quantization_channels: int = 256)

Encode a signal based on mu-law companding. For more info see the Wikipedia entry.

This algorithm assumes the signal has been scaled to between -1 and 1 and returns a signal encoded with values from 0 to quantization_channels - 1.

- Parameters
  - quantization_channels (int, optional) – Number of channels. (Default: 256)
MuLawDecoding

class torchaudio.transforms.MuLawDecoding(quantization_channels: int = 256)

Decode a mu-law encoded signal. For more info see the Wikipedia entry.

This expects an input with values between 0 and quantization_channels - 1 and returns a signal scaled between -1 and 1.

- Parameters
  - quantization_channels (int, optional) – Number of channels. (Default: 256)
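A round-trip sketch; note the input must already lie in [-1, 1] (the scaled random signal is a placeholder):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.rand(1, 16000) * 2 - 1  # placeholder signal scaled to [-1, 1]
    >>> encode = T.MuLawEncoding(quantization_channels=256)
    >>> decode = T.MuLawDecoding(quantization_channels=256)
    >>> quantized = encode(waveform)  # integer codes in [0, 255]
    >>> restored = decode(quantized)  # back to [-1, 1], up to quantization error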
Resample

class torchaudio.transforms.Resample(orig_freq: float = 16000, new_freq: float = 16000, resampling_method: str = 'sinc_interpolation', lowpass_filter_width: int = 6, rolloff: float = 0.99, beta: Optional[float] = None, *, dtype: Optional[torch.dtype] = None)

Resample a signal from one frequency to another. A resampling method can be given.

Note: if resampling on waveforms of higher precision than float32, there may be a small loss of precision because the kernel is cached once as float32. If high-precision resampling is important for your application, the functional form will retain higher precision, but run slower because it does not cache the kernel. Alternatively, you could rewrite a transform that caches a higher-precision kernel.

- Parameters
  - orig_freq (float, optional) – The original frequency of the signal. (Default: 16000)
  - new_freq (float, optional) – The desired frequency. (Default: 16000)
  - resampling_method (str, optional) – The resampling method to use. Options: [sinc_interpolation, kaiser_window] (Default: 'sinc_interpolation')
  - lowpass_filter_width (int, optional) – Controls the sharpness of the filter; more means sharper but less efficient. (Default: 6)
  - rolloff (float, optional) – The roll-off frequency of the filter, as a fraction of the Nyquist. Lower values reduce anti-aliasing, but also reduce some of the highest frequencies. (Default: 0.99)
  - beta (float or None) – The shape parameter used for the Kaiser window.
  - dtype (torch.dtype, optional) – Determines the precision at which the resampling kernel is pre-computed and cached. If not provided, the kernel is computed with torch.float64 then cached as torch.float32. If you need higher precision, provide torch.float64, and the pre-computed kernel is computed and cached as torch.float64. If you use resampling with lower precision, then instead of providing this argument, please use Resample.to(dtype), so that the kernel generation is still carried out on torch.float64.
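A sketch resampling 44.1 kHz audio to 16 kHz; constructing the transform once and reusing it amortizes the kernel computation (placeholder waveform):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform_44k = torch.randn(1, 44100)  # placeholder: 1 second at 44.1 kHz
    >>> # The kernel is pre-computed and cached for this (orig_freq, new_freq) pair.
    >>> resample = T.Resample(orig_freq=44100, new_freq=16000)
    >>> waveform_16k = resample(waveform_44k)  # (1, 16000)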
ComplexNorm

ComputeDeltas
TimeStretch

class torchaudio.transforms.TimeStretch(hop_length: Optional[int] = None, n_freq: int = 201, fixed_rate: Optional[float] = None)

Stretch an STFT in time without modifying pitch, for a given rate.

- Parameters
  - hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
  - n_freq (int, optional) – Number of filter banks from STFT. (Default: 201)
  - fixed_rate (float or None, optional) – Rate to speed up or slow down by. If None is provided, the rate must be passed to the forward method. (Default: None)

forward(complex_specgrams: torch.Tensor, overriding_rate: Optional[float] = None) → torch.Tensor

- Parameters
  - complex_specgrams (Tensor) – Either a real tensor of dimension (..., freq, num_frame, complex=2), or a tensor of dimension (..., freq, num_frame) with complex dtype.
  - overriding_rate (float or None, optional) – Speed-up to apply to this batch. If no rate is passed, use self.fixed_rate. (Default: None)
- Returns
  Stretched spectrogram. The resulting tensor is of the same dtype as the input spectrogram, but the number of frames is changed to ceil(num_frame / rate).
- Return type
  Tensor
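TimeStretch operates on a complex spectrogram; a sketch producing one with power=None and return_complex=True, where a rate above 1 speeds up and below 1 slows down (placeholder waveform, illustrative settings):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.randn(1, 16000)  # placeholder audio
    >>> # power=None keeps the complex spectrum; n_freq must match n_fft // 2 + 1.
    >>> spec = T.Spectrogram(n_fft=400, power=None, return_complex=True)(waveform)
    >>> stretch = T.TimeStretch(n_freq=201)
    >>> faster = stretch(spec, 1.25)  # ceil(num_frame / 1.25) frames, pitch unchanged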
Fade

class torchaudio.transforms.Fade(fade_in_len: int = 0, fade_out_len: int = 0, fade_shape: str = 'linear')

Add a fade in and/or fade out to a waveform.

- Parameters
  - fade_in_len (int, optional) – Length of fade-in (time frames). (Default: 0)
  - fade_out_len (int, optional) – Length of fade-out (time frames). (Default: 0)
  - fade_shape (str, optional) – Shape of fade. Must be one of: "quarter_sine", "half_sine", "linear", "logarithmic", "exponential". (Default: "linear")
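A sketch adding 0.1-second linear fades to 16 kHz audio; lengths are counted in samples of the waveform (placeholder signal):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.randn(1, 16000)  # placeholder audio at 16 kHz
    >>> # 1600 samples = 0.1 s at 16 kHz.
    >>> fade = T.Fade(fade_in_len=1600, fade_out_len=1600, fade_shape='linear')
    >>> faded = fade(waveform)  # same shape, both ends ramped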
FrequencyMasking

TimeMasking
Vol

class torchaudio.transforms.Vol(gain: float, gain_type: str = 'amplitude')

Adjust the volume of a waveform.

- Parameters
  - gain (float) – Interpreted according to the given gain_type: if gain_type = amplitude, gain is a positive amplitude ratio; if gain_type = power, gain is a power (voltage squared); if gain_type = db, gain is in decibels.
  - gain_type (str, optional) – Type of gain. One of: amplitude, power, db. (Default: amplitude)
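A sketch attenuating a waveform by 6 dB (the scaled random signal is a placeholder):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.rand(1, 16000) * 2 - 1  # placeholder signal in [-1, 1]
    >>> # gain_type='db' interprets gain as decibels; negative values attenuate.
    >>> quieter = T.Vol(gain=-6.0, gain_type='db')(waveform)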
SlidingWindowCmn

class torchaudio.transforms.SlidingWindowCmn(cmn_window: int = 600, min_cmn_window: int = 100, center: bool = False, norm_vars: bool = False)

Apply sliding-window cepstral mean (and optionally variance) normalization per utterance.

- Parameters
  - cmn_window (int, optional) – Window in frames for running-average CMN computation. (Default: 600)
  - min_cmn_window (int, optional) – Minimum CMN window used at the start of decoding (adds latency only at the start). Only applicable if center == False; ignored if center == True. (Default: 100)
  - center (bool, optional) – If True, use a window centered on the current frame (to the extent possible, modulo end effects). If False, the window is to the left. (Default: False)
  - norm_vars (bool, optional) – If True, normalize variance to one. (Default: False)
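A sketch applying CMN to MFCC features. This assumes the transform expects features laid out as (…, time, freq), hence the transposes around the call; verify the expected layout against your torchaudio version:

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> waveform = torch.randn(1, 16000)                    # placeholder audio
    >>> mfcc = T.MFCC(sample_rate=16000, n_mfcc=13)(waveform)  # (channel, n_mfcc, time)
    >>> cmn = T.SlidingWindowCmn(cmn_window=600, center=True)
    >>> # Assumed layout: CMN runs over frames, so move time to the second-to-last dim.
    >>> normalized = cmn(mfcc.transpose(-2, -1)).transpose(-2, -1)  # back to (channel, n_mfcc, time)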
SpectralCentroid

class torchaudio.transforms.SpectralCentroid(sample_rate: int, n_fft: int = 400, win_length: Optional[int] = None, hop_length: Optional[int] = None, pad: int = 0, window_fn: Callable[..., torch.Tensor] = torch.hann_window, wkwargs: Optional[dict] = None)

Compute the spectral centroid for each channel along the time axis.

The spectral centroid is defined as the weighted average of the frequency values, weighted by their magnitude.

- Parameters
  - sample_rate (int) – Sample rate of audio signal.
  - n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
  - win_length (int or None, optional) – Window size. (Default: n_fft)
  - hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
  - pad (int, optional) – Two-sided padding of the signal. (Default: 0)
  - window_fn (Callable[..., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
  - wkwargs (dict or None, optional) – Arguments for the window function. (Default: None)
- Example
    >>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
    >>> spectral_centroid = transforms.SpectralCentroid(sample_rate)(waveform)  # (channel, time)
Vad

class torchaudio.transforms.Vad(sample_rate: int, trigger_level: float = 7.0, trigger_time: float = 0.25, search_time: float = 1.0, allowed_gap: float = 0.25, pre_trigger_time: float = 0.0, boot_time: float = 0.35, noise_up_time: float = 0.1, noise_down_time: float = 0.01, noise_reduction_amount: float = 1.35, measure_freq: float = 20.0, measure_duration: Optional[float] = None, measure_smooth_time: float = 0.4, hp_filter_freq: float = 50.0, lp_filter_freq: float = 6000.0, hp_lifter_freq: float = 150.0, lp_lifter_freq: float = 2000.0)

Voice Activity Detector. Similar to the SoX implementation. Attempts to trim silence and quiet background sounds from the ends of recordings of speech. The algorithm currently uses a simple cepstral power measurement to detect voice, so it may be fooled by other things, especially music.

The effect can trim only from the front of the audio, so in order to trim from the back, the reverse effect must also be used.

- Parameters
  - sample_rate (int) – Sample rate of audio signal.
  - trigger_level (float, optional) – The measurement level used to trigger activity detection. This may need to be changed depending on the noise level, signal level, and other characteristics of the input audio. (Default: 7.0)
  - trigger_time (float, optional) – The time constant (in seconds) used to help ignore short bursts of sound. (Default: 0.25)
  - search_time (float, optional) – The amount of audio (in seconds) to search for quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 1.0)
  - allowed_gap (float, optional) – The allowed gap (in seconds) between quieter/shorter bursts of audio to include prior to the detected trigger point. (Default: 0.25)
  - pre_trigger_time (float, optional) – The amount of audio (in seconds) to preserve before the trigger point and any found quieter/shorter bursts. (Default: 0.0)
  - boot_time (float, optional) – The algorithm (internally) uses adaptive noise estimation/reduction in order to detect the start of the wanted audio. This option sets the time for the initial noise estimate. (Default: 0.35)
  - noise_up_time (float, optional) – Time constant used by the adaptive noise estimator for when the noise level is increasing. (Default: 0.1)
  - noise_down_time (float, optional) – Time constant used by the adaptive noise estimator for when the noise level is decreasing. (Default: 0.01)
  - noise_reduction_amount (float, optional) – The amount of noise reduction used in the detection algorithm (e.g. 0, 0.5, …). (Default: 1.35)
  - measure_freq (float, optional) – The frequency of the algorithm's processing/measurements. (Default: 20.0)
  - measure_duration (float, optional) – The measurement duration. (Default: twice the measurement period, i.e. with overlap.)
  - measure_smooth_time (float, optional) – The time constant used to smooth spectral measurements. (Default: 0.4)
  - hp_filter_freq (float, optional) – The "brick-wall" frequency of the high-pass filter applied at the input to the detector algorithm. (Default: 50.0)
  - lp_filter_freq (float, optional) – The "brick-wall" frequency of the low-pass filter applied at the input to the detector algorithm. (Default: 6000.0)
  - hp_lifter_freq (float, optional) – The "brick-wall" frequency of the high-pass lifter used in the detector algorithm. (Default: 150.0)
  - lp_lifter_freq (float, optional) – The "brick-wall" frequency of the low-pass lifter used in the detector algorithm. (Default: 2000.0)
- Reference
  http://sox.sourceforge.net/sox.html

forward(waveform: torch.Tensor) → torch.Tensor

- Parameters
  - waveform (Tensor) – Tensor of audio of dimension (channels, time) or (time). A tensor of shape (channels, time) is treated as a multi-channel recording of the same event, and the resulting output will be trimmed to the earliest voice activity in any channel.
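Since the transform only trims leading silence, trimming both ends is done by flipping, trimming again, and flipping back; a sketch (the random waveform is a placeholder for real speech):

    >>> import torch
    >>> import torchaudio.transforms as T
    >>> sample_rate = 16000
    >>> waveform = torch.randn(1, 32000)  # placeholder: replace with real speech
    >>> vad = T.Vad(sample_rate=sample_rate)
    >>> front_trimmed = vad(waveform)  # trims leading silence
    >>> both_trimmed = torch.flip(vad(torch.flip(front_trimmed, [-1])), [-1])  # trims trailing too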
References

[1] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and Music Signal Analysis in Python. In Kathryn Huff and James Bergstra, editors, Proceedings of the 14th Python in Science Conference, 18–24. 2015. doi:10.25080/Majora-7b98e3ed-003.

[2] Nathanaël Perraudin, Peter Balazs, and Peter L. Søndergaard. A Fast Griffin-Lim Algorithm. In 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1–4. 2013. doi:10.1109/WASPAA.2013.6701851.

[3] D. Griffin and Jae Lim. Signal Estimation from Modified Short-Time Fourier Transform. In ICASSP '83. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 8, 804–807. 1983. doi:10.1109/ICASSP.1983.1172092.