MVDR

class torchaudio.transforms.MVDR(ref_channel: int = 0, solution: str = 'ref_channel', multi_mask: bool = False, diag_loading: bool = True, diag_eps: float = 1e-07, online: bool = False)

Minimum Variance Distortionless Response (MVDR) module that performs MVDR beamforming with Time-Frequency masks.

Based on https://github.com/espnet/espnet/blob/master/espnet2/enh/layers/beamformer.py

We provide three solutions of MVDR beamforming. One is based on reference channel selection [Souden et al., 2009] (solution=ref_channel).

$$\textbf{w}_{\text{MVDR}}(f) = \frac{{\bf{\Phi}_{\textbf{NN}}^{-1}}(f){\bf{\Phi}_{\textbf{SS}}}(f)}{\text{Trace}\left({\bf{\Phi}_{\textbf{NN}}^{-1}}(f){\bf{\Phi}_{\textbf{SS}}}(f)\right)}\bm{u}$$

where $\bf{\Phi}_{\textbf{SS}}$ and $\bf{\Phi}_{\textbf{NN}}$ are the covariance matrices of speech and noise, respectively. $\bm{u}$ is a one-hot vector that selects the reference channel.
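The reference-channel formula can be illustrated with a minimal sketch in plain PyTorch. It is not the torchaudio implementation: the helper names `mask_weighted_psd` and `mvdr_weights_ref_channel` are hypothetical, and the mask-weighted averaging used to estimate the covariance matrices is an assumption made for this example.

```python
import torch

def mask_weighted_psd(specgram, mask, eps=1e-15):
    # specgram: (channel, freq, time) complex, mask: (freq, time) real.
    # Assumed estimator: mask-weighted average of the outer products Y Y^H.
    spec = specgram.permute(1, 0, 2)                      # (freq, channel, time)
    weighted = spec * mask.unsqueeze(1)                   # apply the mask per frame
    psd = torch.einsum("fct,fet->fce", weighted, spec.conj())
    return psd / (mask.sum(dim=-1)[:, None, None] + eps)  # (freq, channel, channel)

def mvdr_weights_ref_channel(psd_s, psd_n, ref_channel=0):
    # Phi_NN^{-1} Phi_SS, computed with a linear solve instead of an explicit inverse.
    numerator = torch.linalg.solve(psd_n, psd_s)          # (freq, channel, channel)
    trace = numerator.diagonal(dim1=-2, dim2=-1).sum(-1)  # (freq,)
    ws = numerator / trace[:, None, None]
    # Multiplying by the one-hot vector u selects one column.
    return ws[..., ref_channel]                           # (freq, channel)

torch.manual_seed(0)
channels, freqs, frames = 4, 5, 50
specgram = torch.randn(channels, freqs, frames, dtype=torch.cdouble)
mask_s = torch.rand(freqs, frames, dtype=torch.double)
mask_n = 1.0 - mask_s

psd_s = mask_weighted_psd(specgram, mask_s)
psd_n = mask_weighted_psd(specgram, mask_n)
w = mvdr_weights_ref_channel(psd_s, psd_n, ref_channel=0)
print(w.shape)
```

Because the matrix is normalized by its trace, the entries `w[:, c]` obtained with `ref_channel=c` sum to one over all channels `c`, which is a quick sanity check on the normalization.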

The other two solutions are based on the steering vector (solution=stv_evd or solution=stv_power).

$$\textbf{w}_{\text{MVDR}}(f) = \frac{{\bf{\Phi}_{\textbf{NN}}^{-1}}(f){\bm{v}}(f)}{{\bm{v}^{\mathsf{H}}}(f){\bf{\Phi}_{\textbf{NN}}^{-1}}(f){\bm{v}}(f)}$$

where $\bm{v}$ is the acoustic transfer function or the steering vector. $(\cdot)^{\mathsf{H}}$ denotes the Hermitian conjugate (conjugate transpose) operation.

We apply either eigenvalue decomposition [Higuchi et al., 2016] or the power method [Mises and Pollaczek-Geiringer, 1929] to estimate the steering vector from the power spectral density (PSD) matrix of speech.
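The steering-vector solutions can be sketched in the same way. The code below is an illustration under stated assumptions, not the library's code: the function names are hypothetical, the power iteration starts from the first column of the speech PSD matrix, and the iteration count is arbitrary. The torchaudio implementation may differ in details such as initialization and normalization.

```python
import torch

def steering_vector_evd(psd_s):
    # eigh returns eigenvalues in ascending order, so the last
    # eigenvector is the principal one.
    _, vecs = torch.linalg.eigh(psd_s)
    return vecs[..., -1]                                   # (freq, channel)

def steering_vector_power(psd_s, n_iter=3):
    # Plain power iteration on the speech PSD matrix (assumed start vector:
    # the first column of psd_s).
    v = psd_s[..., 0]
    for _ in range(n_iter):
        v = torch.einsum("fce,fe->fc", psd_s, v)
        v = v / torch.linalg.vector_norm(v, dim=-1, keepdim=True)
    return v                                               # (freq, channel)

def mvdr_weights_stv(psd_n, stv):
    # Numerator: Phi_NN^{-1} v, denominator: v^H Phi_NN^{-1} v.
    num = torch.linalg.solve(psd_n, stv.unsqueeze(-1)).squeeze(-1)
    denom = torch.einsum("fc,fc->f", stv.conj(), num)
    return num / denom.unsqueeze(-1)                       # (freq, channel)

torch.manual_seed(0)
freqs, channels, frames = 5, 4, 50
a = torch.randn(freqs, channels, dtype=torch.cdouble)
eye = torch.eye(channels, dtype=torch.cdouble)
# Synthetic speech PSD: a dominant rank-1 term plus a small diagonal.
psd_s = torch.einsum("fc,fe->fce", a, a.conj()) + 0.01 * eye
noise = torch.randn(freqs, channels, frames, dtype=torch.cdouble)
psd_n = torch.einsum("fct,fet->fce", noise, noise.conj()) / frames

v = steering_vector_evd(psd_s)
w = mvdr_weights_stv(psd_n, v)
# Distortionless constraint: w^H v should be close to 1 at every frequency.
response = torch.einsum("fc,fc->f", w.conj(), v)
print(response.abs())
```

The eigenvalue decomposition and the power method estimate the same principal eigenvector up to a complex phase, so their results are best compared through the magnitude of their inner product rather than element by element.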

After estimating the beamforming weight, the enhanced Short-time Fourier Transform (STFT) is obtained by

$$\hat{\bf{S}} = {\bf{w}^\mathsf{H}}{\bf{Y}}, \quad {\bf{w}} \in \mathbb{C}^{M \times F}$$

where $\bf{Y}$ and $\hat{\bf{S}}$ are the STFT of the multi-channel noisy speech and the single-channel enhanced speech, respectively, $M$ is the number of channels, and $F$ is the number of frequency bins.
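Applying the weights amounts to a conjugated sum over the channel dimension at each frequency bin. The sketch below assumes that $M$ is the number of channels and $F$ the number of frequency bins, and uses a hypothetical helper name.

```python
import torch

def apply_beamforming_weights(weights, specgram):
    # weights: (channel, freq), specgram: (channel, freq, time).
    # Computes w^H Y independently for every frequency bin and frame.
    return torch.einsum("cf,cft->ft", weights.conj(), specgram)

torch.manual_seed(0)
channels, freqs, frames = 4, 5, 10
specgram = torch.randn(channels, freqs, frames, dtype=torch.cdouble)

# One-hot weights on channel 0 simply pass that channel through.
weights = torch.zeros(channels, freqs, dtype=torch.cdouble)
weights[0] = 1.0
enhanced = apply_beamforming_weights(weights, specgram)
print(enhanced.shape)
```

With one-hot weights the output equals the selected input channel, which makes this a convenient check that the conjugation and the summation axis are correct.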

For online streaming audio, we provide a recursive method [Higuchi et al., 2017] to update the PSD matrices of speech and noise, respectively.

Parameters:

ref_channel (int, optional) – Reference channel for beamforming. (Default: 0)
solution (str, optional) – Solution to compute the MVDR beamforming weights. Options: [ref_channel, stv_evd, stv_power]. (Default: ref_channel)
multi_mask (bool, optional) – If True, only accepts multi-channel Time-Frequency masks. (Default: False)
diag_loading (bool, optional) – If True, applies diagonal loading to the covariance matrix of the noise. (Default: True)
diag_eps (float, optional) – The coefficient multiplied by the identity matrix for diagonal loading. It is only effective when diag_loading is set to True. (Default: 1e-7)
online (bool, optional) – If True, updates the MVDR beamforming weights based on the previous covariance matrices. (Default: False)
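To show what diagonal loading does to the noise covariance matrix, the sketch below adds `diag_eps` times the identity to a rank-deficient matrix. This is the simplest form of the technique and an assumption for illustration. The helper name is hypothetical, and the library may scale the loading term differently (for example, relative to the trace of the matrix).

```python
import torch

def diagonal_loading(psd, diag_eps=1e-7):
    # Add diag_eps * I to every (channel, channel) matrix in the batch.
    eye = torch.eye(psd.size(-1), dtype=psd.dtype, device=psd.device)
    return psd + diag_eps * eye

torch.manual_seed(0)
freqs, channels, frames = 3, 4, 2
# Only 2 frames for 4 channels, so each covariance matrix has rank at most 2.
noise = torch.randn(freqs, channels, frames, dtype=torch.cdouble)
psd_n = torch.einsum("fct,fet->fce", noise, noise.conj()) / frames
loaded = diagonal_loading(psd_n, diag_eps=1e-7)

# Smallest eigenvalue per frequency bin, before and after loading.
min_before = torch.linalg.eigvalsh(psd_n).min(dim=-1).values
min_after = torch.linalg.eigvalsh(loaded).min(dim=-1).values
print(min_before, min_after)
```

Loading shifts every eigenvalue up by `diag_eps`, so a singular matrix becomes invertible while well-conditioned matrices are left almost unchanged.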

Note

To improve numerical stability, the input spectrogram is converted to double precision (torch.complex128 or torch.cdouble) for internal computation. The output spectrogram is converted back to the dtype of the input spectrogram to stay compatible with other modules.

Note

If you use the stv_evd solution, gradients computed for the same input may not be identical when the eigenvalues of the PSD matrix are not distinct (i.e. some eigenvalues are close or identical).

forward(specgram: Tensor, mask_s: Tensor, mask_n: Optional[Tensor] = None) → Tensor

Perform MVDR beamforming.

Parameters:

specgram (torch.Tensor) – Multi-channel complex-valued spectrum. Tensor with dimensions (…, channel, freq, time)
mask_s (torch.Tensor) – Time-Frequency mask of target speech. Tensor with dimensions (…, freq, time) if multi_mask is False or with dimensions (…, channel, freq, time) if multi_mask is True.
mask_n (torch.Tensor or None, optional) – Time-Frequency mask of noise. Tensor with dimensions (…, freq, time) if multi_mask is False or with dimensions (…, channel, freq, time) if multi_mask is True. (Default: None)

Returns:

Single-channel complex-valued enhanced spectrum with dimensions (…, freq, time).

Return type:

torch.Tensor
