MVDR
- class torchaudio.transforms.MVDR(ref_channel: int = 0, solution: str = 'ref_channel', multi_mask: bool = False, diag_loading: bool = True, diag_eps: float = 1e-07, online: bool = False)
Minimum Variance Distortionless Response (MVDR) module that performs MVDR beamforming with Time-Frequency masks.
Based on https://github.com/espnet/espnet/blob/master/espnet2/enh/layers/beamformer.py
We provide three solutions of MVDR beamforming. One is based on reference channel selection [Souden et al., 2009] (solution="ref_channel"):
\[\textbf{w}_{\text{MVDR}}(f) = \frac{{\bf{\Phi}_{\textbf{NN}}^{-1}}(f)\,{\bf{\Phi}_{\textbf{SS}}}(f)}{\text{Trace}\left({\bf{\Phi}_{\textbf{NN}}^{-1}}(f)\,{\bf{\Phi}_{\textbf{SS}}}(f)\right)}\bm{u} \]
where \(\bf{\Phi}_{\textbf{SS}}\) and \(\bf{\Phi}_{\textbf{NN}}\) are the covariance matrices of speech and noise, respectively, and \(\bm{u}\) is a one-hot vector that selects the reference channel.
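The reference-channel formula above can be sketched in a few lines of NumPy. This is an illustration of the equation only, not the library's implementation: the function name `mvdr_weights_ref_channel`, the (freq, channel, channel) array layout, and the trace-scaled form of diagonal loading are all assumptions made for this example.

```python
import numpy as np

def mvdr_weights_ref_channel(psd_s, psd_n, ref_channel=0, diag_eps=1e-7):
    """Hypothetical helper: w = (Phi_NN^{-1} Phi_SS / Trace(Phi_NN^{-1} Phi_SS)) u.

    psd_s, psd_n: complex arrays of shape (freq, channel, channel).
    Returns weights of shape (freq, channel).
    """
    n_ch = psd_n.shape[-1]
    # Diagonal loading (assumed form): add a small multiple of the identity,
    # scaled by the trace, so the noise PSD stays well conditioned.
    trace_n = np.trace(psd_n, axis1=-2, axis2=-1).real[..., None, None]
    psd_n = psd_n + diag_eps * trace_n * np.eye(n_ch)
    # Solve Phi_NN X = Phi_SS instead of forming the inverse explicitly.
    numerator = np.linalg.solve(psd_n, psd_s)
    trace = np.trace(numerator, axis1=-2, axis2=-1)[..., None, None]
    ws = numerator / trace
    # Multiplying by the one-hot vector u selects the reference column.
    return ws[..., ref_channel]
```

For a rank-one speech covariance built from a vector v, these weights pass the source through with the gain it has at the reference channel, which is a convenient sanity check.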
The other two solutions are based on the steering vector (solution="stv_evd" or solution="stv_power"):
\[\textbf{w}_{\text{MVDR}}(f) = \frac{{\bf{\Phi}_{\textbf{NN}}^{-1}}(f)\,\bm{v}(f)}{\bm{v}^{\mathsf{H}}(f)\,{\bf{\Phi}_{\textbf{NN}}^{-1}}(f)\,\bm{v}(f)} \]
where \(\bm{v}\) is the acoustic transfer function or the steering vector, and \((\cdot)^{\mathsf{H}}\) denotes the Hermitian (conjugate) transpose.
We apply either eigenvalue decomposition [Higuchi et al., 2016] or the power method [Mises and Pollaczek-Geiringer, 1929] to get the steering vector from the PSD matrix of speech.
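As a rough NumPy sketch of the two estimators and of the steering-vector weight formula: the function names, the (freq, channel, channel) layout, the starting vector, and the iteration count below are assumptions for illustration, and the library may add further scaling or regularization that is not shown here.

```python
import numpy as np

def steering_vector_evd(psd_s):
    # eigh returns eigenvalues in ascending order, so the last
    # eigenvector corresponds to the largest eigenvalue.
    _, vecs = np.linalg.eigh(psd_s)
    return vecs[..., -1]  # shape (freq, channel)

def steering_vector_power(psd_s, ref_channel=0, n_iter=3):
    # Power method: start from the reference column (an assumed choice)
    # and repeatedly multiply by the PSD matrix, renormalizing each time.
    v = psd_s[..., ref_channel]
    for _ in range(n_iter):
        v = np.einsum("fcd,fd->fc", psd_s, v)
        v = v / np.linalg.norm(v, axis=-1, keepdims=True)
    return v

def mvdr_weights_steering(v, psd_n):
    # w = Phi_NN^{-1} v / (v^H Phi_NN^{-1} v), computed per frequency bin.
    num = np.linalg.solve(psd_n, v[..., None])[..., 0]
    den = np.einsum("fc,fc->f", v.conj(), num)
    return num / den[..., None]
```

Both estimators return the dominant eigenvector only up to a complex phase, so their outputs should be compared by the magnitude of their inner product rather than element by element.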
After estimating the beamforming weight, the enhanced Short-time Fourier Transform (STFT) is obtained by
\[\hat{\bf{S}} = {\bf{w}^\mathsf{H}}{\bf{Y}}, {\bf{w}} \in \mathbb{C}^{M \times F} \]where \(\bf{Y}\) and \(\hat{\bf{S}}\) are the STFT of the multi-channel noisy speech and the single-channel enhanced speech, respectively, \(M\) is the number of channels, and \(F\) is the number of frequency bins.
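Applying the weights amounts to a conjugated sum over channels in every time-frequency bin. A minimal NumPy sketch follows, assuming weights stored as (freq, channel) and a spectrogram stored as (channel, freq, time); the helper name is hypothetical.

```python
import numpy as np

def apply_beamforming_weights(weights, specgram):
    """Hypothetical helper computing S_hat = w^H Y per frequency bin.

    weights:  complex array of shape (freq, channel)
    specgram: complex array of shape (channel, freq, time)
    Returns the enhanced spectrum of shape (freq, time).
    """
    # Conjugate the weights, then sum over the channel axis.
    return np.einsum("fc,cft->ft", weights.conj(), specgram)
```

With one-hot weights the function simply returns the selected channel unchanged, which makes the channel-summing role of the weights easy to verify.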
For online streaming audio, we provide a recursive method [Higuchi et al., 2017] to update the PSD matrices of speech and noise, respectively.
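One way to picture a recursive PSD update is as a running mask-weighted average over chunks of frames. The sketch below is an assumption-laden illustration of that idea, not necessarily the exact recursion used by the library or by the cited paper; `masked_psd` and `update_psd` are hypothetical names.

```python
import numpy as np

def masked_psd(specgram, mask):
    """Mask-weighted PSD of one chunk.

    specgram: complex array (channel, freq, time); mask: real array (freq, time).
    Returns the normalized PSD (freq, channel, channel) and the mask sum (freq,).
    """
    psd = np.einsum("ft,cft,dft->fcd", mask, specgram, specgram.conj())
    mask_sum = mask.sum(axis=-1)
    return psd / mask_sum[:, None, None], mask_sum

def update_psd(psd_old, sum_old, psd_new, sum_new):
    """Merge the previous estimate with a new chunk, weighting by mask sums."""
    total = sum_old + sum_new
    psd = (psd_old * sum_old[:, None, None]
           + psd_new * sum_new[:, None, None]) / total[:, None, None]
    return psd, total
```

Under these assumptions, updating chunk by chunk gives the same PSD as processing all frames at once, so only the previous PSD and mask sum need to be kept between calls.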
- Parameters:
ref_channel (int, optional) – Reference channel for beamforming. (Default: 0)
solution (str, optional) – Solution to compute the MVDR beamforming weights. Options: [ref_channel, stv_evd, stv_power]. (Default: ref_channel)
multi_mask (bool, optional) – If True, only accepts multi-channel Time-Frequency masks. (Default: False)
diag_loading (bool, optional) – If True, applies diagonal loading to the covariance matrix of the noise. (Default: True)
diag_eps (float, optional) – The coefficient multiplied to the identity matrix for diagonal loading. It is only effective when diag_loading is set to True. (Default: 1e-7)
online (bool, optional) – If True, updates the MVDR beamforming weights based on the previous covariance matrices. (Default: False)
Note
To improve the numerical stability, the input spectrogram is converted to double precision (torch.complex128 or torch.cdouble) for internal computation. The output spectrogram is converted back to the dtype of the input spectrogram to be compatible with other modules.
Note
If you use the stv_evd solution, the gradient of the same input may not be identical if the eigenvalues of the PSD matrix are not distinct (i.e. some eigenvalues are close or identical).
- forward(specgram: Tensor, mask_s: Tensor, mask_n: Optional[Tensor] = None) → Tensor
Perform MVDR beamforming.
- Parameters:
specgram (torch.Tensor) – Multi-channel complex-valued spectrum. Tensor with dimensions (…, channel, freq, time).
mask_s (torch.Tensor) – Time-Frequency mask of target speech. Tensor with dimensions (…, freq, time) if multi_mask is False, or with dimensions (…, channel, freq, time) if multi_mask is True.
mask_n (torch.Tensor or None, optional) – Time-Frequency mask of noise. Tensor with dimensions (…, freq, time) if multi_mask is False, or with dimensions (…, channel, freq, time) if multi_mask is True. (Default: None)
- Returns:
Single-channel complex-valued enhanced spectrum with dimensions (…, freq, time).
- Return type: