Note
Click here to download the full example code
MVDR with torchaudio
Author Zhaoheng Ni
Overview
This is a tutorial on how to apply MVDR beamforming by using torchaudio.
Steps
Ideal Ratio Mask (IRM) is generated by dividing the clean/noise magnitude by the mixture magnitude.
We test all three solutions (
ref_channel
,stv_evd
,stv_power
) of torchaudio’s MVDR module.We test the single-channel and multi-channel masks for MVDR beamforming. The multi-channel mask is averaged along channel dimension when computing the covariance matrices of speech and noise, respectively.
Preparation
First, we import the necessary packages and retrieve the data.
The multi-channel audio example is selected from ConferencingSpeech dataset.
The original filename is
SSB07200001\#noise-sound-bible-0038\#7.86_6.16_3.00_3.14_4.84_134.5285_191.7899_0.4735\#15217\#25.16333303751458\#0.2101221178590021.wav
which was generated with;
SSB07200001.wav
from AISHELL-3 (Apache License v.2.0)noise-sound-bible-0038.wav
from MUSAN (Attribution 4.0 International — CC BY 4.0)
import os
import requests
import torch
import torchaudio
import IPython.display as ipd
torch.random.manual_seed(0)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(torch.__version__)
print(torchaudio.__version__)
print(device)
filenames = [
'mix.wav',
'reverb_clean.wav',
'clean.wav',
]
base_url = 'https://download.pytorch.org/torchaudio/tutorial-assets/mvdr'
for filename in filenames:
os.makedirs('_assets', exist_ok=True)
if not os.path.exists(filename):
with open(f'_assets/{filename}', 'wb') as file:
file.write(requests.get(f'{base_url}/{filename}').content)
Out:
1.10.0+cpu
0.10.0+cpu
cpu
Generate the Ideal Ratio Mask (IRM)
Loading audio data
Note
The MVDR Module requires torch.cdouble
dtype for noisy STFT.
We need to convert the dtype of the waveforms to torch.double
mix = mix.to(torch.double)
noise = noise.to(torch.double)
clean = clean.to(torch.double)
reverb_clean = reverb_clean.to(torch.double)
Compute STFT
stft = torchaudio.transforms.Spectrogram(
n_fft=1024,
hop_length=256,
power=None,
)
istft = torchaudio.transforms.InverseSpectrogram(n_fft=1024, hop_length=256)
spec_mix = stft(mix)
spec_clean = stft(clean)
spec_reverb_clean = stft(reverb_clean)
spec_noise = stft(noise)
Generate the Ideal Ratio Mask (IRM)
Note
We found using the mask directly peforms better than using the square root of it. This is slightly different from the definition of IRM.
def get_irms(spec_clean, spec_noise, spec_mix):
mag_mix = spec_mix.abs() ** 2
mag_clean = spec_clean.abs() ** 2
mag_noise = spec_noise.abs() ** 2
irm_speech = mag_clean / (mag_clean + mag_noise)
irm_noise = mag_noise / (mag_clean + mag_noise)
return irm_speech, irm_noise
Note
We use reverberant clean speech as the target here, you can also set it to dry clean speech.
irm_speech, irm_noise = get_irms(spec_reverb_clean, spec_noise, spec_mix)
Apply MVDR
Apply MVDR beamforming by using multi-channel masks
results_multi = {}
for solution in ['ref_channel', 'stv_evd', 'stv_power']:
mvdr = torchaudio.transforms.MVDR(ref_channel=0, solution=solution, multi_mask=True)
stft_est = mvdr(spec_mix, irm_speech, irm_noise)
est = istft(stft_est, length=mix.shape[-1])
results_multi[solution] = est
Apply MVDR beamforming by using single-channel masks
We use the 1st channel as an example. The channel selection may depend on the design of the microphone array
results_single = {}
for solution in ['ref_channel', 'stv_evd', 'stv_power']:
mvdr = torchaudio.transforms.MVDR(ref_channel=0, solution=solution, multi_mask=False)
stft_est = mvdr(spec_mix, irm_speech[0], irm_noise[0])
est = istft(stft_est, length=mix.shape[-1])
results_single[solution] = est
Compute Si-SDR scores
def si_sdr(estimate, reference, epsilon=1e-8):
estimate = estimate - estimate.mean()
reference = reference - reference.mean()
reference_pow = reference.pow(2).mean(axis=1, keepdim=True)
mix_pow = (estimate * reference).mean(axis=1, keepdim=True)
scale = mix_pow / (reference_pow + epsilon)
reference = scale * reference
error = estimate - reference
reference_pow = reference.pow(2)
error_pow = error.pow(2)
reference_pow = reference_pow.mean(axis=1)
error_pow = error_pow.mean(axis=1)
sisdr = 10 * torch.log10(reference_pow) - 10 * torch.log10(error_pow)
return sisdr.item()
Results
Single-channel mask results
for solution in results_single:
print(solution+": ", si_sdr(results_single[solution][None,...], reverb_clean[0:1]))
Out:
ref_channel: 15.035907456985868
stv_evd: 16.563734673832553
stv_power: 17.820481909929907
Multi-channel mask results
for solution in results_multi:
print(solution+": ", si_sdr(results_multi[solution][None,...], reverb_clean[0:1]))
Out:
ref_channel: 13.177373866143256
stv_evd: 12.433610809532858
stv_power: 12.897505397104673
Enhanced audio
Single-channel mask, stv_power solution
ipd.Audio(results_single['stv_power'], rate=16000)
Total running time of the script: ( 0 minutes 0.841 seconds)