.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/audio_feature_extractions_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_tutorials_audio_feature_extractions_tutorial.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_audio_feature_extractions_tutorial.py:


Audio Feature Extractions
=========================

**Author**: Moto Hira

``torchaudio`` implements feature extractions commonly used in the audio
domain. They are available in ``torchaudio.functional`` and
``torchaudio.transforms``.

``functional`` implements features as standalone functions.
They are stateless.

``transforms`` implements features as objects,
using implementations from ``functional`` and ``torch.nn.Module``.
They can be serialized using TorchScript.

.. GENERATED FROM PYTHON SOURCE LINES 19-28

.. code-block:: default

    import torch
    import torchaudio
    import torchaudio.functional as F
    import torchaudio.transforms as T

    print(torch.__version__)
    print(torchaudio.__version__)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    1.13.0
    0.13.0

.. GENERATED FROM PYTHON SOURCE LINES 29-40

Preparation
-----------

.. note::

   When running this tutorial in Google Colab, install the required packages.

   .. code::

      !pip install librosa

.. GENERATED FROM PYTHON SOURCE LINES 40-82

.. code-block:: default

    from IPython.display import Audio
    import librosa
    import matplotlib.pyplot as plt
    from torchaudio.utils import download_asset

    torch.random.manual_seed(0)

    SAMPLE_SPEECH = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")


    def plot_waveform(waveform, sr, title="Waveform"):
        waveform = waveform.numpy()

        num_channels, num_frames = waveform.shape
        time_axis = torch.arange(0, num_frames) / sr

        figure, axes = plt.subplots(num_channels, 1)
        axes.plot(time_axis, waveform[0], linewidth=1)
        axes.grid(True)
        figure.suptitle(title)
        plt.show(block=False)


    def plot_spectrogram(specgram, title=None, ylabel="freq_bin"):
        fig, axs = plt.subplots(1, 1)
        axs.set_title(title or "Spectrogram (db)")
        axs.set_ylabel(ylabel)
        axs.set_xlabel("frame")
        im = axs.imshow(librosa.power_to_db(specgram), origin="lower", aspect="auto")
        fig.colorbar(im, ax=axs)
        plt.show(block=False)


    def plot_fbank(fbank, title=None):
        fig, axs = plt.subplots(1, 1)
        axs.set_title(title or "Filter bank")
        axs.imshow(fbank, aspect="auto")
        axs.set_ylabel("frequency bin")
        axs.set_xlabel("mel bin")
        plt.show(block=False)

.. GENERATED FROM PYTHON SOURCE LINES 83-94

Overview of audio features
--------------------------

The following diagram shows the relationship between common audio features
and the torchaudio APIs that generate them.

.. image:: https://download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png

For the complete list of available features, please refer to the
documentation.

.. GENERATED FROM PYTHON SOURCE LINES 97-103

Spectrogram
-----------

To get the frequency make-up of an audio signal as it varies with time,
you can use :py:func:`torchaudio.transforms.Spectrogram`.

.. GENERATED FROM PYTHON SOURCE LINES 103-110

.. code-block:: default

    SPEECH_WAVEFORM, SAMPLE_RATE = torchaudio.load(SAMPLE_SPEECH)

    plot_waveform(SPEECH_WAVEFORM, SAMPLE_RATE, title="Original waveform")
    Audio(SPEECH_WAVEFORM.numpy(), rate=SAMPLE_RATE)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_001.png
   :alt: Original waveform
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_001.png
   :class: sphx-glr-single-img
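Under the hood, ``T.Spectrogram`` is a thin wrapper around the short-time
Fourier transform. As a minimal sketch (using a random synthetic waveform
rather than the tutorial's speech clip, and plain ``torch.stft`` instead of
the transform), the power spectrogram and its expected shape can be computed
like this:

```python
import torch

torch.random.manual_seed(0)

# Stand-in for the speech clip: 1 channel, 1 second at 16 kHz (an assumption).
waveform = torch.randn(1, 16000)

n_fft, hop_length = 1024, 512
window = torch.hann_window(n_fft)

stft = torch.stft(
    waveform,
    n_fft=n_fft,
    hop_length=hop_length,
    window=window,
    center=True,          # pad so frames are centered, as T.Spectrogram does by default
    pad_mode="reflect",
    return_complex=True,
)
# power=2.0 turns the complex STFT into a power spectrogram
power_spec = stft.abs().pow(2.0)

# With center=True, the frame count is 1 + num_samples // hop_length
print(power_spec.shape)  # torch.Size([1, 513, 32])
```

Note how the frequency dimension is ``n_fft // 2 + 1`` (only the one-sided
spectrum is kept for real inputs), which matches the spectrograms plotted below.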
.. GENERATED FROM PYTHON SOURCE LINES 112-127

.. code-block:: default

    n_fft = 1024
    win_length = None
    hop_length = 512

    # Define transform
    spectrogram = T.Spectrogram(
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        center=True,
        pad_mode="reflect",
        power=2.0,
    )

.. GENERATED FROM PYTHON SOURCE LINES 129-133

.. code-block:: default

    # Perform transform
    spec = spectrogram(SPEECH_WAVEFORM)

.. GENERATED FROM PYTHON SOURCE LINES 135-138

.. code-block:: default

    plot_spectrogram(spec[0], title="torchaudio")

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_002.png
   :alt: torchaudio
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 139-144

GriffinLim
----------

To recover a waveform from a spectrogram, you can use ``GriffinLim``.

.. GENERATED FROM PYTHON SOURCE LINES 144-157

.. code-block:: default

    torch.random.manual_seed(0)

    n_fft = 1024
    win_length = None
    hop_length = 512

    spec = T.Spectrogram(
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
    )(SPEECH_WAVEFORM)

.. GENERATED FROM PYTHON SOURCE LINES 159-166

.. code-block:: default

    griffin_lim = T.GriffinLim(
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
    )

.. GENERATED FROM PYTHON SOURCE LINES 168-171

.. code-block:: default

    reconstructed_waveform = griffin_lim(spec)

.. GENERATED FROM PYTHON SOURCE LINES 173-177

.. code-block:: default

    plot_waveform(reconstructed_waveform, SAMPLE_RATE, title="Reconstructed")
    Audio(reconstructed_waveform, rate=SAMPLE_RATE)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_003.png
   :alt: Reconstructed
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_003.png
   :class: sphx-glr-single-img
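Before generating a filter bank, it may help to see the mel scale itself.
Below is a minimal sketch of the HTK mel-scale formula (the conversion used
when ``htk=True`` / ``mel_scale="htk"`` later in this tutorial); the example
frequencies are illustrative, not taken from the tutorial:

```python
import math


def hz_to_mel(freq: float) -> float:
    # HTK formula: m = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + freq / 700.0)


def mel_to_hz(mel: float) -> float:
    # Inverse of the HTK formula
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Mel filter banks place triangular filters at points evenly spaced on the
# mel axis; here are the center frequencies of a toy 4-filter bank.
f_min, f_max, n_mels = 0.0, 3000.0, 4
m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
centers_hz = [
    mel_to_hz(m_min + i * (m_max - m_min) / (n_mels + 1))
    for i in range(1, n_mels + 1)
]
print([round(c) for c in centers_hz])
```

Because the mel axis is logarithmic in frequency, the centers crowd together
at low frequencies and spread out at high frequencies, which is exactly the
shape visible in the filter bank plots below.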
.. GENERATED FROM PYTHON SOURCE LINES 178-187

Mel Filter Bank
---------------

:py:func:`torchaudio.functional.melscale_fbanks` generates the filter bank
for converting frequency bins to mel-scale bins.

Since this function does not require input audio/features, there is no
equivalent transform in :py:mod:`torchaudio.transforms`.

.. GENERATED FROM PYTHON SOURCE LINES 187-201

.. code-block:: default

    n_fft = 256
    n_mels = 64
    sample_rate = 6000

    mel_filters = F.melscale_fbanks(
        int(n_fft // 2 + 1),
        n_mels=n_mels,
        f_min=0.0,
        f_max=sample_rate / 2.0,
        sample_rate=sample_rate,
        norm="slaney",
    )

.. GENERATED FROM PYTHON SOURCE LINES 203-206

.. code-block:: default

    plot_fbank(mel_filters, "Mel Filter Bank - torchaudio")

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_004.png
   :alt: Mel Filter Bank - torchaudio
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_004.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 207-213

Comparison against librosa
~~~~~~~~~~~~~~~~~~~~~~~~~~

For reference, here is the equivalent way to get the mel filter bank
with ``librosa``.

.. GENERATED FROM PYTHON SOURCE LINES 213-224

.. code-block:: default

    mel_filters_librosa = librosa.filters.mel(
        sr=sample_rate,
        n_fft=n_fft,
        n_mels=n_mels,
        fmin=0.0,
        fmax=sample_rate / 2.0,
        norm="slaney",
        htk=True,
    ).T

.. GENERATED FROM PYTHON SOURCE LINES 226-232

.. code-block:: default

    plot_fbank(mel_filters_librosa, "Mel Filter Bank - librosa")

    mse = torch.square(mel_filters - mel_filters_librosa).mean().item()
    print("Mean Square Difference: ", mse)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_005.png
   :alt: Mel Filter Bank - librosa
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_005.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Mean Square Difference:  3.795462323290159e-17

.. GENERATED FROM PYTHON SOURCE LINES 233-241

MelSpectrogram
--------------

Generating a mel-scale spectrogram involves generating a spectrogram
and performing mel-scale conversion. In ``torchaudio``,
:py:func:`torchaudio.transforms.MelSpectrogram` provides
this functionality.

.. GENERATED FROM PYTHON SOURCE LINES 241-263

.. code-block:: default

    n_fft = 1024
    win_length = None
    hop_length = 512
    n_mels = 128

    mel_spectrogram = T.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        center=True,
        pad_mode="reflect",
        power=2.0,
        norm="slaney",
        onesided=True,
        n_mels=n_mels,
        mel_scale="htk",
    )

    melspec = mel_spectrogram(SPEECH_WAVEFORM)

.. GENERATED FROM PYTHON SOURCE LINES 265-268

.. code-block:: default

    plot_spectrogram(melspec[0], title="MelSpectrogram - torchaudio", ylabel="mel freq")

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_006.png
   :alt: MelSpectrogram - torchaudio
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_006.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 269-275

Comparison against librosa
~~~~~~~~~~~~~~~~~~~~~~~~~~

For reference, here is the equivalent way to generate mel-scale
spectrograms with ``librosa``.

.. GENERATED FROM PYTHON SOURCE LINES 275-290

.. code-block:: default

    melspec_librosa = librosa.feature.melspectrogram(
        y=SPEECH_WAVEFORM.numpy()[0],
        sr=sample_rate,
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        center=True,
        pad_mode="reflect",
        power=2.0,
        n_mels=n_mels,
        norm="slaney",
        htk=True,
    )

.. GENERATED FROM PYTHON SOURCE LINES 292-298

.. code-block:: default

    plot_spectrogram(melspec_librosa, title="MelSpectrogram - librosa", ylabel="mel freq")

    mse = torch.square(melspec - melspec_librosa).mean().item()
    print("Mean Square Difference: ", mse)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_007.png
   :alt: MelSpectrogram - librosa
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_007.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Mean Square Difference:  1.0343034206883317e-09

.. GENERATED FROM PYTHON SOURCE LINES 299-302

MFCC
----

Mel-frequency cepstral coefficients (MFCC) can be generated with
:py:func:`torchaudio.transforms.MFCC`.

.. GENERATED FROM PYTHON SOURCE LINES 302-322

.. code-block:: default

    n_fft = 2048
    win_length = None
    hop_length = 512
    n_mels = 256
    n_mfcc = 256

    mfcc_transform = T.MFCC(
        sample_rate=sample_rate,
        n_mfcc=n_mfcc,
        melkwargs={
            "n_fft": n_fft,
            "n_mels": n_mels,
            "hop_length": hop_length,
            "mel_scale": "htk",
        },
    )

    mfcc = mfcc_transform(SPEECH_WAVEFORM)

.. GENERATED FROM PYTHON SOURCE LINES 324-327

.. code-block:: default

    plot_spectrogram(mfcc[0])

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_008.png
   :alt: Spectrogram (db)
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_008.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 328-331

Comparison against librosa
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 331-350

.. code-block:: default

    melspec = librosa.feature.melspectrogram(
        y=SPEECH_WAVEFORM.numpy()[0],
        sr=sample_rate,
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        n_mels=n_mels,
        htk=True,
        norm=None,
    )

    mfcc_librosa = librosa.feature.mfcc(
        S=librosa.core.spectrum.power_to_db(melspec),
        n_mfcc=n_mfcc,
        dct_type=2,
        norm="ortho",
    )

.. GENERATED FROM PYTHON SOURCE LINES 352-358

.. code-block:: default

    plot_spectrogram(mfcc_librosa)

    mse = torch.square(mfcc - mfcc_librosa).mean().item()
    print("Mean Square Difference: ", mse)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_009.png
   :alt: Spectrogram (db)
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_009.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Mean Square Difference:  0.8103950023651123

.. GENERATED FROM PYTHON SOURCE LINES 359-362

LFCC
----

Linear-frequency cepstral coefficients (LFCC) can be generated with
:py:func:`torchaudio.transforms.LFCC`.

.. GENERATED FROM PYTHON SOURCE LINES 362-381

.. code-block:: default

    n_fft = 2048
    win_length = None
    hop_length = 512
    n_lfcc = 256

    lfcc_transform = T.LFCC(
        sample_rate=sample_rate,
        n_lfcc=n_lfcc,
        speckwargs={
            "n_fft": n_fft,
            "win_length": win_length,
            "hop_length": hop_length,
        },
    )

    lfcc = lfcc_transform(SPEECH_WAVEFORM)
    plot_spectrogram(lfcc[0])

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_010.png
   :alt: Spectrogram (db)
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_010.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 382-385

Pitch
-----

Pitch can be estimated with
:py:func:`torchaudio.functional.detect_pitch_frequency`.

.. GENERATED FROM PYTHON SOURCE LINES 385-388

.. code-block:: default

    pitch = F.detect_pitch_frequency(SPEECH_WAVEFORM, SAMPLE_RATE)

.. GENERATED FROM PYTHON SOURCE LINES 390-410

.. code-block:: default

    def plot_pitch(waveform, sr, pitch):
        figure, axis = plt.subplots(1, 1)
        axis.set_title("Pitch Feature")
        axis.grid(True)

        end_time = waveform.shape[1] / sr
        time_axis = torch.linspace(0, end_time, waveform.shape[1])
        axis.plot(time_axis, waveform[0], linewidth=1, color="gray", alpha=0.3)

        axis2 = axis.twinx()
        time_axis = torch.linspace(0, end_time, pitch.shape[1])
        axis2.plot(time_axis, pitch[0], linewidth=2, label="Pitch", color="green")

        axis2.legend(loc=0)
        plt.show(block=False)


    plot_pitch(SPEECH_WAVEFORM, SAMPLE_RATE, pitch)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_011.png
   :alt: Pitch Feature
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_011.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 411-429

Kaldi Pitch (beta)
------------------

Kaldi Pitch feature [1] is a pitch detection mechanism tuned for automatic
speech recognition (ASR) applications. This is a beta feature in
``torchaudio``, and it is available as
:py:func:`torchaudio.functional.compute_kaldi_pitch`.

1. A pitch extraction algorithm tuned for automatic speech recognition

   P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal and S.
   Khudanpur

   2014 IEEE International Conference on Acoustics, Speech and Signal
   Processing (ICASSP), Florence, 2014, pp. 2494-2498,
   doi: 10.1109/ICASSP.2014.6854049.

.. GENERATED FROM PYTHON SOURCE LINES 429-433

.. code-block:: default

    pitch_feature = F.compute_kaldi_pitch(SPEECH_WAVEFORM, SAMPLE_RATE)
    pitch, nfcc = pitch_feature[..., 0], pitch_feature[..., 1]

.. GENERATED FROM PYTHON SOURCE LINES 435-460

.. code-block:: default

    def plot_kaldi_pitch(waveform, sr, pitch, nfcc):
        _, axis = plt.subplots(1, 1)
        axis.set_title("Kaldi Pitch Feature")
        axis.grid(True)

        end_time = waveform.shape[1] / sr
        time_axis = torch.linspace(0, end_time, waveform.shape[1])
        axis.plot(time_axis, waveform[0], linewidth=1, color="gray", alpha=0.3)

        time_axis = torch.linspace(0, end_time, pitch.shape[1])
        ln1 = axis.plot(time_axis, pitch[0], linewidth=2, label="Pitch", color="green")
        axis.set_ylim((-1.3, 1.3))

        axis2 = axis.twinx()
        time_axis = torch.linspace(0, end_time, nfcc.shape[1])
        ln2 = axis2.plot(time_axis, nfcc[0], linewidth=2, label="NFCC", color="blue", linestyle="--")

        lns = ln1 + ln2
        labels = [l.get_label() for l in lns]
        axis.legend(lns, labels, loc=0)
        plt.show(block=False)


    plot_kaldi_pitch(SPEECH_WAVEFORM, SAMPLE_RATE, pitch, nfcc)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_012.png
   :alt: Kaldi Pitch Feature
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_012.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 5.581 seconds)


.. _sphx_glr_download_tutorials_audio_feature_extractions_tutorial.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: audio_feature_extractions_tutorial.py <audio_feature_extractions_tutorial.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: audio_feature_extractions_tutorial.ipynb <audio_feature_extractions_tutorial.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_