.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/audio_feature_extractions_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_tutorials_audio_feature_extractions_tutorial.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_audio_feature_extractions_tutorial.py:


Audio Feature Extractions
=========================

``torchaudio`` implements feature extractions commonly used in the audio
domain. They are available in ``torchaudio.functional`` and
``torchaudio.transforms``.

``functional`` implements features as standalone functions. They are
stateless.

``transforms`` implements features as objects, using implementations from
``functional`` and ``torch.nn.Module``. They can be serialized using
TorchScript.

.. GENERATED FROM PYTHON SOURCE LINES 17-26

.. code-block:: default

    import torch
    import torchaudio
    import torchaudio.functional as F
    import torchaudio.transforms as T

    print(torch.__version__)
    print(torchaudio.__version__)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    1.12.0
    0.12.0

.. GENERATED FROM PYTHON SOURCE LINES 27-38

Preparation
-----------

.. note::

   When running this tutorial in Google Colab, install the required packages:

   .. code::

      !pip install librosa

.. GENERATED FROM PYTHON SOURCE LINES 38-80

.. code-block:: default

    from IPython.display import Audio
    import librosa
    import matplotlib.pyplot as plt
    from torchaudio.utils import download_asset

    torch.random.manual_seed(0)

    SAMPLE_SPEECH = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")


    def plot_waveform(waveform, sr, title="Waveform"):
        waveform = waveform.numpy()

        num_channels, num_frames = waveform.shape
        time_axis = torch.arange(0, num_frames) / sr

        # ``squeeze=False`` keeps a 2D array of axes so this also
        # works for multi-channel audio.
        figure, axes = plt.subplots(num_channels, 1, squeeze=False)
        for ch in range(num_channels):
            axes[ch][0].plot(time_axis, waveform[ch], linewidth=1)
            axes[ch][0].grid(True)
        figure.suptitle(title)
        plt.show(block=False)


    def plot_spectrogram(specgram, title=None, ylabel="freq_bin"):
        fig, axs = plt.subplots(1, 1)
        axs.set_title(title or "Spectrogram (db)")
        axs.set_ylabel(ylabel)
        axs.set_xlabel("frame")
        im = axs.imshow(librosa.power_to_db(specgram), origin="lower", aspect="auto")
        fig.colorbar(im, ax=axs)
        plt.show(block=False)


    def plot_fbank(fbank, title=None):
        fig, axs = plt.subplots(1, 1)
        axs.set_title(title or "Filter bank")
        axs.imshow(fbank, aspect="auto")
        axs.set_ylabel("frequency bin")
        axs.set_xlabel("mel bin")
        plt.show(block=False)

.. GENERATED FROM PYTHON SOURCE LINES 81-92

Overview of audio features
--------------------------

The following diagram shows the relationship between common audio features
and torchaudio APIs to generate them.

.. image:: https://download.pytorch.org/torchaudio/tutorial-assets/torchaudio_feature_extractions.png

For the complete list of available features, please refer to the
documentation.

.. GENERATED FROM PYTHON SOURCE LINES 95-101

Spectrogram
-----------

To get the frequency make-up of an audio signal as it varies with time,
you can use :py:class:`torchaudio.transforms.Spectrogram`.

.. GENERATED FROM PYTHON SOURCE LINES 101-108

.. code-block:: default

    SPEECH_WAVEFORM, SAMPLE_RATE = torchaudio.load(SAMPLE_SPEECH)

    plot_waveform(SPEECH_WAVEFORM, SAMPLE_RATE, title="Original waveform")
    Audio(SPEECH_WAVEFORM.numpy(), rate=SAMPLE_RATE)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_001.png
   :alt: Original waveform
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_001.png
   :class: sphx-glr-single-img
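The spectrogram plots in this tutorial are rendered in decibels via
``librosa.power_to_db`` (see the ``plot_spectrogram`` helper defined in
Preparation). With the default reference of ``1.0``, that conversion is
essentially ``10 * log10``. The helper below is our own minimal sketch of
the idea; note that librosa additionally clips values more than ``top_db``
below the maximum, which is omitted here.

```python
import math

def power_to_db(power, ref=1.0, amin=1e-10):
    # Decibels relative to ``ref``; tiny values are floored at
    # ``amin`` so that log10 never sees zero.
    return 10.0 * math.log10(max(power, amin)) - 10.0 * math.log10(max(ref, amin))

print(power_to_db(100.0))  # a power 100x the reference is +20 dB
print(power_to_db(1.0))    # equal to the reference: 0 dB
```
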


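The parameters defined next control the shape of the resulting spectrogram.
As a rough sketch (our own arithmetic, assuming the defaults
``center=True`` and ``onesided=True``): the one-sided output has
``n_fft // 2 + 1`` frequency bins, and center padding makes the number of
frames ``1 + num_samples // hop_length``.

```python
def spectrogram_shape(num_samples, n_fft, hop_length):
    # Shape of a centered, one-sided STFT:
    # frequency bins span DC..Nyquist; frames advance by hop_length samples.
    n_freq = n_fft // 2 + 1
    n_frames = 1 + num_samples // hop_length
    return n_freq, n_frames

# One second of 16 kHz audio with the settings used below:
print(spectrogram_shape(16000, n_fft=1024, hop_length=512))  # (513, 32)
```
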
.. GENERATED FROM PYTHON SOURCE LINES 110-125

.. code-block:: default

    n_fft = 1024
    win_length = None
    hop_length = 512

    # Define transform
    spectrogram = T.Spectrogram(
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        center=True,
        pad_mode="reflect",
        power=2.0,
    )

.. GENERATED FROM PYTHON SOURCE LINES 127-131

.. code-block:: default

    # Perform transform
    spec = spectrogram(SPEECH_WAVEFORM)

.. GENERATED FROM PYTHON SOURCE LINES 133-136

.. code-block:: default

    plot_spectrogram(spec[0], title="torchaudio")

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_002.png
   :alt: torchaudio
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 137-142

GriffinLim
----------

To recover a waveform from a spectrogram, you can use
:py:class:`torchaudio.transforms.GriffinLim`, which iteratively estimates
the phase information that the (magnitude/power) spectrogram discards.

.. GENERATED FROM PYTHON SOURCE LINES 142-155

.. code-block:: default

    torch.random.manual_seed(0)

    n_fft = 1024
    win_length = None
    hop_length = 512

    spec = T.Spectrogram(
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
    )(SPEECH_WAVEFORM)

.. GENERATED FROM PYTHON SOURCE LINES 157-164

.. code-block:: default

    griffin_lim = T.GriffinLim(
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
    )

.. GENERATED FROM PYTHON SOURCE LINES 166-169

.. code-block:: default

    reconstructed_waveform = griffin_lim(spec)

.. GENERATED FROM PYTHON SOURCE LINES 171-175

.. code-block:: default

    plot_waveform(reconstructed_waveform, SAMPLE_RATE, title="Reconstructed")
    Audio(reconstructed_waveform, rate=SAMPLE_RATE)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_003.png
   :alt: Reconstructed
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_003.png
   :class: sphx-glr-single-img


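The following sections convert linear frequencies to the mel scale. Both
``mel_scale="htk"`` in torchaudio and ``htk=True`` in librosa refer to the
HTK formula, ``mel = 2595 * log10(1 + f / 700)``. Below is a small sketch
of the mapping and its inverse (the helper names are ours, not torchaudio
APIs):

```python
import math

def hz_to_mel(freq_hz):
    # HTK-style mel scale: approximately linear below 1 kHz,
    # logarithmic above.
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    # Inverse of the HTK mapping.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # close to 1000: the scale is near-linear below 1 kHz
```
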
.. GENERATED FROM PYTHON SOURCE LINES 176-185

Mel Filter Bank
---------------

:py:func:`torchaudio.functional.melscale_fbanks` generates the filter bank
for converting frequency bins to mel-scale bins.

Since this function does not require input audio/features, there is no
equivalent transform in :py:mod:`torchaudio.transforms`.

.. GENERATED FROM PYTHON SOURCE LINES 185-199

.. code-block:: default

    n_fft = 256
    n_mels = 64
    sample_rate = 6000

    mel_filters = F.melscale_fbanks(
        int(n_fft // 2 + 1),
        n_mels=n_mels,
        f_min=0.0,
        f_max=sample_rate / 2.0,
        sample_rate=sample_rate,
        norm="slaney",
    )

.. GENERATED FROM PYTHON SOURCE LINES 201-204

.. code-block:: default

    plot_fbank(mel_filters, "Mel Filter Bank - torchaudio")

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_004.png
   :alt: Mel Filter Bank - torchaudio
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_004.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 205-211

Comparison against librosa
~~~~~~~~~~~~~~~~~~~~~~~~~~

For reference, here is the equivalent way to get the mel filter bank with
``librosa``.

.. GENERATED FROM PYTHON SOURCE LINES 211-222

.. code-block:: default

    mel_filters_librosa = librosa.filters.mel(
        sr=sample_rate,
        n_fft=n_fft,
        n_mels=n_mels,
        fmin=0.0,
        fmax=sample_rate / 2.0,
        norm="slaney",
        htk=True,
    ).T

.. GENERATED FROM PYTHON SOURCE LINES 224-230

.. code-block:: default

    plot_fbank(mel_filters_librosa, "Mel Filter Bank - librosa")

    mse = torch.square(mel_filters - mel_filters_librosa).mean().item()
    print("Mean Square Difference: ", mse)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_005.png
   :alt: Mel Filter Bank - librosa
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_005.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Mean Square Difference:  3.84594449432978e-17

.. GENERATED FROM PYTHON SOURCE LINES 231-239

MelSpectrogram
--------------

Generating a mel-scale spectrogram involves generating a spectrogram and
performing mel-scale conversion. In ``torchaudio``,
:py:class:`torchaudio.transforms.MelSpectrogram` provides this
functionality.

.. GENERATED FROM PYTHON SOURCE LINES 239-261

.. code-block:: default

    n_fft = 1024
    win_length = None
    hop_length = 512
    n_mels = 128

    mel_spectrogram = T.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        center=True,
        pad_mode="reflect",
        power=2.0,
        norm="slaney",
        onesided=True,
        n_mels=n_mels,
        mel_scale="htk",
    )

    melspec = mel_spectrogram(SPEECH_WAVEFORM)

.. GENERATED FROM PYTHON SOURCE LINES 263-266

.. code-block:: default

    plot_spectrogram(melspec[0], title="MelSpectrogram - torchaudio", ylabel="mel freq")

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_006.png
   :alt: MelSpectrogram - torchaudio
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_006.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 267-273

Comparison against librosa
~~~~~~~~~~~~~~~~~~~~~~~~~~

For reference, here is the equivalent means of generating mel-scale
spectrograms with ``librosa``.

.. GENERATED FROM PYTHON SOURCE LINES 273-288

.. code-block:: default

    melspec_librosa = librosa.feature.melspectrogram(
        y=SPEECH_WAVEFORM.numpy()[0],
        sr=sample_rate,
        n_fft=n_fft,
        hop_length=hop_length,
        win_length=win_length,
        center=True,
        pad_mode="reflect",
        power=2.0,
        n_mels=n_mels,
        norm="slaney",
        htk=True,
    )

.. GENERATED FROM PYTHON SOURCE LINES 290-296

.. code-block:: default

    plot_spectrogram(melspec_librosa, title="MelSpectrogram - librosa", ylabel="mel freq")

    mse = torch.square(melspec - melspec_librosa).mean().item()
    print("Mean Square Difference: ", mse)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_007.png
   :alt: MelSpectrogram - librosa
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_007.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Mean Square Difference:  1.0186037568971074e-09

.. GENERATED FROM PYTHON SOURCE LINES 297-300

MFCC
----

Mel-frequency cepstral coefficients (MFCC) are obtained by applying a
discrete cosine transform to a log-scaled mel spectrogram.
:py:class:`torchaudio.transforms.MFCC` computes them directly from a
waveform.

.. GENERATED FROM PYTHON SOURCE LINES 300-320

.. code-block:: default

    n_fft = 2048
    win_length = None
    hop_length = 512
    n_mels = 256
    n_mfcc = 256

    mfcc_transform = T.MFCC(
        sample_rate=sample_rate,
        n_mfcc=n_mfcc,
        melkwargs={
            "n_fft": n_fft,
            "n_mels": n_mels,
            "hop_length": hop_length,
            "mel_scale": "htk",
        },
    )

    mfcc = mfcc_transform(SPEECH_WAVEFORM)

.. GENERATED FROM PYTHON SOURCE LINES 322-325

.. code-block:: default

    plot_spectrogram(mfcc[0])

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_008.png
   :alt: Spectrogram (db)
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_008.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 326-329

Comparison against librosa
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 329-348

.. code-block:: default

    melspec = librosa.feature.melspectrogram(
        y=SPEECH_WAVEFORM.numpy()[0],
        sr=sample_rate,
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        n_mels=n_mels,
        htk=True,
        norm=None,
    )

    mfcc_librosa = librosa.feature.mfcc(
        S=librosa.core.spectrum.power_to_db(melspec),
        n_mfcc=n_mfcc,
        dct_type=2,
        norm="ortho",
    )

.. GENERATED FROM PYTHON SOURCE LINES 350-356

.. code-block:: default

    plot_spectrogram(mfcc_librosa)

    mse = torch.square(mfcc - mfcc_librosa).mean().item()
    print("Mean Square Difference: ", mse)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_009.png
   :alt: Spectrogram (db)
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_009.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Mean Square Difference:  0.8103954195976257

.. GENERATED FROM PYTHON SOURCE LINES 357-360

LFCC
----

Linear-frequency cepstral coefficients (LFCC) are computed the same way as
MFCC, but from a linear-frequency filter bank instead of a mel filter bank.
:py:class:`torchaudio.transforms.LFCC` provides this transform.

.. GENERATED FROM PYTHON SOURCE LINES 360-379

.. code-block:: default

    n_fft = 2048
    win_length = None
    hop_length = 512
    n_lfcc = 256

    lfcc_transform = T.LFCC(
        sample_rate=sample_rate,
        n_lfcc=n_lfcc,
        speckwargs={
            "n_fft": n_fft,
            "win_length": win_length,
            "hop_length": hop_length,
        },
    )

    lfcc = lfcc_transform(SPEECH_WAVEFORM)
    plot_spectrogram(lfcc[0])

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_010.png
   :alt: Spectrogram (db)
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_010.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 380-383

Pitch
-----

:py:func:`torchaudio.functional.detect_pitch_frequency` estimates the
fundamental frequency of a waveform over time.

.. GENERATED FROM PYTHON SOURCE LINES 383-386

.. code-block:: default

    pitch = F.detect_pitch_frequency(SPEECH_WAVEFORM, SAMPLE_RATE)

.. GENERATED FROM PYTHON SOURCE LINES 388-408

.. code-block:: default

    def plot_pitch(waveform, sr, pitch):
        figure, axis = plt.subplots(1, 1)
        axis.set_title("Pitch Feature")
        axis.grid(True)

        end_time = waveform.shape[1] / sr
        time_axis = torch.linspace(0, end_time, waveform.shape[1])
        axis.plot(time_axis, waveform[0], linewidth=1, color="gray", alpha=0.3)

        axis2 = axis.twinx()
        time_axis = torch.linspace(0, end_time, pitch.shape[1])
        axis2.plot(time_axis, pitch[0], linewidth=2, label="Pitch", color="green")

        axis2.legend(loc=0)
        plt.show(block=False)


    plot_pitch(SPEECH_WAVEFORM, SAMPLE_RATE, pitch)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_011.png
   :alt: Pitch Feature
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_011.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 409-427

Kaldi Pitch (beta)
------------------

Kaldi Pitch feature [1] is a pitch detection mechanism tuned for automatic
speech recognition (ASR) applications. This is a beta feature in
``torchaudio``, and it is available as
:py:func:`torchaudio.functional.compute_kaldi_pitch`.

1. "A pitch extraction algorithm tuned for automatic speech recognition",
   P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal and
   S. Khudanpur, 2014 IEEE International Conference on Acoustics, Speech
   and Signal Processing (ICASSP), Florence, 2014, pp. 2494-2498,
   doi: 10.1109/ICASSP.2014.6854049.

.. GENERATED FROM PYTHON SOURCE LINES 427-431

.. code-block:: default

    pitch_feature = F.compute_kaldi_pitch(SPEECH_WAVEFORM, SAMPLE_RATE)
    pitch, nfcc = pitch_feature[..., 0], pitch_feature[..., 1]

.. GENERATED FROM PYTHON SOURCE LINES 433-458

.. code-block:: default

    def plot_kaldi_pitch(waveform, sr, pitch, nfcc):
        _, axis = plt.subplots(1, 1)
        axis.set_title("Kaldi Pitch Feature")
        axis.grid(True)

        end_time = waveform.shape[1] / sr
        time_axis = torch.linspace(0, end_time, waveform.shape[1])
        axis.plot(time_axis, waveform[0], linewidth=1, color="gray", alpha=0.3)

        time_axis = torch.linspace(0, end_time, pitch.shape[1])
        ln1 = axis.plot(time_axis, pitch[0], linewidth=2, label="Pitch", color="green")
        axis.set_ylim((-1.3, 1.3))

        axis2 = axis.twinx()
        time_axis = torch.linspace(0, end_time, nfcc.shape[1])
        ln2 = axis2.plot(time_axis, nfcc[0], linewidth=2, label="NFCC", color="blue", linestyle="--")

        lns = ln1 + ln2
        labels = [l.get_label() for l in lns]
        axis.legend(lns, labels, loc=0)

        plt.show(block=False)


    plot_kaldi_pitch(SPEECH_WAVEFORM, SAMPLE_RATE, pitch, nfcc)

.. image-sg:: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_012.png
   :alt: Kaldi Pitch Feature
   :srcset: /tutorials/images/sphx_glr_audio_feature_extractions_tutorial_012.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 5.984 seconds)


.. _sphx_glr_download_tutorials_audio_feature_extractions_tutorial.py:

.. only:: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: audio_feature_extractions_tutorial.py <audio_feature_extractions_tutorial.py>`

  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: audio_feature_extractions_tutorial.ipynb <audio_feature_extractions_tutorial.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_