.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tutorials/device_avsr.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_tutorials_device_avsr.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tutorials_device_avsr.py:


Device AV-ASR with Emformer RNN-T
=================================

**Author**: Pingchuan Ma, Moto Hira.

This tutorial shows how to run on-device audio-visual speech recognition
(AV-ASR, or AVSR) with TorchAudio on a streaming device input, i.e. the
microphone on a laptop. AV-ASR is the task of transcribing text from audio
and visual streams, which has recently attracted a lot of research attention
due to its robustness against noise.

.. note::

   This tutorial requires the ffmpeg, sentencepiece, mediapipe,
   opencv-python and scikit-image libraries.

   There are multiple ways to install the ffmpeg libraries.
   If you are using the Anaconda Python distribution,
   ``conda install -c conda-forge 'ffmpeg<7'`` will install
   compatible FFmpeg libraries.

   You can run ``pip install sentencepiece mediapipe opencv-python scikit-image``
   to install the other libraries mentioned.

.. note::

   To run this tutorial, please make sure you are in the `tutorial` folder.

.. note::

   We tested the tutorial on torchaudio version 2.0.2 on a MacBook Pro (M1 Pro).

.. GENERATED FROM PYTHON SOURCE LINES 37-45

.. code-block:: default


    import numpy as np
    import sentencepiece as spm
    import torch
    import torchaudio
    import torchvision

.. GENERATED FROM PYTHON SOURCE LINES 46-59

Overview
--------

The real-time AV-ASR system consists of three components: a data collection
module, a pre-processing module, and an end-to-end model. The data collection
module is hardware, such as a microphone and camera. Its role is to collect
information from the real world. Once the information is collected, the
pre-processing module locates and crops out the face. Next, we feed the raw
audio stream and the pre-processed video stream into our end-to-end model for
inference.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/avsr/overview.png

.. GENERATED FROM PYTHON SOURCE LINES 62-72

1. Data acquisition
-------------------

Firstly, we define the function to collect videos from the microphone and
camera. To be specific, we use the :py:class:`~torchaudio.io.StreamReader`
class for the purpose of data collection, which supports capturing
audio/video from a microphone and camera. For the detailed usage of this
class, please refer to the
`tutorial <./streamreader_basic_tutorial.html>`__.
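If you do not have a microphone and camera at hand, the same ``StreamReader``
interface can read from a recorded media file, which is handy for testing the
rest of the pipeline. The following is a minimal sketch and not part of the
tutorial's pipeline; ``example.mp4`` is a placeholder file name, and the chunk
sizes mirror the ones used for the live stream below.

.. code-block:: default


    import torchaudio

    SEGMENT_LENGTH = 8   # video frames per chunk, as in the streaming code below
    SAMPLE_RATE = 19200  # 640 audio samples per video frame at 30 fps

    # "example.mp4" is a placeholder; any file with one audio and one video stream works.
    streamer = torchaudio.io.StreamReader(src="example.mp4")
    streamer.add_basic_video_stream(frames_per_chunk=SEGMENT_LENGTH, width=600, height=340)
    streamer.add_basic_audio_stream(frames_per_chunk=SEGMENT_LENGTH * 640, sample_rate=SAMPLE_RATE)

    for chunk_v, chunk_a in streamer.stream():
        # chunk_v: (frames, channels, height, width), chunk_a: (samples, channels)
        print(chunk_v.shape, chunk_a.shape)
        break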
.. GENERATED FROM PYTHON SOURCE LINES 72-114

.. code-block:: default


    def stream(q, format, option, src, segment_length, sample_rate):
        print("Building StreamReader...")
        streamer = torchaudio.io.StreamReader(src=src, format=format, option=option)
        streamer.add_basic_video_stream(frames_per_chunk=segment_length, buffer_chunk_size=500, width=600, height=340)
        streamer.add_basic_audio_stream(frames_per_chunk=segment_length * 640, sample_rate=sample_rate)

        print(streamer.get_src_stream_info(0))
        print(streamer.get_src_stream_info(1))
        print("Streaming...")
        print()
        for (chunk_v, chunk_a) in streamer.stream(timeout=-1, backoff=1.0):
            q.put([chunk_v, chunk_a])


    class ContextCacher:
        def __init__(self, segment_length: int, context_length: int, rate_ratio: int):
            self.segment_length = segment_length
            self.context_length = context_length

            self.context_length_v = context_length
            self.context_length_a = context_length * rate_ratio
            self.context_v = torch.zeros([self.context_length_v, 3, 340, 600])
            self.context_a = torch.zeros([self.context_length_a, 1])

        def __call__(self, chunk_v, chunk_a):
            if chunk_v.size(0) < self.segment_length:
                chunk_v = torch.nn.functional.pad(chunk_v, (0, 0, 0, 0, 0, 0, 0, self.segment_length - chunk_v.size(0)))
            if chunk_a.size(0) < self.segment_length * 640:
                chunk_a = torch.nn.functional.pad(chunk_a, (0, 0, 0, self.segment_length * 640 - chunk_a.size(0)))

            if self.context_length == 0:
                return chunk_v.float(), chunk_a.float()
            else:
                chunk_with_context_v = torch.cat((self.context_v, chunk_v))
                chunk_with_context_a = torch.cat((self.context_a, chunk_a))
                self.context_v = chunk_v[-self.context_length_v :]
                self.context_a = chunk_a[-self.context_length_a :]
                return chunk_with_context_v.float(), chunk_with_context_a.float()


.. GENERATED FROM PYTHON SOURCE LINES 115-140

2. Pre-processing
-----------------

Before feeding the raw stream into our model, each video sequence has to
undergo a specific pre-processing procedure. This involves three critical
steps. The first step is to perform face detection. Following that, each
individual frame is aligned to a referenced frame, commonly known as the
mean face, in order to normalize rotation and size differences across frames.
The final step in the pre-processing module is to crop the face region from
the aligned face image.

.. list-table::
   :widths: 25 25 25 25
   :header-rows: 0

   * - .. image:: https://download.pytorch.org/torchaudio/doc-assets/avsr/original.gif
     - .. image:: https://download.pytorch.org/torchaudio/doc-assets/avsr/detected.gif
     - .. image:: https://download.pytorch.org/torchaudio/doc-assets/avsr/transformed.gif
     - .. image:: https://download.pytorch.org/torchaudio/doc-assets/avsr/cropped.gif
   * - 0. Original
     - 1. Detected
     - 2. Transformed
     - 3. Cropped
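Before looking at the full ``Preprocessing`` module, the tensor-level part of
the transform (everything after landmark detection and cropping) can be
sanity-checked on dummy data. This is a minimal sketch under the assumption of
32 already-cropped RGB frames; the 88x88 size is an arbitrary placeholder,
while the resize target and normalization constants are the ones used in the
module below.

.. code-block:: default


    import torch
    import torchvision

    # Dummy "cropped face" clip: 32 RGB frames of size 88x88 with pixel values in [0, 255].
    frames = torch.randint(0, 256, (32, 3, 88, 88)).float()

    # Resize each frame so that its shorter edge becomes 44 pixels, then re-stack.
    resized = torch.stack(
        [torchvision.transforms.functional.resize(f, 44, antialias=True) for f in frames]
    )

    transform = torch.nn.Sequential(
        torchvision.transforms.Normalize(0.0, 255.0),    # scale pixel values to [0, 1]
        torchvision.transforms.Grayscale(),              # 3 channels -> 1 channel
        torchvision.transforms.Normalize(0.421, 0.165),  # standardize with dataset statistics
    )

    print(transform(resized).shape)  # torch.Size([32, 1, 44, 44])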
.. GENERATED FROM PYTHON SOURCE LINES 140-183

.. code-block:: default


    import sys

    sys.path.insert(0, "../../examples")

    from avsr.data_prep.detectors.mediapipe.detector import LandmarksDetector
    from avsr.data_prep.detectors.mediapipe.video_process import VideoProcess


    class FunctionalModule(torch.nn.Module):
        def __init__(self, functional):
            super().__init__()
            self.functional = functional

        def forward(self, input):
            return self.functional(input)


    class Preprocessing(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.landmarks_detector = LandmarksDetector()
            self.video_process = VideoProcess()
            self.video_transform = torch.nn.Sequential(
                FunctionalModule(
                    lambda n: [(lambda x: torchvision.transforms.functional.resize(x, 44, antialias=True))(i) for i in n]
                ),
                FunctionalModule(lambda x: torch.stack(x)),
                torchvision.transforms.Normalize(0.0, 255.0),
                torchvision.transforms.Grayscale(),
                torchvision.transforms.Normalize(0.421, 0.165),
            )

        def forward(self, audio, video):
            video = video.permute(0, 2, 3, 1).cpu().numpy().astype(np.uint8)
            landmarks = self.landmarks_detector(video)
            video = self.video_process(video, landmarks)
            video = torch.tensor(video).permute(0, 3, 1, 2).float()
            video = self.video_transform(video)
            audio = audio.mean(axis=-1, keepdim=True)
            return audio, video


.. GENERATED FROM PYTHON SOURCE LINES 184-198

3. Building inference pipeline
------------------------------

The next step is to create the components required for the pipeline.

We use convolutional-based front-ends to extract features from both the raw
audio and video streams. These features are then passed through a two-layer
MLP for fusion. For our transducer model, we leverage the TorchAudio library,
which incorporates an encoder (Emformer), a predictor, and a joint network.
The architecture of the proposed AV-ASR model is illustrated as follows.

.. image:: https://download.pytorch.org/torchaudio/doc-assets/avsr/architecture.png
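The first class defined below, ``SentencePieceTokenProcessor``, joins the
decoded SentencePiece pieces and replaces the word-boundary marker
``"\u2581"`` (``▁``) with a space. If you are curious what those pieces look
like, here is a small sketch that is not part of the pipeline; it downloads
the same SentencePiece model that ``main()`` uses, and the exact pieces it
prints depend on that model.

.. code-block:: default


    import sentencepiece as spm
    from torchaudio.utils import download_asset

    # Same SentencePiece model as the one downloaded in main() below.
    spm_model_path = download_asset("tutorial-assets/spm_unigram_1023.model")
    sp_model = spm.SentencePieceProcessor(model_file=spm_model_path)

    pieces = sp_model.encode("hello world", out_type=str)
    print(pieces)  # pieces such as ['▁hello', '▁world'] (possibly split further into subwords)

    # Joining the pieces and replacing '▁' (U+2581) with a space recovers plain text,
    # which is what SentencePieceTokenProcessor below does with the decoder output.
    print("".join(pieces).replace("\u2581", " ").lstrip())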
.. GENERATED FROM PYTHON SOURCE LINES 198-258

.. code-block:: default


    class SentencePieceTokenProcessor:
        def __init__(self, sp_model):
            self.sp_model = sp_model
            self.post_process_remove_list = {
                self.sp_model.unk_id(),
                self.sp_model.eos_id(),
                self.sp_model.pad_id(),
            }

        def __call__(self, tokens, lstrip: bool = True) -> str:
            filtered_hypo_tokens = [
                token_index for token_index in tokens[1:] if token_index not in self.post_process_remove_list
            ]
            output_string = "".join(self.sp_model.id_to_piece(filtered_hypo_tokens)).replace("\u2581", " ")

            if lstrip:
                return output_string.lstrip()
            else:
                return output_string


    class InferencePipeline(torch.nn.Module):
        def __init__(self, preprocessor, model, decoder, token_processor):
            super().__init__()
            self.preprocessor = preprocessor
            self.model = model
            self.decoder = decoder
            self.token_processor = token_processor

            self.state = None
            self.hypotheses = None

        def forward(self, audio, video):
            audio, video = self.preprocessor(audio, video)
            feats = self.model(audio.unsqueeze(0), video.unsqueeze(0))
            length = torch.tensor([feats.size(1)], device=audio.device)
            self.hypotheses, self.state = self.decoder.infer(feats, length, 10, state=self.state, hypothesis=self.hypotheses)
            transcript = self.token_processor(self.hypotheses[0][0], lstrip=False)
            return transcript


    def _get_inference_pipeline(model_path, spm_model_path):
        model = torch.jit.load(model_path)
        model.eval()

        sp_model = spm.SentencePieceProcessor(model_file=spm_model_path)
        token_processor = SentencePieceTokenProcessor(sp_model)

        decoder = torchaudio.models.RNNTBeamSearch(model.model, sp_model.get_piece_size())

        return InferencePipeline(
            preprocessor=Preprocessing(),
            model=model,
            decoder=decoder,
            token_processor=token_processor,
        )


.. GENERATED FROM PYTHON SOURCE LINES 259-269

4. The main process
-------------------

The execution flow of the main process is as follows:

1. Initialize the inference pipeline.
2. Launch data acquisition subprocess.
3. Run inference.
4. Clean up.

.. GENERATED FROM PYTHON SOURCE LINES 269-328

.. code-block:: default


    from torchaudio.utils import download_asset


    def main(device, src, option=None):
        print("Building pipeline...")
        model_path = download_asset("tutorial-assets/device_avsr_model.pt")
        spm_model_path = download_asset("tutorial-assets/spm_unigram_1023.model")

        pipeline = _get_inference_pipeline(model_path, spm_model_path)

        BUFFER_SIZE = 32
        segment_length = 8
        context_length = 4
        sample_rate = 19200
        frame_rate = 30
        rate_ratio = sample_rate // frame_rate
        cacher = ContextCacher(BUFFER_SIZE, context_length, rate_ratio)

        import torch.multiprocessing as mp

        ctx = mp.get_context("spawn")

        @torch.inference_mode()
        def infer():
            num_video_frames = 0
            video_chunks = []
            audio_chunks = []
            while True:
                chunk_v, chunk_a = q.get()
                num_video_frames += chunk_a.size(0) // 640
                video_chunks.append(chunk_v)
                audio_chunks.append(chunk_a)
                if num_video_frames < BUFFER_SIZE:
                    continue

                video = torch.cat(video_chunks)
                audio = torch.cat(audio_chunks)
                video, audio = cacher(video, audio)
                pipeline.state, pipeline.hypotheses = None, None
                transcript = pipeline(audio, video.float())
                print(transcript, end="", flush=True)

                num_video_frames = 0
                video_chunks = []
                audio_chunks = []

        q = ctx.Queue()
        p = ctx.Process(target=stream, args=(q, device, option, src, segment_length, sample_rate))
        p.start()
        infer()
        p.join()


    if __name__ == "__main__":
        main(
            device="avfoundation",
            src="0:1",
            option={"framerate": "30", "pixel_format": "rgb24"},
        )
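The constants in ``main()`` are tied together: the pipeline pairs 640 audio
samples with each video frame, so the audio sample rate must equal the video
frame rate times 640. A small sanity-check sketch, not part of the pipeline,
spelling out the arithmetic:

.. code-block:: default


    # Relationship between the constants used in main() above.
    frame_rate = 30                               # video frames per second requested from the camera
    samples_per_frame = 640                       # audio samples paired with each video frame (rate_ratio)
    sample_rate = frame_rate * samples_per_frame  # 30 * 640 = 19200 Hz
    BUFFER_SIZE = 32                              # video frames accumulated before each inference call

    print(sample_rate)               # 19200
    print(BUFFER_SIZE / frame_rate)  # ~1.07 seconds of audio-visual input per transcript update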
.. GENERATED FROM PYTHON SOURCE LINES 329-339

.. code::

   Building pipeline...
   Building StreamReader...
   SourceVideoStream(media_type='video', codec='rawvideo', codec_long_name='raw video', format='uyvy422', bit_rate=0, num_frames=0, bits_per_sample=0, metadata={}, width=1552, height=1552, frame_rate=1000000.0)
   SourceAudioStream(media_type='audio', codec='pcm_f32le', codec_long_name='PCM 32-bit floating point little-endian', format='flt', bit_rate=1536000, num_frames=0, bits_per_sample=0, metadata={}, sample_rate=48000.0, num_channels=1)
   Streaming...

   hello world

.. GENERATED FROM PYTHON SOURCE LINES 342-344

Tag: :obj:`torchaudio.io`

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.000 seconds)


.. _sphx_glr_download_tutorials_device_avsr.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: device_avsr.py <device_avsr.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: device_avsr.ipynb <device_avsr.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_