.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "beginner/flava_finetuning_tutorial.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_beginner_flava_finetuning_tutorial.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_beginner_flava_finetuning_tutorial.py:


TorchMultimodal Tutorial: Finetuning FLAVA
============================================

.. GENERATED FROM PYTHON SOURCE LINES 8-22

Multimodal AI has recently become very popular owing to its ubiquitous nature,
from use cases like image captioning and visual search to more recent
applications like image generation from text. **TorchMultimodal is a library
powered by PyTorch consisting of building blocks and end-to-end examples,
aiming to enable and accelerate research in multimodality**.

In this tutorial, we will demonstrate how to use a **pretrained SoTA model
called** `FLAVA <https://arxiv.org/abs/2112.04482>`__ **from the
TorchMultimodal library to finetune on a multimodal task, namely visual
question answering** (VQA). The model consists of two unimodal
transformer-based encoders for text and image, and a multimodal encoder that
combines the two embeddings. It is pretrained using contrastive, image-text
matching, and text, image, and multimodal masking losses.

.. GENERATED FROM PYTHON SOURCE LINES 25-41

Installation
-----------------
We will use the TextVQA dataset and the ``BertTokenizer`` from Hugging Face
for this tutorial, so you need to install ``datasets`` and ``transformers``
in addition to TorchMultimodal.

.. note::

   When running this tutorial in Google Colab, install the required packages
   by creating a new cell and running the following commands:

   .. code-block::

      !pip install torchmultimodal-nightly
      !pip install datasets
      !pip install transformers

.. GENERATED FROM PYTHON SOURCE LINES 43-70

Steps
-----

1. Download the vocabulary file containing the answer classes to a directory
   on your computer by running the following command:

   .. code-block::

      wget http://dl.fbaipublicfiles.com/pythia/data/vocab.tar.gz
      tar xf vocab.tar.gz

   .. note::

      If you are running this tutorial in Google Colab, run these commands
      in a new cell and prepend them with an exclamation mark (!)


2. For this tutorial, we treat VQA as a classification task where the inputs
   are images and questions (text) and the output is an answer class. So we
   need to read the vocab file with answer classes and create the
   answer-to-label mapping. We also load the
   `textvqa dataset <https://huggingface.co/datasets/textvqa>`__ containing
   34602 training samples (images, questions, and answers) from Hugging Face.

We see there are 3997 answer classes, including a class representing unknown
answers.

.. GENERATED FROM PYTHON SOURCE LINES 70-83

.. code-block:: default


    with open("data/vocabs/answers_textvqa_more_than_1.txt") as f:
        vocab = f.readlines()

    answer_to_idx = {}
    for idx, entry in enumerate(vocab):
        answer_to_idx[entry.strip("\n")] = idx
    print(len(vocab))
    print(vocab[:5])

    from datasets import load_dataset
    dataset = load_dataset("textvqa")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    3997
    ['<unk>\n', 'nokia\n', 'ec\n', 'virgin\n', '2011\n']

3. Next, we transform each sample into tensors the model can consume: the
   image is converted to a tensor and resized with torchvision transforms,
   the question is tokenized and padded with the Hugging Face
   ``BertTokenizer``, and the most frequently occurring answer is mapped to
   its label index using ``answer_to_idx`` (a sketch of this transform is
   given after the training output below).

4. We then instantiate the FLAVA classification model from TorchMultimodal,
   which loads a pretrained FLAVA checkpoint and adds a classification head
   with one output per answer class.

5. Finally, we put together the dataset and model in a toy training loop and
   train for 3 iterations:

.. code-block:: default


    import torch
    from torch.utils.data import DataLoader
    from torchmultimodal.models.flava.model import flava_model_for_classification

    BATCH_SIZE = 2
    MAX_STEPS = 3

    # Loads the pretrained FLAVA checkpoint and adds a classification head
    # with one output per answer class.
    model = flava_model_for_classification(num_classes=len(vocab))

    # Assumes the transform from step 3 has already been applied to the
    # dataset, for example via ``dataset.set_transform(...)``.
    train_dataloader = DataLoader(dataset["train"], batch_size=BATCH_SIZE)
    optimizer = torch.optim.AdamW(model.parameters())

    model.train()
    for idx, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        out = model(text=batch["input_ids"], image=batch["image"], labels=batch["answers"])
        loss = out.loss
        loss.backward()
        optimizer.step()
        print(f"Loss at step {idx} = {loss}")
        if idx >= MAX_STEPS - 1:
            break

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Loss at step 0 = 8.290360450744629
    Loss at step 1 = 8.358966827392578
    Loss at step 2 = 8.274675369262695
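For reference, the transform described in step 3 can be written as a single
function applied lazily with ``dataset.set_transform``. The following is a
minimal sketch, assuming the ``bert-base-uncased`` tokenizer; the 224x224
resize and the 512-token padding length are illustrative choices rather than
values required by FLAVA's API.

.. code-block:: default


    import torch
    from collections import defaultdict
    from functools import partial

    from torchvision import transforms
    from transformers import BertTokenizer

    def transform(tokenizer, input):
        batch = {}

        # Convert the PIL image to a tensor and resize it to a uniform size.
        image_transform = transforms.Compose(
            [transforms.ToTensor(), transforms.Resize([224, 224])]
        )
        batch["image"] = [image_transform(input["image"][0].convert("RGB"))]

        # Tokenize and pad the question so every sample has the same length.
        tokenized = tokenizer(
            input["question"],
            return_tensors="pt",
            padding="max_length",
            max_length=512,
        )
        batch.update(tokenized)

        # Use the most frequently occurring annotated answer as the label;
        # answers missing from the vocab map to index 0 (the unknown class).
        ans_to_count = defaultdict(int)
        for ans in input["answers"][0]:
            ans_to_count[ans] += 1
        most_common = max(ans_to_count, key=ans_to_count.get)
        batch["answers"] = torch.as_tensor([answer_to_idx.get(most_common, 0)])
        return batch

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    dataset.set_transform(partial(transform, tokenizer))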
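Since we framed VQA as classification, getting an answer string back out of
the model amounts to taking the argmax over the per-class scores and
inverting the answer-to-index mapping. Below is a minimal sketch, assuming
the classification output exposes the scores as ``logits``; inspect the
object returned by ``flava_model_for_classification`` in your TorchMultimodal
version to confirm the attribute name.

.. code-block:: default


    # Invert the answer-to-index mapping so predictions can be decoded.
    idx_to_answer = {idx: ans for ans, idx in answer_to_idx.items()}

    model.eval()
    with torch.no_grad():
        sample = next(iter(train_dataloader))
        out = model(text=sample["input_ids"], image=sample["image"], labels=sample["answers"])
        # ``logits`` is an assumption here, not a documented guarantee.
        predicted = out.logits.argmax(dim=-1)
        print([idx_to_answer[p.item()] for p in predicted])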
.. GENERATED FROM PYTHON SOURCE LINES 180-191

Conclusion
-------------------

This tutorial introduced the basics of finetuning on a multimodal task using
FLAVA from TorchMultimodal. Please also check out other examples from the
library, like MDETR, a multimodal model for object detection, and Omnivore, a
multitask model spanning image, video, and 3D classification, both available
in the `TorchMultimodal repository
<https://github.com/facebookresearch/multimodal>`__.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 2 minutes 40.586 seconds)


.. _sphx_glr_download_beginner_flava_finetuning_tutorial.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: flava_finetuning_tutorial.py <flava_finetuning_tutorial.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: flava_finetuning_tutorial.ipynb <flava_finetuning_tutorial.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_