Video MViT
The MViT model is based on two papers: "Multiscale Vision Transformers" and "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection".
Model builders
The following model builders can be used to instantiate an MViT v1 or v2 model, with or without pre-trained weights. All the model builders internally rely on the torchvision.models.video.MViT base class. Please refer to the source code for more details about this class.
Constructs a base MViTV1 architecture from "Multiscale Vision Transformers".
Constructs a small MViTV2 architecture from "Multiscale Vision Transformers" and "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection".