torchvision.models
The models subpackage contains definitions of models for addressing different tasks, including: image classification, pixelwise semantic segmentation, object detection, instance segmentation, person keypoint detection and video classification.
Classification
The models subpackage contains definitions for the following model architectures for image classification:
 AlexNet
 VGG
 ResNet
 SqueezeNet
 DenseNet
 Inception v3
 GoogLeNet
 ShuffleNet v2
 MobileNetV2
 MobileNetV3
 ResNeXt
 Wide ResNet
 MNASNet
You can construct a model with random weights by calling its constructor:
import torchvision.models as models
resnet18 = models.resnet18()
alexnet = models.alexnet()
vgg16 = models.vgg16()
squeezenet = models.squeezenet1_0()
densenet = models.densenet161()
inception = models.inception_v3()
googlenet = models.googlenet()
shufflenet = models.shufflenet_v2_x1_0()
mobilenet_v2 = models.mobilenet_v2()
mobilenet_v3_large = models.mobilenet_v3_large()
mobilenet_v3_small = models.mobilenet_v3_small()
resnext50_32x4d = models.resnext50_32x4d()
wide_resnet50_2 = models.wide_resnet50_2()
mnasnet = models.mnasnet1_0()
We provide pretrained models, using the PyTorch torch.utils.model_zoo. These can be constructed by passing pretrained=True:
import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
alexnet = models.alexnet(pretrained=True)
squeezenet = models.squeezenet1_0(pretrained=True)
vgg16 = models.vgg16(pretrained=True)
densenet = models.densenet161(pretrained=True)
inception = models.inception_v3(pretrained=True)
googlenet = models.googlenet(pretrained=True)
shufflenet = models.shufflenet_v2_x1_0(pretrained=True)
mobilenet_v2 = models.mobilenet_v2(pretrained=True)
mobilenet_v3_large = models.mobilenet_v3_large(pretrained=True)
mobilenet_v3_small = models.mobilenet_v3_small(pretrained=True)
resnext50_32x4d = models.resnext50_32x4d(pretrained=True)
wide_resnet50_2 = models.wide_resnet50_2(pretrained=True)
mnasnet = models.mnasnet1_0(pretrained=True)
Instantiating a pretrained model will download its weights to a cache directory. This directory can be set using the TORCH_MODEL_ZOO environment variable. See torch.utils.model_zoo.load_url() for details.
Some models use modules which have different training and evaluation behavior, such as batch normalization. To switch between these modes, use model.train() or model.eval() as appropriate. See train() or eval() for details.
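As a small illustration of the difference between the two modes, the sketch below uses a toy module with a dropout layer rather than one of the torchvision models. The module, the tensor size and the variable names are assumptions chosen for the example only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a model whose behavior depends on the mode.
model = nn.Sequential(nn.Dropout(p=0.5))
x = torch.ones(1, 1000)

model.train()                 # training mode: dropout randomly zeroes entries
train_out = model(x)

model.eval()                  # evaluation mode: dropout passes the input through
with torch.no_grad():         # gradients are not needed for inference
    eval_out = model(x)

num_dropped = int((train_out == 0).sum())
```

Forgetting to call eval() before inference is a common source of results that change from run to run.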
All pretrained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
You can use the following transform to normalize:
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
An example of such normalization can be found in the ImageNet example.
The process for obtaining the values of mean and std is roughly equivalent to:
import torch
from torchvision import datasets, transforms as T

transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
dataset = datasets.ImageNet(".", split="train", transform=transform)

means = []
stds = []
# `subset` is a placeholder for the selection of training images that was used
for img in subset(dataset):
    means.append(torch.mean(img))
    stds.append(torch.std(img))

mean = torch.mean(torch.tensor(means))
std = torch.mean(torch.tensor(stds))
Unfortunately, the concrete subset that was used is lost. For more information see this discussion or these experiments.
ImageNet 1-crop accuracy, in percent (224x224)
Model  Acc@1  Acc@5 

AlexNet  56.522  79.066 
VGG11  69.020  88.628 
VGG13  69.928  89.246 
VGG16  71.592  90.382 
VGG19  72.376  90.876 
VGG11 with batch normalization  70.370  89.810 
VGG13 with batch normalization  71.586  90.374 
VGG16 with batch normalization  73.360  91.516 
VGG19 with batch normalization  74.218  91.842 
ResNet18  69.758  89.078 
ResNet34  73.314  91.420 
ResNet50  76.130  92.862 
ResNet101  77.374  93.546 
ResNet152  78.312  94.046 
SqueezeNet 1.0  58.092  80.420 
SqueezeNet 1.1  58.178  80.624 
Densenet121  74.434  91.972 
Densenet169  75.600  92.806 
Densenet201  76.896  93.370 
Densenet161  77.138  93.560 
Inception v3  77.294  93.450 
GoogleNet  69.778  89.530 
ShuffleNet V2 x1.0  69.362  88.316 
ShuffleNet V2 x0.5  60.552  81.746 
MobileNet V2  71.878  90.286 
MobileNet V3 Large  74.042  91.340 
MobileNet V3 Small  67.668  87.402 
ResNeXt-50-32x4d  77.618  93.698
ResNeXt-101-32x8d  79.312  94.526
Wide ResNet-50-2  78.468  94.086
Wide ResNet-101-2  78.848  94.284
MNASNet 1.0  73.456  91.510 
MNASNet 0.5  67.734  87.490 
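For readers unfamiliar with the column names, Acc@1 and Acc@5 count a sample as correct when the true class is the highest-scoring class, or among the five highest-scoring classes, respectively. The sketch below shows one way to compute them; the helper name topk_accuracy and the toy scores are assumptions, and this is not the evaluation script that produced the table.

```python
import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    """Return {k: percentage of samples whose target is among the k highest scores}."""
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)      # (N, max_k) class indices
    hits = pred.eq(targets.unsqueeze(1))     # (N, max_k) booleans
    return {k: hits[:, :k].any(dim=1).float().mean().item() * 100.0 for k in ks}

logits = torch.tensor([
    [0.10, 0.90, 0.05, 0.04, 0.03, 0.02],   # highest score: class 1
    [5.00, 4.00, 3.00, 2.00, 1.00, 0.00],   # highest score: class 0
])
targets = torch.tensor([1, 2])
acc = topk_accuracy(logits, targets)
```

In this toy batch the second sample is wrong at k=1 but right at k=5, so the two metrics differ.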
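AlexNet

torchvision.models.alexnet(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.alexnet.AlexNet
AlexNet model architecture from the “One weird trick…” paper.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr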
VGG

torchvision.models.vgg11(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.vgg.VGG
VGG 11-layer model (configuration “A”) from “Very Deep Convolutional Networks For Large-Scale Image Recognition” <https://arxiv.org/pdf/1409.1556.pdf>.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg11_bn(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.vgg.VGG
VGG 11-layer model (configuration “A”) with batch normalization, from “Very Deep Convolutional Networks For Large-Scale Image Recognition” <https://arxiv.org/pdf/1409.1556.pdf>.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg13(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.vgg.VGG
VGG 13-layer model (configuration “B”) from “Very Deep Convolutional Networks For Large-Scale Image Recognition” <https://arxiv.org/pdf/1409.1556.pdf>.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg13_bn(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.vgg.VGG
VGG 13-layer model (configuration “B”) with batch normalization, from “Very Deep Convolutional Networks For Large-Scale Image Recognition” <https://arxiv.org/pdf/1409.1556.pdf>.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg16(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.vgg.VGG
VGG 16-layer model (configuration “D”) from “Very Deep Convolutional Networks For Large-Scale Image Recognition” <https://arxiv.org/pdf/1409.1556.pdf>.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg16_bn(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.vgg.VGG
VGG 16-layer model (configuration “D”) with batch normalization, from “Very Deep Convolutional Networks For Large-Scale Image Recognition” <https://arxiv.org/pdf/1409.1556.pdf>.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg19(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.vgg.VGG
VGG 19-layer model (configuration “E”) from “Very Deep Convolutional Networks For Large-Scale Image Recognition” <https://arxiv.org/pdf/1409.1556.pdf>.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.vgg19_bn(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.vgg.VGG
VGG 19-layer model (configuration “E”) with batch normalization, from “Very Deep Convolutional Networks For Large-Scale Image Recognition” <https://arxiv.org/pdf/1409.1556.pdf>.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr
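ResNet

torchvision.models.resnet18(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.resnet.ResNet
ResNet-18 model from “Deep Residual Learning for Image Recognition”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.resnet34(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.resnet.ResNet
ResNet-34 model from “Deep Residual Learning for Image Recognition”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.resnet50(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.resnet.ResNet
ResNet-50 model from “Deep Residual Learning for Image Recognition”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.resnet101(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.resnet.ResNet
ResNet-101 model from “Deep Residual Learning for Image Recognition”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.resnet152(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.resnet.ResNet
ResNet-152 model from “Deep Residual Learning for Image Recognition”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr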
SqueezeNet

torchvision.models.squeezenet1_0(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.squeezenet.SqueezeNet
SqueezeNet model architecture from the “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size” paper.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.squeezenet1_1(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.squeezenet.SqueezeNet
SqueezeNet 1.1 model from the official SqueezeNet repo. SqueezeNet 1.1 has 2.4x less computation and slightly fewer parameters than SqueezeNet 1.0, without sacrificing accuracy.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr
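DenseNet

torchvision.models.densenet121(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.densenet.DenseNet
Densenet-121 model from “Densely Connected Convolutional Networks”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.densenet169(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.densenet.DenseNet
Densenet-169 model from “Densely Connected Convolutional Networks”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.densenet161(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.densenet.DenseNet
Densenet-161 model from “Densely Connected Convolutional Networks”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.densenet201(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.densenet.DenseNet
Densenet-201 model from “Densely Connected Convolutional Networks”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr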
Inception v3

torchvision.models.inception_v3(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.inception.Inception3
Inception v3 model architecture from “Rethinking the Inception Architecture for Computer Vision”.
Note
Important: In contrast to the other models, inception_v3 expects tensors with a size of N x 3 x 299 x 299, so ensure your images are sized accordingly.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr
 aux_logits (bool) – If True, add an auxiliary branch that can improve training. Default: True
 transform_input (bool) – If True, preprocesses the input according to the method with which it was trained on ImageNet. Default: False
Note
This requires scipy to be installed
GoogLeNet

torchvision.models.googlenet(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.googlenet.GoogLeNet
GoogLeNet (Inception v1) model architecture from “Going Deeper with Convolutions”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr
 aux_logits (bool) – If True, adds two auxiliary branches that can improve training. Default: False when pretrained is True, otherwise True
 transform_input (bool) – If True, preprocesses the input according to the method with which it was trained on ImageNet. Default: False
Note
This requires scipy to be installed
ShuffleNet v2

torchvision.models.shufflenet_v2_x0_5(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.shufflenetv2.ShuffleNetV2
Constructs a ShuffleNetV2 with 0.5x output channels, as described in “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.shufflenet_v2_x1_0(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.shufflenetv2.ShuffleNetV2
Constructs a ShuffleNetV2 with 1.0x output channels, as described in “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.shufflenet_v2_x1_5(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.shufflenetv2.ShuffleNetV2
Constructs a ShuffleNetV2 with 1.5x output channels, as described in “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.shufflenet_v2_x2_0(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.shufflenetv2.ShuffleNetV2
Constructs a ShuffleNetV2 with 2.0x output channels, as described in “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr
MobileNet v2

torchvision.models.mobilenet_v2(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.mobilenetv2.MobileNetV2
Constructs a MobileNetV2 architecture from “MobileNetV2: Inverted Residuals and Linear Bottlenecks”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr
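MobileNet v3

torchvision.models.mobilenet_v3_large(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.mobilenetv3.MobileNetV3
Constructs a large MobileNetV3 architecture from “Searching for MobileNetV3”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.mobilenet_v3_small(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.mobilenetv3.MobileNetV3
Constructs a small MobileNetV3 architecture from “Searching for MobileNetV3”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr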
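ResNeXt

torchvision.models.resnext50_32x4d(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.resnet.ResNet
ResNeXt-50 32x4d model from “Aggregated Residual Transformation for Deep Neural Networks”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.resnext101_32x8d(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.resnet.ResNet
ResNeXt-101 32x8d model from “Aggregated Residual Transformation for Deep Neural Networks”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr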
Wide ResNet

torchvision.models.wide_resnet50_2(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.resnet.ResNet
Wide ResNet-50-2 model from “Wide Residual Networks”.
The model is the same as ResNet except for the bottleneck number of channels, which is twice as large in every block. The number of channels in the outer 1x1 convolutions is the same, e.g. the last block in ResNet-50 has 2048-512-2048 channels, and in Wide ResNet-50-2 it has 2048-1024-2048.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.wide_resnet101_2(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.resnet.ResNet
Wide ResNet-101-2 model from “Wide Residual Networks”.
The model is the same as ResNet except for the bottleneck number of channels, which is twice as large in every block. The number of channels in the outer 1x1 convolutions is the same, e.g. the last block in ResNet-50 has 2048-512-2048 channels, and in Wide ResNet-50-2 it has 2048-1024-2048.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr
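The channel counts quoted for the Wide ResNet bottleneck can be reproduced with a few lines of arithmetic. The sketch below assumes the inner width of a bottleneck is derived as int(planes * base_width / 64) * groups with an output expansion of 4. The helper name bottleneck_channels and this formula are assumptions made for illustration, not a statement of how the library computes it.

```python
def bottleneck_channels(planes, base_width=64, groups=1, expansion=4):
    """Return (input, inner, output) channel counts of one bottleneck block.

    Assumed formula: inner = int(planes * base_width / 64) * groups.
    """
    inner = int(planes * (base_width / 64.0)) * groups
    outer = planes * expansion
    return outer, inner, outer

# Last stage (planes=512) of a standard and of a twice-as-wide bottleneck.
resnet50_last = bottleneck_channels(512)
wide_resnet50_2_last = bottleneck_channels(512, base_width=128)
```

Only the middle number changes: the outer 1x1 convolutions keep 2048 channels while the inner width doubles from 512 to 1024.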
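MNASNet

torchvision.models.mnasnet0_5(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.mnasnet.MNASNet
MNASNet with depth multiplier of 0.5 from “MnasNet: Platform-Aware Neural Architecture Search for Mobile”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.mnasnet0_75(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.mnasnet.MNASNet
MNASNet with depth multiplier of 0.75 from “MnasNet: Platform-Aware Neural Architecture Search for Mobile”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.mnasnet1_0(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.mnasnet.MNASNet
MNASNet with depth multiplier of 1.0 from “MnasNet: Platform-Aware Neural Architecture Search for Mobile”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr

torchvision.models.mnasnet1_3(pretrained: bool = False, progress: bool = True, **kwargs) → torchvision.models.mnasnet.MNASNet
MNASNet with depth multiplier of 1.3 from “MnasNet: Platform-Aware Neural Architecture Search for Mobile”.
Parameters:  pretrained (bool) – If True, returns a model pretrained on ImageNet
 progress (bool) – If True, displays a progress bar of the download to stderr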
Quantized Models
The following architectures provide support for INT8 quantized models. You can get a model with random weights by calling its constructor:
import torchvision.models as models
googlenet = models.quantization.googlenet()
inception_v3 = models.quantization.inception_v3()
mobilenet_v2 = models.quantization.mobilenet_v2()
mobilenet_v3_large = models.quantization.mobilenet_v3_large()
resnet18 = models.quantization.resnet18()
resnet50 = models.quantization.resnet50()
resnext101_32x8d = models.quantization.resnext101_32x8d()
shufflenet_v2_x0_5 = models.quantization.shufflenet_v2_x0_5()
shufflenet_v2_x1_0 = models.quantization.shufflenet_v2_x1_0()
shufflenet_v2_x1_5 = models.quantization.shufflenet_v2_x1_5()
shufflenet_v2_x2_0 = models.quantization.shufflenet_v2_x2_0()
Obtaining a pretrained quantized model can be done with a few lines of code:
import torch
import torchvision.models as models
model = models.quantization.mobilenet_v2(pretrained=True, quantize=True)
model.eval()
# run the model with quantized inputs and weights
out = model(torch.rand(1, 3, 224, 224))
We provide pretrained quantized weights for the following models:
Model  Acc@1  Acc@5 

MobileNet V2  71.658  90.150 
MobileNet V3 Large  73.004  90.858 
ShuffleNet V2  68.360  87.582 
ResNet 18  69.494  88.882 
ResNet 50  75.920  92.814 
ResNext 101 32x8d  78.986  94.480 
Inception V3  77.176  93.354 
GoogleNet  69.826  89.404 
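Semantic Segmentation
The models subpackage contains definitions for the following model architectures for semantic segmentation: FCN with a ResNet-50 or ResNet-101 backbone, DeepLabV3 with a ResNet-50, ResNet-101 or MobileNetV3-Large backbone, and LR-ASPP with a MobileNetV3-Large backbone.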
As with image classification models, all pretrained models expect input images normalized in the same way. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
They have been trained on images resized such that their minimum size is 520.
The pretrained models have been trained on a subset of COCO train2017, on the 20 categories that are present in the Pascal VOC dataset. You can see more information on how the subset has been selected in references/segmentation/coco_utils.py. The classes that the pretrained model outputs are the following, in order:
['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor']
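The class list above is what turns per-pixel scores into labels: the index of the highest score at each pixel is looked up in the list. The sketch below uses a hand-built score tensor instead of a real model output. The (batch, classes, H, W) layout and the names VOC_CLASSES, scores and labels are assumptions for illustration.

```python
import torch

VOC_CLASSES = ['__background__', 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle',
               'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse',
               'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train',
               'tvmonitor']

# Stand-in for the per-pixel class scores of one 4x4 image.
scores = torch.zeros(1, len(VOC_CLASSES), 4, 4)
scores[0, 0] = 1.0               # background wins everywhere ...
scores[0, 15, 1:3, 1:3] = 2.0    # ... except a 2x2 patch scored as 'person'

# The highest-scoring class per pixel gives the label map.
labels = scores.argmax(dim=1)    # (batch, H, W) class indices
present = [VOC_CLASSES[i] for i in labels.unique().tolist()]
```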
The accuracies of the pretrained models evaluated on COCO val2017 are as follows:
Network  mean IoU  global pixelwise acc

FCN ResNet50  60.5  91.4
FCN ResNet101  63.7  91.9
DeepLabV3 ResNet50  66.4  92.4
DeepLabV3 ResNet101  67.4  92.4
DeepLabV3 MobileNetV3-Large  60.3  91.2
LR-ASPP MobileNetV3-Large  57.9  91.2
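Both columns of the table can be derived from a confusion matrix over pixels. The following is a minimal sketch of that computation on a four-pixel toy example. The function names are assumptions, and this is not the evaluation code that produced the reported numbers.

```python
import torch

def confusion_matrix(pred, target, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    index = num_classes * target + pred
    counts = torch.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(mat):
    """Return (global pixelwise accuracy, mean IoU) from a confusion matrix."""
    mat = mat.float()
    correct = mat.diag()
    global_acc = correct.sum() / mat.sum()
    # IoU = intersection / (ground truth + predicted - intersection)
    iou = correct / (mat.sum(dim=1) + mat.sum(dim=0) - correct)
    return global_acc.item(), iou.mean().item()

target = torch.tensor([0, 0, 1, 1])   # flattened ground-truth labels
pred = torch.tensor([0, 1, 1, 1])     # flattened predicted labels
global_acc, mean_iou = segmentation_metrics(confusion_matrix(pred, target, 2))
```

A class that never appears in either the prediction or the ground truth yields 0/0 in this sketch, so a real evaluation would need to decide how to treat such classes.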
Fully Convolutional Networks

torchvision.models.segmentation.fcn_resnet50(pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs)
Constructs a Fully-Convolutional Network model with a ResNet-50 backbone.
Parameters:  pretrained (bool) – If True, returns a model pretrained on COCO train2017 which contains the same classes as Pascal VOC
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 aux_loss (bool) – If True, it uses an auxiliary loss

torchvision.models.segmentation.fcn_resnet101(pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs)
Constructs a Fully-Convolutional Network model with a ResNet-101 backbone.
Parameters:  pretrained (bool) – If True, returns a model pretrained on COCO train2017 which contains the same classes as Pascal VOC
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 aux_loss (bool) – If True, it uses an auxiliary loss
DeepLabV3

torchvision.models.segmentation.deeplabv3_resnet50(pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs)
Constructs a DeepLabV3 model with a ResNet-50 backbone.
Parameters:  pretrained (bool) – If True, returns a model pretrained on COCO train2017 which contains the same classes as Pascal VOC
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 aux_loss (bool) – If True, it uses an auxiliary loss

torchvision.models.segmentation.deeplabv3_resnet101(pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs)
Constructs a DeepLabV3 model with a ResNet-101 backbone.
Parameters:  pretrained (bool) – If True, returns a model pretrained on COCO train2017 which contains the same classes as Pascal VOC
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 aux_loss (bool) – If True, it uses an auxiliary loss

torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(pretrained=False, progress=True, num_classes=21, aux_loss=None, **kwargs)
Constructs a DeepLabV3 model with a MobileNetV3-Large backbone.
Parameters:  pretrained (bool) – If True, returns a model pretrained on COCO train2017 which contains the same classes as Pascal VOC
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 aux_loss (bool) – If True, it uses an auxiliary loss
Object Detection, Instance Segmentation and Person Keypoint Detection
The models subpackage contains definitions for the following model architectures for detection: Faster R-CNN, RetinaNet, Mask R-CNN and Keypoint R-CNN.
The pretrained models for detection, instance segmentation and keypoint detection are initialized with the classification models in torchvision.
The models expect a list of Tensor[C, H, W], in the range 0-1. The models internally resize the images so that they have a minimum size of 800. This option can be changed by passing the option min_size to the constructor of the models.
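To give a feel for what a minimum size of 800 means for a particular image, the sketch below computes a resized shape with plain arithmetic. It assumes the shorter side is scaled to min_size unless that would push the longer side beyond a cap called max_size. The cap of 1333, the rounding and the helper name resized_shape are assumptions not stated in this section.

```python
def resized_shape(height, width, min_size=800, max_size=1333):
    """Scale so the shorter side becomes min_size, capped by max_size on the longer side."""
    scale = min(min_size / min(height, width), max_size / max(height, width))
    return round(height * scale), round(width * scale)

# A 480x640 image is limited by min_size; a 300x1200 image by max_size.
landscape = resized_shape(480, 640)
panorama = resized_shape(300, 1200)
```

Under these assumptions a very elongated image ends up with a shorter side well below 800, which is worth keeping in mind when detecting small objects.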
For object detection and instance segmentation, the pretrained models return the predictions of the following classes:
COCO_INSTANCE_CATEGORY_NAMES = [ '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table', 'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush' ]
Here is the summary of the accuracies for the models trained on the instances set of COCO train2017 and evaluated on COCO val2017.
Network  box AP  mask AP  keypoint AP

Faster R-CNN ResNet-50 FPN  37.0  -  -
Faster R-CNN MobileNetV3-Large FPN  32.8  -  -
Faster R-CNN MobileNetV3-Large 320 FPN  22.8  -  -
RetinaNet ResNet-50 FPN  36.4  -  -
Mask R-CNN ResNet-50 FPN  37.9  34.6  -
For person keypoint detection, the accuracies for the pretrained models are as follows:
Network  box AP  mask AP  keypoint AP

Keypoint R-CNN ResNet-50 FPN  54.6  -  65.0
For person keypoint detection, the pretrained model returns the keypoints in the following order:
COCO_PERSON_KEYPOINT_NAMES = [ 'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear', 'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow', 'left_wrist', 'right_wrist', 'left_hip', 'right_hip', 'left_knee', 'right_knee', 'left_ankle', 'right_ankle' ]
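Because the order is fixed, the names can be zipped with the keypoint rows of one detected person. The sketch below assumes each keypoint is a row of (x, y, visibility). That layout and the synthetic tensor are assumptions for illustration, since the keypoint output format is not described in this section.

```python
import torch

COCO_PERSON_KEYPOINT_NAMES = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle',
]

# Stand-in for the keypoints of one person: 17 rows of (x, y, visibility).
keypoints = torch.arange(17 * 3, dtype=torch.float32).reshape(17, 3)

# Pair each row with its name, keeping only the coordinates.
named = {
    name: (x, y)
    for name, (x, y, _visibility) in zip(COCO_PERSON_KEYPOINT_NAMES, keypoints.tolist())
}
```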
Runtime characteristics
The implementations of the models for object detection, instance segmentation and keypoint detection are efficient.
In the following table, we use 8 V100 GPUs, with CUDA 10.0 and CUDNN 7.4 to report the results. During training, we use a batch size of 2 per GPU, and during testing a batch size of 1 is used.
For test time, we report the time for the model evaluation and postprocessing (including mask pasting in image), but not the time for computing the precision-recall.
Network  train time (s / it)  test time (s / it)  memory (GB)

Faster R-CNN ResNet-50 FPN  0.2288  0.0590  5.2
Faster R-CNN MobileNetV3-Large FPN  0.1020  0.0415  1.0
Faster R-CNN MobileNetV3-Large 320 FPN  0.0978  0.0376  0.6
RetinaNet ResNet-50 FPN  0.2514  0.0939  4.1
Mask R-CNN ResNet-50 FPN  0.2728  0.0903  5.4
Keypoint R-CNN ResNet-50 FPN  0.3789  0.1242  6.8
Faster R-CNN

torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=91, pretrained_backbone=True, trainable_backbone_layers=None, **kwargs)
Constructs a Faster R-CNN model with a ResNet-50-FPN backbone.
The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, and should be in 0-1 range. Different images can have different sizes.
The behavior of the model changes depending on whether it is in training or evaluation mode.
During training, the model expects both the input tensors, as well as a targets (list of dictionary), containing:
 boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H.
 labels (Int64Tensor[N]): the class label for each ground-truth box
The model returns a Dict[Tensor] during training, containing the classification and regression losses for both the RPN and the R-CNN.
During inference, the model requires only the input tensors, and returns the post-processed predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as follows:
 boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H.
 labels (Int64Tensor[N]): the predicted labels for each image
 scores (Tensor[N]): the scores of each prediction
Faster R-CNN is exportable to ONNX for a fixed batch size with input images of fixed size.
Example:
>>> model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
>>> # For training
>>> images, boxes = torch.rand(4, 3, 600, 1200), torch.rand(4, 11, 4)
>>> # turn (x, y, w, h) into (x1, y1, x2, y2) so that x1 < x2 and y1 < y2
>>> boxes[:, :, 2:4] = boxes[:, :, 0:2] + boxes[:, :, 2:4]
>>> labels = torch.randint(1, 91, (4, 11))
>>> images = list(image for image in images)
>>> targets = []
>>> for i in range(len(images)):
...     d = {}
...     d['boxes'] = boxes[i]
...     d['labels'] = labels[i]
...     targets.append(d)
>>> output = model(images, targets)
>>> # For inference
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
>>>
>>> # optionally, if you want to export the model to ONNX:
>>> torch.onnx.export(model, x, "faster_rcnn.onnx", opset_version = 11)
Parameters:  pretrained (bool) – If True, returns a model pretrained on COCO train2017
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 pretrained_backbone (bool) – If True, returns a model with backbone pretrained on Imagenet
 trainable_backbone_layers (int) – number of trainable (not frozen) resnet layers starting from final block. Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.
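The constraints on training targets for fasterrcnn_resnet50_fpn (box shape, label dtype, and 0 <= x1 < x2 <= W, 0 <= y1 < y2 <= H) are easy to violate with hand-built data, so it can help to check them before training. The helper check_target below is a hypothetical utility written for this example, not part of torchvision.

```python
import torch

def check_target(target, height, width):
    """Raise ValueError if a training target violates the documented constraints."""
    boxes, labels = target["boxes"], target["labels"]
    if boxes.ndim != 2 or boxes.shape[1] != 4:
        raise ValueError("boxes must have shape [N, 4]")
    if labels.dtype != torch.int64 or labels.shape != boxes.shape[:1]:
        raise ValueError("labels must be an Int64Tensor[N]")
    x1, y1, x2, y2 = boxes.unbind(dim=1)
    if not bool(((x1 >= 0) & (x1 < x2) & (x2 <= width)).all()):
        raise ValueError("need 0 <= x1 < x2 <= W")
    if not bool(((y1 >= 0) & (y1 < y2) & (y2 <= height)).all()):
        raise ValueError("need 0 <= y1 < y2 <= H")

good = {"boxes": torch.tensor([[10.0, 20.0, 110.0, 220.0]]), "labels": torch.tensor([1])}
bad = {"boxes": torch.tensor([[110.0, 20.0, 10.0, 220.0]]), "labels": torch.tensor([1])}

check_target(good, height=600, width=1200)   # passes silently
```

Calling check_target(bad, height=600, width=1200) raises ValueError because x1 is larger than x2.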

torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=False, progress=True, num_classes=91, pretrained_backbone=True, trainable_backbone_layers=None, **kwargs)
Constructs a high resolution Faster R-CNN model with a MobileNetV3-Large FPN backbone. It works similarly to Faster R-CNN with ResNet-50 FPN backbone. See fasterrcnn_resnet50_fpn for more details.
Example:
>>> model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
Parameters:  pretrained (bool) – If True, returns a model pretrained on COCO train2017
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 pretrained_backbone (bool) – If True, returns a model with backbone pretrained on Imagenet
 trainable_backbone_layers (int) – number of trainable (not frozen) backbone layers starting from final block. Valid values are between 0 and 6, with 6 meaning all backbone layers are trainable.

torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(pretrained=False, progress=True, num_classes=91, pretrained_backbone=True, trainable_backbone_layers=None, **kwargs)
Constructs a low resolution Faster R-CNN model with a MobileNetV3-Large FPN backbone tuned for mobile use cases. It works similarly to Faster R-CNN with ResNet-50 FPN backbone. See fasterrcnn_resnet50_fpn for more details.
Example:
>>> model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
Parameters:  pretrained (bool) – If True, returns a model pretrained on COCO train2017
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 pretrained_backbone (bool) – If True, returns a model with backbone pretrained on Imagenet
 trainable_backbone_layers (int) – number of trainable (not frozen) backbone layers starting from final block. Valid values are between 0 and 6, with 6 meaning all backbone layers are trainable.
RetinaNet

torchvision.models.detection.retinanet_resnet50_fpn(pretrained=False, progress=True, num_classes=91, pretrained_backbone=True, trainable_backbone_layers=None, **kwargs)
Constructs a RetinaNet model with a ResNet-50-FPN backbone.
The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, and should be in 0-1 range. Different images can have different sizes.
The behavior of the model changes depending on whether it is in training or evaluation mode.
During training, the model expects both the input tensors, as well as a targets (list of dictionary), containing:
 boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H.
 labels (Int64Tensor[N]): the class label for each ground-truth box
The model returns a Dict[Tensor] during training, containing the classification and regression losses.
During inference, the model requires only the input tensors, and returns the post-processed predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as follows:
 boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H.
 labels (Int64Tensor[N]): the predicted labels for each image
 scores (Tensor[N]): the scores of each prediction
Example:
>>> model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
Parameters:  pretrained (bool) – If True, returns a model pretrained on COCO train2017
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 pretrained_backbone (bool) – If True, returns a model with backbone pretrained on Imagenet
 trainable_backbone_layers (int) – number of trainable (not frozen) resnet layers starting from final block. Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.
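Since inference returns one dictionary of boxes, labels and scores per image, a common follow-up step is to keep only the confident detections. The sketch below does this on a hand-written prediction dictionary. The threshold of 0.5, the helper name keep_confident and the sample values are assumptions for illustration.

```python
import torch

def keep_confident(prediction, threshold=0.5):
    """Keep only the detections whose score is at least `threshold`."""
    keep = prediction["scores"] >= threshold
    return {name: value[keep] for name, value in prediction.items()}

# Stand-in for the prediction of one image.
prediction = {
    "boxes": torch.tensor([[0.0, 0.0, 50.0, 50.0],
                           [10.0, 10.0, 40.0, 80.0],
                           [5.0, 5.0, 20.0, 20.0]]),
    "labels": torch.tensor([1, 18, 3]),
    "scores": torch.tensor([0.92, 0.40, 0.75]),
}
filtered = keep_confident(prediction)
```

The same boolean mask is applied to every field, so boxes, labels and scores stay aligned.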
Mask RCNN¶

torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=91, pretrained_backbone=True, trainable_backbone_layers=None, **kwargs)
Constructs a Mask RCNN model with a ResNet-50-FPN backbone.
The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, and should be in the 0-1 range. Different images can have different sizes.
The behavior of the model changes depending on whether it is in training or evaluation mode.
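The 0-1 range requirement means that 8-bit pixel values have to be rescaled before they reach the model. A minimal plain-Python sketch of that arithmetic follows; `to_unit_range` is a hypothetical helper that operates on a list of integers rather than a tensor.

```python
def to_unit_range(pixels_8bit):
    """Scale 8-bit values (0..255) into the 0-1 range."""
    return [value / 255.0 for value in pixels_8bit]


row = [0, 51, 255]
print(to_unit_range(row))  # [0.0, 0.2, 1.0]
```

In practice, torchvision.transforms.functional.to_tensor performs this scaling when it converts a PIL image or a uint8 array of shape H x W x C into a float tensor of shape C x H x W.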
During training, the model expects both the input tensors, as well as targets (a list of dictionaries), containing:
 boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H.
 labels (Int64Tensor[N]): the class label for each ground-truth box
 masks (UInt8Tensor[N, H, W]): the binary segmentation mask for each instance
The model returns a Dict[Tensor] during training, containing the classification and regression losses for both the RPN and the RCNN, and the mask loss.
During inference, the model requires only the input tensors, and returns the postprocessed predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as follows:
 boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H.
 labels (Int64Tensor[N]): the predicted label for each detection
 scores (Tensor[N]): the score of each prediction
 masks (UInt8Tensor[N, 1, H, W]): the predicted masks for each instance, in the 0-1 range. To obtain the final segmentation masks, the soft masks can be thresholded, generally with a value of 0.5 (mask >= 0.5)
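The thresholding step can be sketched in plain Python as follows. A nested list stands in for one H x W soft mask so the example runs without torch; `binarize_mask` and the sample values are hypothetical.

```python
def binarize_mask(soft_mask, threshold=0.5):
    """Turn an H x W grid of soft values into 0/1 using mask >= threshold."""
    return [[1 if p >= threshold else 0 for p in row] for row in soft_mask]


# A made-up 2 x 3 soft mask with values in the 0-1 range.
soft_mask = [
    [0.10, 0.50, 0.90],
    [0.49, 0.51, 0.00],
]

print(binarize_mask(soft_mask))  # [[0, 1, 1], [0, 1, 0]]
```

On a real masks tensor the same operation is the elementwise comparison `masks >= 0.5`.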
Mask RCNN is exportable to ONNX for a fixed batch size with input images of fixed size.
Example:
>>> import torch
>>> import torchvision
>>> model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
>>>
>>> # optionally, if you want to export the model to ONNX:
>>> torch.onnx.export(model, x, "mask_rcnn.onnx", opset_version=11)
Parameters:
 pretrained (bool) – If True, returns a model pretrained on COCO train2017
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 pretrained_backbone (bool) – If True, returns a model with backbone pretrained on ImageNet
 trainable_backbone_layers (int) – number of trainable (not frozen) ResNet layers, counted from the final block. Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.
Keypoint RCNN¶

torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=False, progress=True, num_classes=2, num_keypoints=17, pretrained_backbone=True, trainable_backbone_layers=None, **kwargs)
Constructs a Keypoint RCNN model with a ResNet-50-FPN backbone.
The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, and should be in the 0-1 range. Different images can have different sizes.
The behavior of the model changes depending on whether it is in training or evaluation mode.
During training, the model expects both the input tensors, as well as targets (a list of dictionaries), containing:
 boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format, with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H.
 labels (Int64Tensor[N]): the class label for each ground-truth box
 keypoints (FloatTensor[N, K, 3]): the locations of the K keypoints for each of the N instances, in the format [x, y, visibility], where visibility=0 means that the keypoint is not visible.
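The [x, y, visibility] layout can be illustrated by extracting the visible keypoints of a single instance. This is a plain-Python sketch in which a K x 3 nested list stands in for one row of the keypoints tensor; `visible_keypoints` and the coordinates are hypothetical.

```python
def visible_keypoints(instance_keypoints):
    """Return the (x, y) pairs of keypoints whose visibility flag is not 0."""
    return [(x, y) for x, y, visibility in instance_keypoints if visibility != 0]


# K = 3 keypoints for one instance; the second one is marked not visible.
person = [
    [120.0, 80.0, 1],
    [0.0, 0.0, 0],
    [131.5, 95.0, 1],
]

print(visible_keypoints(person))  # [(120.0, 80.0), (131.5, 95.0)]
```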
The model returns a Dict[Tensor] during training, containing the classification and regression losses for both the RPN and the RCNN, and the keypoint loss.
During inference, the model requires only the input tensors, and returns the postprocessed predictions as a List[Dict[Tensor]], one for each input image. The fields of the Dict are as follows:
 boxes (FloatTensor[N, 4]): the predicted boxes in [x1, y1, x2, y2] format, with 0 <= x1 < x2 <= W and 0 <= y1 < y2 <= H.
 labels (Int64Tensor[N]): the predicted label for each detection
 scores (Tensor[N]): the score of each prediction
 keypoints (FloatTensor[N, K, 3]): the locations of the predicted keypoints, in [x, y, v] format.
Keypoint RCNN is exportable to ONNX for a fixed batch size with input images of fixed size.
Example:
>>> import torch
>>> import torchvision
>>> model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
>>> model.eval()
>>> x = [torch.rand(3, 300, 400), torch.rand(3, 500, 400)]
>>> predictions = model(x)
>>>
>>> # optionally, if you want to export the model to ONNX:
>>> torch.onnx.export(model, x, "keypoint_rcnn.onnx", opset_version=11)
Parameters:
 pretrained (bool) – If True, returns a model pretrained on COCO train2017
 progress (bool) – If True, displays a progress bar of the download to stderr
 num_classes (int) – number of output classes of the model (including the background)
 num_keypoints (int) – number of keypoints, default 17
 pretrained_backbone (bool) – If True, returns a model with backbone pretrained on ImageNet
 trainable_backbone_layers (int) – number of trainable (not frozen) ResNet layers, counted from the final block. Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.
Video classification¶
We provide models for action recognition pretrained on Kinetics-400.
They have all been trained with the scripts provided in references/video_classification.
All pretrained models expect inputs normalized in the same way, i.e. mini-batches of 3-channel RGB videos of shape (3 x T x H x W), where H and W are expected to be 112, and T is the number of video frames in a clip. The frames have to be loaded into a range of [0, 1] and then normalized using mean = [0.43216, 0.394666, 0.37645] and std = [0.22803, 0.22145, 0.216989].
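The per-channel normalization can be sketched in plain Python as (value - mean) / std applied to every pixel of every frame. Here a nested list indexed [channel][frame][row][col] stands in for a single 3 x T x H x W clip; `normalize_clip` is a hypothetical helper, not the Normalize function from the reference scripts.

```python
MEAN = [0.43216, 0.394666, 0.37645]
STD = [0.22803, 0.22145, 0.216989]


def normalize_clip(clip):
    """Normalize a clip stored as nested lists indexed [channel][frame][row][col]."""
    return [
        [[[(value - MEAN[c]) / STD[c] for value in row] for row in frame] for frame in channel]
        for c, channel in enumerate(clip)
    ]


# A tiny 3 x 1 x 1 x 2 clip with values already in [0, 1]
# (real inputs are 3 x T x 112 x 112).
clip = [
    [[[0.43216, 1.0]]],
    [[[0.394666, 0.0]]],
    [[[0.37645, 0.5]]],
]

normalized = normalize_clip(clip)
print(normalized[0][0][0][0])  # 0.0, because the value equals the channel mean
```

The mean and std are indexed by the first (channel) dimension, which is why a 4-dimensional clip needs different handling from a 3-dimensional image.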
Note
The normalization parameters are different from the image classification ones, and correspond to the mean and std from Kinetics-400.
Note
For now, normalization code can be found in references/video_classification/transforms.py, see the Normalize function there. Note that it differs from standard normalization for images because it assumes the video is 4d.
Kinetics 1-crop accuracies for clip length 16 (16x112x112)
Network        Clip acc@1  Clip acc@5
ResNet 3D 18   52.75       75.45
ResNet MC 18   53.90       76.29
ResNet (2+1)D  57.50       78.81
ResNet 3D¶

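torchvision.models.video.r3d_18(pretrained=False, progress=True, **kwargs)
Constructs an 18-layer ResNet3D model as in https://arxiv.org/abs/1711.11248
Parameters:
 pretrained (bool) – If True, returns a model pretrained on Kinetics-400
 progress (bool) – If True, displays a progress bar of the download to stderr
Returns: R3D-18 network
Return type: nn.Module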
ResNet Mixed Convolution¶

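torchvision.models.video.mc3_18(pretrained=False, progress=True, **kwargs)
Constructs an 18-layer Mixed Convolution network as in https://arxiv.org/abs/1711.11248
Parameters:
 pretrained (bool) – If True, returns a model pretrained on Kinetics-400
 progress (bool) – If True, displays a progress bar of the download to stderr
Returns: MC3 network definition
Return type: nn.Module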
ResNet (2+1)D¶

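torchvision.models.video.r2plus1d_18(pretrained=False, progress=True, **kwargs)
Constructs the 18-layer deep R(2+1)D network as in https://arxiv.org/abs/1711.11248
Parameters:
 pretrained (bool) – If True, returns a model pretrained on Kinetics-400
 progress (bool) – If True, displays a progress bar of the download to stderr
Returns: R(2+1)D-18 network
Return type: nn.Module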