torchvision.ops
torchvision.ops implements operators that are specific to computer vision.
Note
All operators have native support for TorchScript.

torchvision.ops.nms(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) → torch.Tensor
Performs non-maximum suppression (NMS) on the boxes according to their intersection-over-union (IoU).
NMS iteratively removes lower scoring boxes which have an IoU greater than iou_threshold with another (higher scoring) box.
If multiple boxes have the exact same score and satisfy the IoU criterion with respect to a reference box, the selected box is not guaranteed to be the same between CPU and GPU. This is similar to the behavior of argsort in PyTorch when repeated values are present.
Parameters:  boxes (Tensor[N, 4]) – boxes to perform NMS on. They are expected to be in (x1, y1, x2, y2) format
 scores (Tensor[N]) – scores for each one of the boxes
 iou_threshold (float) – discards all overlapping boxes with IoU > iou_threshold
Returns: keep – int64 tensor with the indices of the elements that have been kept by NMS, sorted in decreasing order of scores
Return type: Tensor
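The greedy procedure described for nms can be sketched in plain Python. This is an illustrative reference over Python lists, not the library's implementation; the helper names `iou` and `nms_reference` are hypothetical, and ties are broken by input order here, which the real operator does not guarantee.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms_reference(boxes, scores, iou_threshold):
    """Greedy NMS; returns the kept indices, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if its IoU with every already-kept
        # (higher-scoring) box does not exceed the threshold.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [
    [0.0, 0.0, 10.0, 10.0],    # index 0
    [1.0, 1.0, 11.0, 11.0],    # index 1, IoU with index 0 is 81/119
    [50.0, 50.0, 60.0, 60.0],  # index 2, isolated
]
scores = [0.9, 0.8, 0.7]
keep = nms_reference(boxes, scores, iou_threshold=0.5)
print(keep)  # [0, 2]
```

Raising iou_threshold above 81/119 (about 0.68) would keep all three boxes in this example.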

torchvision.ops.batched_nms(boxes: torch.Tensor, scores: torch.Tensor, idxs: torch.Tensor, iou_threshold: float) → torch.Tensor
Performs non-maximum suppression in a batched fashion.
Each index value corresponds to a category, and NMS will not be applied between elements of different categories.
Parameters:  boxes (Tensor[N, 4]) – boxes where NMS will be performed. They are expected to be in (x1, y1, x2, y2) format
 scores (Tensor[N]) – scores for each one of the boxes
 idxs (Tensor[N]) – indices of the categories for each one of the boxes.
 iou_threshold (float) – discards all overlapping boxes with IoU > iou_threshold
Returns: keep – int64 tensor with the indices of the elements that have been kept by NMS, sorted in decreasing order of scores
Return type: Tensor
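To illustrate the per-category behaviour, the sketch below restricts suppression to boxes that share the same entry in idxs. It is a plain-Python reference under that reading of the description, not the library's implementation; `_iou` and `batched_nms_reference` are hypothetical names.

```python
def _iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def batched_nms_reference(boxes, scores, idxs, iou_threshold):
    """Greedy NMS in which only boxes of the same category suppress each other."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Only already-kept boxes of the same category compete with box i.
        rivals = [j for j in keep if idxs[j] == idxs[i]]
        if all(_iou(boxes[i], boxes[j]) <= iou_threshold for j in rivals):
            keep.append(i)
    return keep


boxes = [
    [0.0, 0.0, 10.0, 10.0],
    [1.0, 1.0, 11.0, 11.0],    # overlaps heavily with the first box
    [50.0, 50.0, 60.0, 60.0],
]
scores = [0.9, 0.8, 0.7]

same_category = batched_nms_reference(boxes, scores, [0, 0, 0], 0.5)
mixed_category = batched_nms_reference(boxes, scores, [0, 1, 0], 0.5)
print(same_category)   # [0, 2]
print(mixed_category)  # [0, 1, 2]
```

The second call keeps the overlapping box because it belongs to a different category than the box it overlaps with.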

torchvision.ops.remove_small_boxes(boxes: torch.Tensor, min_size: float) → torch.Tensor
Remove boxes that have at least one side smaller than min_size.
Parameters:  boxes (Tensor[N, 4]) – boxes in (x1, y1, x2, y2) format
 min_size (float) – minimum size
Returns: indices of the boxes that have both sides larger than min_size
Return type: keep (Tensor[K])
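A minimal plain-Python sketch of this filter is shown below. The name `remove_small_boxes_reference` is hypothetical, and treating a side exactly equal to min_size as large enough is an assumption; the description above does not settle that boundary case.

```python
def remove_small_boxes_reference(boxes, min_size):
    """Return indices of (x1, y1, x2, y2) boxes whose width and height are both large enough."""
    keep = []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        width, height = x2 - x1, y2 - y1
        # Assumption: a side equal to min_size counts as large enough.
        if width >= min_size and height >= min_size:
            keep.append(i)
    return keep


boxes = [
    [0.0, 0.0, 10.0, 10.0],  # 10 x 10, kept
    [0.0, 0.0, 2.0, 10.0],   # 2 x 10, too narrow
    [5.0, 5.0, 10.0, 10.0],  # 5 x 5, kept
]
kept = remove_small_boxes_reference(boxes, min_size=4.0)
print(kept)  # [0, 2]
```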

torchvision.ops.clip_boxes_to_image(boxes: torch.Tensor, size: Tuple[int, int]) → torch.Tensor
Clip boxes so that they lie inside an image of size size.
Parameters:  boxes (Tensor[N, 4]) – boxes in (x1, y1, x2, y2) format
 size (Tuple[height, width]) – size of the image
Returns: clipped_boxes (Tensor[N, 4])
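Because size is given as (height, width) while boxes are stored as (x, y) pairs, the two orders are easy to mix up. The sketch below shows one way to read the operation, assuming x coordinates are clamped to [0, width] and y coordinates to [0, height]; `clip_boxes_reference` is a hypothetical plain-Python helper, not the library's implementation.

```python
def clip_boxes_reference(boxes, size):
    """Clamp (x1, y1, x2, y2) boxes to an image of the given (height, width)."""
    height, width = size
    clipped = []
    for x1, y1, x2, y2 in boxes:
        clipped.append([
            min(max(x1, 0.0), width),   # x coordinates are limited by the width
            min(max(y1, 0.0), height),  # y coordinates are limited by the height
            min(max(x2, 0.0), width),
            min(max(y2, 0.0), height),
        ])
    return clipped


# An 80-pixel-high, 100-pixel-wide image.
clipped = clip_boxes_reference([[-5.0, 10.0, 120.0, 90.0]], size=(80, 100))
print(clipped)  # [[0.0, 10.0, 100, 80]]
```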

torchvision.ops.box_convert(boxes: torch.Tensor, in_fmt: str, out_fmt: str) → torch.Tensor
Converts boxes from the given in_fmt to out_fmt. The supported values for in_fmt and out_fmt are:
'xyxy': boxes are represented via corners, x1, y1 being the top left and x2, y2 being the bottom right.
'xywh': boxes are represented via corner, width and height, x1, y1 being the top left, w, h being the width and height.
'cxcywh': boxes are represented via center, width and height, cx, cy being the center of the box, w, h being the width and height.
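Parameters:  boxes (Tensor[N, 4]) – boxes to be converted
 in_fmt (str) – format of the given boxes, one of 'xyxy', 'xywh', 'cxcywh'
 out_fmt (str) – format of the returned boxes, one of 'xyxy', 'xywh', 'cxcywh'
Returns: boxes converted to out_fmt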
Return type: boxes (Tensor[N, 4])
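The relationship between the three formats can be written out for a single box. The functions below are hypothetical plain-Python helpers that follow the format definitions given above; they are a sketch of the arithmetic, not the library's code.

```python
def xyxy_to_xywh(box):
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


def xywh_to_xyxy(box):
    x1, y1, w, h = box
    return [x1, y1, x1 + w, y1 + h]


def xyxy_to_cxcywh(box):
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]


def cxcywh_to_xyxy(box):
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


box = [10.0, 20.0, 50.0, 80.0]   # corners: top left (10, 20), bottom right (50, 80)
as_xywh = xyxy_to_xywh(box)      # [10.0, 20.0, 40.0, 60.0]
as_cxcywh = xyxy_to_cxcywh(box)  # [30.0, 50.0, 40.0, 60.0]
print(as_xywh, as_cxcywh)
```

Converting back with xywh_to_xyxy or cxcywh_to_xyxy recovers the original corners for this box.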

torchvision.ops.box_area(boxes: torch.Tensor) → torch.Tensor
Computes the area of a set of bounding boxes, which are specified by their (x1, y1, x2, y2) coordinates.
Parameters:  boxes (Tensor[N, 4]) – boxes for which the area will be computed. They are expected to be in (x1, y1, x2, y2) format
Returns: area for each box
Return type: area (Tensor[N])

torchvision.ops.box_iou(boxes1: torch.Tensor, boxes2: torch.Tensor) → torch.Tensor
Return intersection-over-union (Jaccard index) of boxes.
Both sets of boxes are expected to be in (x1, y1, x2, y2) format.
Parameters:  boxes1 (Tensor[N, 4]) –
 boxes2 (Tensor[M, 4]) –
Returns: the NxM matrix containing the pairwise IoU values for every element in boxes1 and boxes2
Return type: iou (Tensor[N, M])
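The N x M layout of the result can be illustrated with a small plain-Python sketch: entry [i][j] compares box i of the first set with box j of the second. `box_area_reference` and `box_iou_reference` are hypothetical names and work on nested lists rather than tensors.

```python
def box_area_reference(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def box_iou_reference(boxes1, boxes2):
    """Return an N x M nested list of pairwise IoU values."""
    result = []
    for a in boxes1:
        row = []
        for b in boxes2:
            # Intersection rectangle; empty intersections get zero area.
            iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            inter = iw * ih
            union = box_area_reference(a) + box_area_reference(b) - inter
            row.append(inter / union if union > 0 else 0.0)
        result.append(row)
    return result


boxes1 = [[0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 5.0, 5.0]]                                # N = 2
boxes2 = [[0.0, 0.0, 10.0, 10.0], [5.0, 5.0, 15.0, 15.0], [20.0, 20.0, 30.0, 30.0]]  # M = 3
iou_matrix = box_iou_reference(boxes1, boxes2)
for row in iou_matrix:
    print(row)
```

In this example the first row starts with 1.0 (identical boxes) and the last column is all zeros (no overlap).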

torchvision.ops.generalized_box_iou(boxes1: torch.Tensor, boxes2: torch.Tensor) → torch.Tensor
Return generalized intersection-over-union (Jaccard index) of boxes.
Both sets of boxes are expected to be in (x1, y1, x2, y2) format.
Parameters:  boxes1 (Tensor[N, 4]) –
 boxes2 (Tensor[M, 4]) –
Returns: the NxM matrix containing the pairwise generalized_IoU values for every element in boxes1 and boxes2
Return type: generalized_iou (Tensor[N, M])
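Generalized IoU is commonly defined as the IoU minus the fraction of the smallest enclosing box that is not covered by the union, so it ranges over (-1, 1] and stays informative for disjoint boxes. The sketch below applies that definition to a single pair of boxes; `giou_reference` is a hypothetical helper and assumes both boxes have positive area.

```python
def giou_reference(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])

    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter

    # Smallest axis-aligned box that encloses both inputs.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))

    return inter / union - (enclose - union) / enclose


a = [0.0, 0.0, 10.0, 10.0]
b = [20.0, 0.0, 30.0, 10.0]  # disjoint from a

same = giou_reference(a, a)   # identical boxes give 1.0
apart = giou_reference(a, b)  # IoU is 0, enclosing area 300, union 200, so -1/3
print(same, apart)
```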

torchvision.ops.roi_align(input: torch.Tensor, boxes: torch.Tensor, output_size: None, spatial_scale: float = 1.0, sampling_ratio: int = -1, aligned: bool = False) → torch.Tensor
Performs the Region of Interest (RoI) Align operator described in Mask R-CNN.
Parameters:  input (Tensor[N, C, H, W]) – input tensor
 boxes (Tensor[K, 5] or List[Tensor[L, 4]]) – the box coordinates in (x1, y1, x2, y2) format where the regions will be taken from. If a single Tensor is passed, then the first column should contain the batch index. If a list of Tensors is passed, then each Tensor will correspond to the boxes for an element i in a batch
 output_size (int or Tuple[int, int]) – the size of the output after the cropping is performed, as (height, width)
 spatial_scale (float) – a scaling factor that maps the box coordinates to the input coordinates. Default: 1.0
 sampling_ratio (int) – number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, then exactly sampling_ratio x sampling_ratio grid points are used. If <= 0, then an adaptive number of grid points is used (computed as ceil(roi_width / pooled_w), and likewise for height). Default: -1
 aligned (bool) – If False, use the legacy implementation. If True, shift the box coordinates by -0.5 pixels so that they align more precisely with the two neighboring pixel indices. This is the version used in Detectron2
Returns: output (Tensor[K, C, output_size[0], output_size[1]])
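The rule for the number of sampling points per output bin can be made concrete with a small calculation. The sketch below assumes that box coordinates are first multiplied by spatial_scale to land on the feature map, and it ignores the aligned shift and any minimum RoI size; `sampling_grid` is a hypothetical helper, not part of the library.

```python
import math


def sampling_grid(roi, output_size, spatial_scale=1.0, sampling_ratio=-1):
    """Return (points_h, points_w): sampling points per output bin along each axis."""
    pooled_h, pooled_w = output_size
    # Map the box from image coordinates onto the feature map.
    x1, y1, x2, y2 = (c * spatial_scale for c in roi)
    roi_w, roi_h = x2 - x1, y2 - y1
    if sampling_ratio > 0:
        # Fixed grid: sampling_ratio x sampling_ratio points per bin.
        return sampling_ratio, sampling_ratio
    # Adaptive grid: roughly one sampling point per feature-map pixel in the bin.
    return math.ceil(roi_h / pooled_h), math.ceil(roi_w / pooled_w)


# A 64 x 32 box on the image covers 16 x 8 cells of a feature map at 1/4 scale.
adaptive = sampling_grid((0.0, 0.0, 64.0, 32.0), (2, 2), spatial_scale=0.25)
fixed = sampling_grid((0.0, 0.0, 64.0, 32.0), (2, 2), spatial_scale=0.25, sampling_ratio=2)
print(adaptive)  # (4, 8)
print(fixed)     # (2, 2)
```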

torchvision.ops.ps_roi_align(input: torch.Tensor, boxes: torch.Tensor, output_size: int, spatial_scale: float = 1.0, sampling_ratio: int = -1) → torch.Tensor
Performs the Position-Sensitive Region of Interest (RoI) Align operator mentioned in Light-Head R-CNN.
Parameters:  input (Tensor[N, C, H, W]) – input tensor
 boxes (Tensor[K, 5] or List[Tensor[L, 4]]) – the box coordinates in (x1, y1, x2, y2) format where the regions will be taken from. If a single Tensor is passed, then the first column should contain the batch index. If a list of Tensors is passed, then each Tensor will correspond to the boxes for an element i in a batch
 output_size (int or Tuple[int, int]) – the size of the output after the cropping is performed, as (height, width)
 spatial_scale (float) – a scaling factor that maps the box coordinates to the input coordinates. Default: 1.0
 sampling_ratio (int) – number of sampling points in the interpolation grid used to compute the output value of each pooled output bin. If > 0, then exactly sampling_ratio x sampling_ratio grid points are used. If <= 0, then an adaptive number of grid points is used (computed as ceil(roi_width / pooled_w), and likewise for height). Default: -1
Returns: output (Tensor[K, C, output_size[0], output_size[1]])

torchvision.ops.roi_pool(input: torch.Tensor, boxes: torch.Tensor, output_size: None, spatial_scale: float = 1.0) → torch.Tensor
Performs the Region of Interest (RoI) Pool operator described in Fast R-CNN.
Parameters:  input (Tensor[N, C, H, W]) – input tensor
 boxes (Tensor[K, 5] or List[Tensor[L, 4]]) – the box coordinates in (x1, y1, x2, y2) format where the regions will be taken from. If a single Tensor is passed, then the first column should contain the batch index. If a list of Tensors is passed, then each Tensor will correspond to the boxes for an element i in a batch
 output_size (int or Tuple[int, int]) – the size of the output after the cropping is performed, as (height, width)
 spatial_scale (float) – a scaling factor that maps the box coordinates to the input coordinates. Default: 1.0
Returns: output (Tensor[K, C, output_size[0], output_size[1]])
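The two accepted layouts for boxes carry the same information: a list with one group of boxes per batch element, or a single K x 5 table whose first column is the batch index. The sketch below converts the first layout into the second using nested Python lists in place of tensors; `to_roi_format` is a hypothetical helper written for illustration.

```python
def to_roi_format(boxes_per_image):
    """Flatten per-image lists of (x1, y1, x2, y2) boxes into rows of
    (batch_index, x1, y1, x2, y2)."""
    rois = []
    for batch_index, boxes in enumerate(boxes_per_image):
        for x1, y1, x2, y2 in boxes:
            rois.append([float(batch_index), x1, y1, x2, y2])
    return rois


boxes_per_image = [
    [[0.0, 0.0, 10.0, 10.0], [5.0, 5.0, 20.0, 20.0]],  # boxes for batch element 0
    [[1.0, 2.0, 3.0, 4.0]],                            # boxes for batch element 1
]
rois = to_roi_format(boxes_per_image)
for row in rois:
    print(row)
```

Here K is 3, so the pooled output would have three entries along its first dimension, one per box.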

torchvision.ops.ps_roi_pool(input: torch.Tensor, boxes: torch.Tensor, output_size: int, spatial_scale: float = 1.0) → torch.Tensor
Performs the Position-Sensitive Region of Interest (RoI) Pool operator described in R-FCN.
Parameters:  input (Tensor[N, C, H, W]) – input tensor
 boxes (Tensor[K, 5] or List[Tensor[L, 4]]) – the box coordinates in (x1, y1, x2, y2) format where the regions will be taken from. If a single Tensor is passed, then the first column should contain the batch index. If a list of Tensors is passed, then each Tensor will correspond to the boxes for an element i in a batch
 output_size (int or Tuple[int, int]) – the size of the output after the cropping is performed, as (height, width)
 spatial_scale (float) – a scaling factor that maps the box coordinates to the input coordinates. Default: 1.0
Returns: output (Tensor[K, C, output_size[0], output_size[1]])

torchvision.ops.deform_conv2d(input: torch.Tensor, offset: torch.Tensor, weight: torch.Tensor, bias: Union[torch.Tensor, NoneType] = None, stride: Tuple[int, int] = (1, 1), padding: Tuple[int, int] = (0, 0), dilation: Tuple[int, int] = (1, 1)) → torch.Tensor
Performs Deformable Convolution, described in Deformable Convolutional Networks.
Parameters:  input (Tensor[batch_size, in_channels, in_height, in_width]) – input tensor
 offset (Tensor[batch_size, 2 * offset_groups * kernel_height * kernel_width, out_height, out_width]) – offsets to be applied for each position in the convolution kernel.
 weight (Tensor[out_channels, in_channels // groups, kernel_height, kernel_width]) – convolution weights, split into groups of size (in_channels // groups)
 bias (Tensor[out_channels]) – optional bias of shape (out_channels,). Default: None
 stride (int or Tuple[int, int]) – distance between convolution centers. Default: 1
 padding (int or Tuple[int, int]) – height/width of padding of zeroes around each image. Default: 0
 dilation (int or Tuple[int, int]) – the spacing between kernel elements. Default: 1
Returns: result of convolution
Return type: output (Tensor[batch_sz, out_channels, out_h, out_w])
Examples:
>>> import torch
>>> from torchvision.ops import deform_conv2d
>>> input = torch.rand(4, 3, 10, 10)
>>> kh, kw = 3, 3
>>> weight = torch.rand(5, 3, kh, kw)
>>> # offset should have the same spatial size as the output
>>> # of the convolution. In this case, for an input of 10, stride of 1
>>> # and kernel size of 3, without padding, the output size is 8
>>> offset = torch.rand(4, 2 * kh * kw, 8, 8)
>>> out = deform_conv2d(input, offset, weight)
>>> print(out.shape)
torch.Size([4, 5, 8, 8])
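The offset tensor has to match the spatial size of the convolution output, which is easy to get wrong once stride, padding, or dilation differ from their defaults. The sketch below uses the standard convolution output-size formula to derive the expected offset shape; `conv_output_size` and `offset_shape` are hypothetical helpers, and the formula is assumed to apply to deform_conv2d in the same way as to an ordinary 2D convolution.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0, dilation=1):
    """Standard convolution output length along one spatial axis."""
    return (in_size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def offset_shape(batch_size, in_hw, kernel_hw, stride=(1, 1), padding=(0, 0),
                 dilation=(1, 1), offset_groups=1):
    """Expected shape of the offset tensor for the given input and kernel sizes."""
    out_h = conv_output_size(in_hw[0], kernel_hw[0], stride[0], padding[0], dilation[0])
    out_w = conv_output_size(in_hw[1], kernel_hw[1], stride[1], padding[1], dilation[1])
    # Two values (one per spatial axis) for every kernel position and offset group.
    channels = 2 * offset_groups * kernel_hw[0] * kernel_hw[1]
    return (batch_size, channels, out_h, out_w)


# Same setting as the example above: 10 x 10 input, 3 x 3 kernel, no padding.
shape = offset_shape(4, (10, 10), (3, 3))
print(shape)  # (4, 18, 8, 8)
```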

class torchvision.ops.RoIAlign(output_size: None, spatial_scale: float, sampling_ratio: int, aligned: bool = False)
See roi_align

class torchvision.ops.PSRoIAlign(output_size: int, spatial_scale: float, sampling_ratio: int)
See ps_roi_align

class torchvision.ops.DeformConv2d(in_channels: int, out_channels: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, bias: bool = True)
See deform_conv2d

class torchvision.ops.MultiScaleRoIAlign(featmap_names: List[str], output_size: Union[int, Tuple[int], List[int]], sampling_ratio: int)
Multi-scale RoIAlign pooling, which is useful for detection with or without FPN.
It infers the scale of the pooling via the heuristics present in the FPN paper.
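Parameters:  featmap_names (List[str]) – the names of the feature maps that will be used for the pooling
 output_size (int, Tuple[int] or List[int]) – output size of the pooled region
 sampling_ratio (int) – sampling ratio for RoIAlign
Examples: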
>>> from collections import OrderedDict
>>> import torch
>>> import torchvision
>>> m = torchvision.ops.MultiScaleRoIAlign(['feat1', 'feat3'], 3, 2)
>>> i = OrderedDict()
>>> i['feat1'] = torch.rand(1, 5, 64, 64)
>>> i['feat2'] = torch.rand(1, 5, 32, 32)  # this feature won't be used in the pooling
>>> i['feat3'] = torch.rand(1, 5, 16, 16)
>>> # create some random bounding boxes
>>> boxes = torch.rand(6, 4) * 256
>>> boxes[:, 2:] += boxes[:, :2]
>>> # original image size, before computing the feature maps
>>> image_sizes = [(512, 512)]
>>> output = m(i, [boxes], image_sizes)
>>> print(output.shape)
torch.Size([6, 5, 3, 3])
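The FPN paper assigns each box to a pyramid level with k = floor(k0 + log2(sqrt(w * h) / 224)), so larger boxes are pooled from coarser feature maps. The sketch below applies that rule to single boxes. The function name `fpn_level`, the level range 2 to 5, and the canonical values k0 = 4 and 224 are taken from the paper as assumptions; how this class derives its level range from the supplied feature maps is not shown here.

```python
import math


def fpn_level(box, k_min=2, k_max=5, canonical_scale=224, canonical_level=4):
    """Pyramid level for a (x1, y1, x2, y2) box with positive area."""
    x1, y1, x2, y2 = box
    scale = math.sqrt((x2 - x1) * (y2 - y1))
    level = math.floor(canonical_level + math.log2(scale / canonical_scale))
    # Clamp to the levels that actually exist in the pyramid.
    return max(k_min, min(k_max, level))


levels = [
    fpn_level((0, 0, 224, 224)),    # canonical size maps to level 4
    fpn_level((0, 0, 112, 112)),    # half the side length maps one level down, 3
    fpn_level((0, 0, 10, 10)),      # very small boxes are clamped to level 2
    fpn_level((0, 0, 1000, 1000)),  # very large boxes are clamped to level 5
]
print(levels)  # [4, 3, 2, 5]
```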

class torchvision.ops.FeaturePyramidNetwork(in_channels_list: List[int], out_channels: int, extra_blocks: Union[torchvision.ops.feature_pyramid_network.ExtraFPNBlock, NoneType] = None)
Module that adds an FPN on top of a set of feature maps. This is based on "Feature Pyramid Networks for Object Detection".
The feature maps are currently supposed to be in increasing depth order.
The input to the model is expected to be an OrderedDict[Tensor], containing the feature maps on top of which the FPN will be added.
Parameters:  in_channels_list (list[int]) – number of channels for each feature map that is passed to the module
 out_channels (int) – number of channels of the FPN representation
 extra_blocks (ExtraFPNBlock or None) – if provided, extra operations will be performed. It is expected to take the FPN features, the original features and the names of the original features as input, and to return a new list of feature maps and their corresponding names
Examples:
>>> from collections import OrderedDict
>>> import torch
>>> import torchvision
>>> m = torchvision.ops.FeaturePyramidNetwork([10, 20, 30], 5)
>>> # get some dummy data
>>> x = OrderedDict()
>>> x['feat0'] = torch.rand(1, 10, 64, 64)
>>> x['feat2'] = torch.rand(1, 20, 16, 16)
>>> x['feat3'] = torch.rand(1, 30, 8, 8)
>>> # compute the FPN on top of x
>>> output = m(x)
>>> print([(k, v.shape) for k, v in output.items()])
[('feat0', torch.Size([1, 5, 64, 64])), ('feat2', torch.Size([1, 5, 16, 16])), ('feat3', torch.Size([1, 5, 8, 8]))]