Transforming and augmenting images

Torchvision supports common computer vision transformations in the torchvision.transforms and torchvision.transforms.v2 modules. Transforms can be used to transform or augment data for training or inference of different tasks (image classification, detection, segmentation, video classification).

# Image Classification
import torch
from torchvision.transforms import v2

H, W = 32, 32
img = torch.randint(0, 256, size=(3, H, W), dtype=torch.uint8)

transforms = v2.Compose([
    v2.RandomResizedCrop(size=(224, 224), antialias=True),
    v2.RandomHorizontalFlip(p=0.5),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = transforms(img)

# Detection (re-using imports and transforms from above)
from torchvision import tv_tensors

img = torch.randint(0, 256, size=(3, H, W), dtype=torch.uint8)
boxes = torch.randint(0, H // 2, size=(3, 4))
boxes[:, 2:] += boxes[:, :2]
boxes = tv_tensors.BoundingBoxes(boxes, format="XYXY", canvas_size=(H, W))

# The same transforms can be used!
img, boxes = transforms(img, boxes)
# And you can pass arbitrary input structures
output_dict = transforms({"image": img, "boxes": boxes})

Transforms are typically passed as the transform or transforms argument to the Datasets.

Start here

Whether you’re new to Torchvision transforms, or you’re already experienced with them, we encourage you to start with Getting started with transforms v2 in order to learn more about what can be done with the new v2 transforms.

Then, browse the sections below on this page for general information and performance tips. The available transforms and functionals are listed in the API reference.

More information and tutorials can also be found in our example gallery, e.g. Transforms v2: End-to-end object detection/segmentation example or How to write your own v2 transforms.

Supported input types and conventions

Most transformations accept both PIL images and tensor inputs. Both CPU and CUDA tensors are supported. The result of both backends (PIL or Tensors) should be very close. In general, we recommend relying on the tensor backend for performance. The conversion transforms may be used to convert to and from PIL images, or for converting dtypes and ranges.

Tensor images are expected to be of shape (C, H, W), where C is the number of channels, and H and W refer to height and width. Most transforms support batched tensor input. A batch of tensor images is a tensor of shape (N, C, H, W), where N is the number of images in the batch. The v2 transforms generally accept an arbitrary number of leading dimensions (..., C, H, W) and can handle batched images or batched videos.

Dtype and expected value range

The expected range of the values of a tensor image is implicitly defined by the tensor dtype. Tensor images with a float dtype are expected to have values in [0, 1]. Tensor images with an integer dtype are expected to have values in [0, MAX_DTYPE] where MAX_DTYPE is the largest value that can be represented in that dtype. Typically, images of dtype torch.uint8 are expected to have values in [0, 255].

Use ToDtype to convert both the dtype and range of the inputs.

V1 or V2? Which one should I use?

TL;DR We recommend using the torchvision.transforms.v2 transforms instead of those in torchvision.transforms. They’re faster and they can do more things. Just change the import and you should be good to go.

In Torchvision 0.15 (March 2023), we released a new set of transforms available in the torchvision.transforms.v2 namespace. Compared to the v1 ones (in torchvision.transforms), these transforms are faster, can transform bounding boxes in addition to images, and accept arbitrary input structures such as dicts.

These transforms are fully backward compatible with the v1 ones, so if you’re already using transforms from torchvision.transforms, all you need to do is update the import to torchvision.transforms.v2. In terms of output, there might be negligible differences due to implementation differences.

Note

The v2 transforms are still BETA, but at this point we do not expect disruptive changes to be made to their public APIs. We’re planning to make them fully stable in version 0.17. Please submit any feedback you may have here.

Performance considerations

We recommend the following guidelines to get the best performance out of the transforms:

  • Rely on the v2 transforms from torchvision.transforms.v2

  • Use tensors instead of PIL images

  • Use torch.uint8 dtype, especially for resizing

  • Resize with bilinear or bicubic mode

This is what a typical transform pipeline could look like:

import torch
from torchvision.transforms import v2

transforms = v2.Compose([
    v2.ToImage(),  # Convert to tensor, only needed if you had a PIL image
    v2.ToDtype(torch.uint8, scale=True),  # optional, most inputs are already uint8 at this point
    # ...
    v2.RandomResizedCrop(size=(224, 224), antialias=True),  # Or Resize(antialias=True)
    # ...
    v2.ToDtype(torch.float32, scale=True),  # Normalize expects float input
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

The above should give you the best performance in a typical training environment that relies on the torch.utils.data.DataLoader with num_workers > 0.

Transforms tend to be sensitive to the input strides / memory format. Some transforms will be faster with channels-first images while others prefer channels-last. Like torch operators, most transforms will preserve the memory format of the input, but this may not always be respected due to implementation details. You may want to experiment a bit if you’re chasing the very best performance. Using torch.compile() on individual transforms may also help factor out the memory format variable (e.g. on Normalize). Note that we’re talking about memory format, not tensor shape.

Note that resize transforms like Resize and RandomResizedCrop typically prefer channels-last input and tend not to benefit from torch.compile() at this time.

Transform classes, functionals, and kernels

Transforms are available as classes like Resize, but also as functionals like resize() in the torchvision.transforms.v2.functional namespace. This is very much like the torch.nn package which defines both classes and functional equivalents in torch.nn.functional.

The functionals support PIL images, pure tensors, or TVTensors, e.g. both resize(image_tensor) and resize(boxes) are valid.

Note

Random transforms like RandomCrop will randomly sample some parameter each time they’re called. Their functional counterpart (crop()) does not do any kind of random sampling and thus has a slightly different parametrization. The get_params() class method of the transforms class can be used to perform parameter sampling when using the functional APIs.

The torchvision.transforms.v2.functional namespace also contains what we call the “kernels”. These are the low-level functions that implement the core functionalities for specific types, e.g. resize_bounding_boxes or resized_crop_mask. They are public, although not documented. Check the code to see which ones are available (note that those starting with a leading underscore are not public!). Kernels are only really useful if you want torchscript support for types like bounding boxes or masks.

Torchscript support

Most transform classes and functionals support torchscript. For composing transforms, use torch.nn.Sequential instead of Compose:

import torch
from torchvision.transforms import CenterCrop, Normalize

transforms = torch.nn.Sequential(
    CenterCrop(10),
    Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
)
scripted_transforms = torch.jit.script(transforms)

Warning

v2 transforms support torchscript, but if you call torch.jit.script() on a v2 class transform, you’ll actually end up with its (scripted) v1 equivalent. This may lead to slightly different results between the scripted and eager executions due to implementation differences between v1 and v2.

If you really need torchscript support for the v2 transforms, we recommend scripting the functionals from the torchvision.transforms.v2.functional namespace to avoid surprises.

Also note that the functionals only support torchscript for pure tensors, which are always treated as images. If you need torchscript support for other types like bounding boxes or masks, you can rely on the low-level kernels.

For any custom transformations to be used with torch.jit.script, they should be derived from torch.nn.Module.

See also: Torchscript support.

V1 API Reference

Geometry

Resize(size[, interpolation, max_size, ...])

Resize the input image to the given size.

RandomCrop(size[, padding, pad_if_needed, ...])

Crop the given image at a random location.

RandomResizedCrop(size[, scale, ratio, ...])

Crop a random portion of image and resize it to a given size.

CenterCrop(size)

Crops the given image at the center.

FiveCrop(size)

Crop the given image into four corners and the central crop.

TenCrop(size[, vertical_flip])

Crop the given image into four corners and the central crop plus the flipped version of these (horizontal flipping is used by default).

Pad(padding[, fill, padding_mode])

Pad the given image on all sides with the given "pad" value.

RandomRotation(degrees[, interpolation, ...])

Rotate the image by angle.

RandomAffine(degrees[, translate, scale, ...])

Random affine transformation of the image keeping center invariant.

RandomPerspective([distortion_scale, p, ...])

Performs a random perspective transformation of the given image with a given probability.

ElasticTransform([alpha, sigma, ...])

Transform a tensor image with elastic transformations.

RandomHorizontalFlip([p])

Horizontally flip the given image randomly with a given probability.

RandomVerticalFlip([p])

Vertically flip the given image randomly with a given probability.

Color

ColorJitter([brightness, contrast, ...])

Randomly change the brightness, contrast, saturation and hue of an image.

Grayscale([num_output_channels])

Convert image to grayscale.

RandomGrayscale([p])

Randomly convert image to grayscale with a probability of p (default 0.1).

GaussianBlur(kernel_size[, sigma])

Blurs image with randomly chosen Gaussian blur.

RandomInvert([p])

Inverts the colors of the given image randomly with a given probability.

RandomPosterize(bits[, p])

Posterize the image randomly with a given probability by reducing the number of bits for each color channel.

RandomSolarize(threshold[, p])

Solarize the image randomly with a given probability by inverting all pixel values above a threshold.

RandomAdjustSharpness(sharpness_factor[, p])

Adjust the sharpness of the image randomly with a given probability.

RandomAutocontrast([p])

Autocontrast the pixels of the given image randomly with a given probability.

RandomEqualize([p])

Equalize the histogram of the given image randomly with a given probability.

Composition

Compose(transforms)

Composes several transforms together.

RandomApply(transforms[, p])

Apply randomly a list of transformations with a given probability.

RandomChoice(transforms[, p])

Apply single transformation randomly picked from a list.

RandomOrder(transforms)

Apply a list of transformations in a random order.

Miscellaneous

LinearTransformation(transformation_matrix, ...)

Transform a tensor image with a square transformation matrix and a mean_vector computed offline.

Normalize(mean, std[, inplace])

Normalize a tensor image with mean and standard deviation.

RandomErasing([p, scale, ratio, value, inplace])

Randomly selects a rectangle region in a torch.Tensor image and erases its pixels.

Lambda(lambd)

Apply a user-defined lambda as a transform.

Conversion

Note

Beware, some of these conversion transforms below will scale the values while performing the conversion, while some may not do any scaling. By scaling, we mean e.g. that a uint8 -> float32 would map the [0, 255] range into [0, 1] (and vice-versa). See Dtype and expected value range.

ToPILImage([mode])

Convert a tensor or an ndarray to PIL Image.

ToTensor()

Convert a PIL Image or ndarray to tensor and scale the values accordingly.

PILToTensor()

Convert a PIL Image to a tensor of the same type - this does not scale values.

ConvertImageDtype(dtype)

Convert a tensor image to the given dtype and scale the values accordingly.

Auto-Augmentation

AutoAugment is a common Data Augmentation technique that can improve the accuracy of Image Classification models. Though the data augmentation policies are directly linked to their trained dataset, empirical studies show that ImageNet policies provide significant improvements when applied to other datasets. In TorchVision we implemented 3 policies learned on the following datasets: ImageNet, CIFAR10 and SVHN. The new transform can be used standalone or mixed-and-matched with existing transforms:

AutoAugmentPolicy(value)

AutoAugment policies learned on different datasets.

AutoAugment([policy, interpolation, fill])

AutoAugment data augmentation method based on "AutoAugment: Learning Augmentation Strategies from Data".

RandAugment([num_ops, magnitude, ...])

RandAugment data augmentation method based on "RandAugment: Practical automated data augmentation with a reduced search space".

TrivialAugmentWide([num_magnitude_bins, ...])

Dataset-independent data-augmentation with TrivialAugment Wide, as described in "TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation".

AugMix([severity, mixture_width, ...])

AugMix data augmentation method based on "AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty".

Functional Transforms

adjust_brightness(img, brightness_factor)

Adjust brightness of an image.

adjust_contrast(img, contrast_factor)

Adjust contrast of an image.

adjust_gamma(img, gamma[, gain])

Perform gamma correction on an image.

adjust_hue(img, hue_factor)

Adjust hue of an image.

adjust_saturation(img, saturation_factor)

Adjust color saturation of an image.

adjust_sharpness(img, sharpness_factor)

Adjust the sharpness of an image.

affine(img, angle, translate, scale, shear)

Apply affine transformation on the image keeping image center invariant.

autocontrast(img)

Maximize contrast of an image by remapping its pixels per channel so that the lowest becomes black and the lightest becomes white.

center_crop(img, output_size)

Crops the given image at the center.

convert_image_dtype(image[, dtype])

Convert a tensor image to the given dtype and scale the values accordingly. This function does not support PIL Image.

crop(img, top, left, height, width)

Crop the given image at specified location and output size.

equalize(img)

Equalize the histogram of an image by applying a non-linear mapping to the input in order to create a uniform distribution of grayscale values in the output.

erase(img, i, j, h, w, v[, inplace])

Erase the input Tensor Image with given value.

five_crop(img, size)

Crop the given image into four corners and the central crop.

gaussian_blur(img, kernel_size[, sigma])

Performs Gaussian blurring on the image by given kernel.

get_dimensions(img)

Returns the dimensions of an image as [channels, height, width].

get_image_num_channels(img)

Returns the number of channels of an image.

get_image_size(img)

Returns the size of an image as [width, height].

hflip(img)

Horizontally flip the given image.

invert(img)

Invert the colors of an RGB/grayscale image.

normalize(tensor, mean, std[, inplace])

Normalize a float tensor image with mean and standard deviation.

pad(img, padding[, fill, padding_mode])

Pad the given image on all sides with the given "pad" value.

perspective(img, startpoints, endpoints[, ...])

Perform perspective transform of the given image.

pil_to_tensor(pic)

Convert a PIL Image to a tensor of the same type.

posterize(img, bits)

Posterize an image by reducing the number of bits for each color channel.

resize(img, size[, interpolation, max_size, ...])

Resize the input image to the given size.

resized_crop(img, top, left, height, width, size)

Crop the given image and resize it to desired size.

rgb_to_grayscale(img[, num_output_channels])

Convert RGB image to grayscale version of image.

rotate(img, angle[, interpolation, expand, ...])

Rotate the image by angle.

solarize(img, threshold)

Solarize an RGB/grayscale image by inverting all pixel values above a threshold.

ten_crop(img, size[, vertical_flip])

Generate ten cropped images from the given image.

to_grayscale(img[, num_output_channels])

Convert PIL image of any mode (RGB, HSV, LAB, etc) to grayscale version of image.

to_pil_image(pic[, mode])

Convert a tensor or an ndarray to PIL Image.

to_tensor(pic)

Convert a PIL Image or numpy.ndarray to tensor.

vflip(img)

Vertically flip the given image.
