TokenPositionalEmbedding

class torchtune.models.clip.TokenPositionalEmbedding(embed_dim: int, tile_size: int, patch_size: int)[source]

Token positional embedding for images: a different embedding for every token (patch) in an image.

Note that a tile is different from a patch (token). For details, see the documentation of torchtune.modules.vision_transformer.VisionTransformer.

Parameters:
  • embed_dim (int) – The dimensionality of each token embedding.

  • tile_size (int) – The size of your image tiles, if the image was tile-cropped in advance; otherwise, the size of the full input image, which is then treated as a single tile.

  • patch_size (int) – The size of each patch. Used to divide the tiles into patches. E.g. for patch_size=40, a tile of shape (400, 400) will have a 10x10 grid of patches, each of shape (40, 40); see the sketch after this list.

forward(x: Tensor, *args: Tuple[Any]) → Tensor[source]
Parameters:
  • x (torch.Tensor) – Input tensor with shape (…, n_tokens_per_tile, embed_dim).

  • *args (Tuple[Any]) – Optional args.

Returns:
  The input tensor with added positional embeddings.

Return type:
  torch.Tensor
