TiledTokenPositionalEmbedding¶
- class torchtune.models.clip.TiledTokenPositionalEmbedding(max_num_tiles: int, embed_dim: int, tile_size: int, patch_size: int)[source]¶
Token positional embedding for tiled images, different for every tile, different for every token.
There are two positional embeddings in this module:
- local_token_positional_embedding: same for every tile, different for every token. Equivalent to torchtune.models.clip._position_embeddings.TokenPositionalEmbedding, but gated.
- global_token_positional_embedding: different for every tile, different for every token.
Notice that tile is different from patch (token). For details, please check the documentation of torchtune.modules.vision_transformer.VisionTransformer.
- Parameters:
max_num_tiles (int) – The maximum number of tiles an image can be divided into.
embed_dim (int) – The dimensionality of each token embedding.
tile_size (int) – The size of your image tiles, if the image was tile-cropped in advance. Otherwise, the size of the input image. In this case, the function will consider your image as a single tile.
patch_size (int) – The size of each patch. Used to divide the tiles into patches. E.g. for patch_size=40, a tile of shape (400, 400) will have a 10x10 grid of patches with shape (40, 40) each.
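For concreteness, here is a minimal construction sketch; the sizes below are illustrative assumptions, not library defaults:

```python
from torchtune.models.clip import TiledTokenPositionalEmbedding

# Illustrative sizes (assumptions, not defaults): images are pre-cropped into
# at most 4 tiles of 224x224 pixels, patchified into 14x14 patches, with each
# token embedded in 512 dimensions.
pos_emb = TiledTokenPositionalEmbedding(
    max_num_tiles=4,
    embed_dim=512,
    tile_size=224,
    patch_size=14,
)
```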
- forward(x: Tensor, aspect_ratio: Tensor) → Tensor[source]¶
- Parameters:
x (torch.Tensor) – torch.Tensor with shape (bsz * n_imgs, n_tiles, n_tokens_per_tile, embed_dim).
aspect_ratio (torch.Tensor) – torch.Tensor with shape (bsz * n_imgs, 2), where aspect_ratio[k] represents the aspect ratio of the k^th image of the batch before tile-cropping, e.g. aspect_ratio[k] = (2,1).
- Returns:
The input tensor with added positional embeddings.
- Return type:
torch.Tensor
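A minimal forward-pass sketch using the same illustrative sizes; the n_tokens_per_tile count below assumes one extra CLS token per tile in addition to the patch tokens, which is an assumption about the surrounding vision transformer rather than something this module enforces:

```python
import torch

from torchtune.models.clip import TiledTokenPositionalEmbedding

pos_emb = TiledTokenPositionalEmbedding(
    max_num_tiles=4, embed_dim=512, tile_size=224, patch_size=14
)

# (224 // 14) ** 2 = 256 patch tokens per tile, plus 1 CLS token (assumed).
n_tokens_per_tile = (224 // 14) ** 2 + 1

# Two images, each tile-cropped into a 2x1 grid (2 tiles) before patching,
# so x has shape (bsz * n_imgs, n_tiles, n_tokens_per_tile, embed_dim).
x = torch.randn(2, 2, n_tokens_per_tile, 512)
aspect_ratio = torch.tensor([[2, 1], [2, 1]])  # per-image tile grid before cropping

out = pos_emb(x, aspect_ratio)  # same shape as x, with positional embeddings added
```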