TilePositionalEmbedding¶
- class torchtune.models.clip.TilePositionalEmbedding(max_num_tiles: int, embed_dim: int)[source]¶
Positional embedding for tiles, different for every tile, same for every token within a tile.
Notice that tile is different from patch (token). For details, please check the documentation of
torchtune.modules.vision_transformer.VisionTransformer.
- Parameters:
max_num_tiles (int) – The maximum number of tiles an image can be divided into.
embed_dim (int) – The dimensionality of each tile embedding.
- forward(x: Tensor, aspect_ratio: Tensor) → Tensor[source]¶
- Parameters:
x (torch.Tensor) – torch.Tensor with shape (bsz * n_imgs, n_tiles, n_tokens_per_tile, embed_dim).
aspect_ratio (torch.Tensor) – torch.Tensor with shape (bsz * n_imgs, 2), representing the aspect ratio of the image before tile-cropping, e.g. (2,1).
- Returns:
The input tensor with added positional embeddings.
- Return type:
torch.Tensor
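To make the shape conventions above concrete, here is a minimal NumPy sketch of the idea (not the actual torchtune implementation, which uses a learnable parameter table and operates on torch tensors): each position in an n_tiles_h × n_tiles_w grid gets its own embedding vector, which is broadcast-added to every token within that tile, while padded tiles are left untouched. The function name and the embedding-table layout here are illustrative assumptions.

```python
import numpy as np

def tile_positional_embedding(x, aspect_ratio, embedding):
    """Schematic sketch of per-tile positional embeddings (NumPy).

    x:            (bsz * n_imgs, n_tiles, n_tokens_per_tile, embed_dim)
    aspect_ratio: (bsz * n_imgs, 2), e.g. (2, 1) for a 2x1 tile grid
    embedding:    (max_num_tiles, max_num_tiles, 1, embed_dim) table,
                  one vector per (row, col) tile position
    """
    out = x.copy()
    for i, (n_h, n_w) in enumerate(aspect_ratio):
        n_non_padded = n_h * n_w
        # Select embeddings for the n_h x n_w grid and flatten to
        # (n_h * n_w, 1, embed_dim); the singleton axis broadcasts the
        # same vector across every token within a tile.
        pos = embedding[:n_h, :n_w].reshape(n_non_padded, 1, -1)
        out[i, :n_non_padded] += pos
    return out

# Toy example: 1 image, up to 4 tiles, 2 tokens per tile, embed_dim 3.
rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 2, 1, 3))
x = np.zeros((1, 4, 2, 3))
ar = np.array([[2, 1]])  # image was tile-cropped into a 2x1 grid
y = tile_positional_embedding(x, ar, emb)
```

Note that only the first n_tiles_h * n_tiles_w tiles receive embeddings; the remaining (padding) tiles pass through unchanged, which is why the forward pass needs `aspect_ratio` in addition to `x`.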