clip_vision_encoder

torchtune.models.clip.clip_vision_encoder(tile_size: int, patch_size: int, embed_dim: int, num_layers: int, num_heads: int, activation: Callable = torch.nn.SiLU, cls_output_dim: int = 512, attn_bias: bool = True, use_rope: bool = False, out_indices: Optional[List[int]] = None, output_cls_projection: bool = False, max_num_tiles: int = 4, in_channels: int = 3, append_cls_token: bool = False) → VisionTransformer [source]

Builds the vision encoder associated with the CLIP model. This includes:

  • TransformerEncoderLayer

  • positional embeddings

  • CLS projection (optional)

For details, please check the documentation of torchtune.modules.vision_transformer.VisionTransformer.

Parameters:
  • tile_size (int) – The size of each image tile, if the image was tile-cropped in advance. Otherwise, the size of the full input image, in which case the image is treated as a single tile.

  • patch_size (int) – The size of each patch. Used to divide the tiles into patches. E.g. for patch_size=40, a tile of shape (400, 400) will have a 10x10 grid of patches, each of shape (40, 40). See the sketch after this parameter list for the resulting token counts.

  • embed_dim (int) – The dimensionality of each patch embedding (token).

  • num_layers (int) – The number of transformer layers.

  • num_heads (int) – The number of attention heads in each transformer layer.

  • activation (Callable) – The activation function to use in the MLP layer.

  • cls_output_dim (int) – The dimensionality of the output tensor from the CLS projection module.

  • attn_bias (bool) – Whether to use bias in the attention module. Default: True.

  • use_rope (bool) – If True, applies 2D rotary positional embeddings (RoPE) in the attention of each transformer layer. Default: False.

  • out_indices (Optional[List[int]]) – The indices of hidden layers whose intermediate results to return. If provided, these are captured before the tokens pass through the indexed layer. For example, out_indices=[0, 3] returns the tokens as they enter the first and fourth layers.

  • output_cls_projection (bool) – If True, only the CLS token projection is returned, instead of all tokens. Default: False.

  • max_num_tiles (int) – The maximum number of tiles that can be processed. This is used to determine the size of the positional embeddings.

  • in_channels (int) – The number of image input channels.

  • append_cls_token (bool) – If True, appends the CLS token embedding to the end of the sequence in the vision transformer. Default: False, which prepends the CLS token to the beginning of the sequence.
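
To make the tile/patch token arithmetic concrete, here is a small sketch reusing the patch_size example above (the +1 assumes the default CLS token added to each tile's sequence):

    tile_size, patch_size = 400, 40
    patches_per_side = tile_size // patch_size   # 10
    patches_per_tile = patches_per_side ** 2     # 100 patch tokens per tile
    tokens_per_tile = patches_per_tile + 1       # plus the CLS token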

Returns:

A VisionTransformer object.

Raises:

AssertionError – If embed_dim is not divisible by num_heads.
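
Example (a minimal usage sketch; the hyperparameter values are illustrative rather than a released CLIP configuration, and the input layout (bsz, n_imgs, n_tiles, channels, tile_size, tile_size) plus the (output, hidden_states) return pair are assumptions based on the VisionTransformer documentation referenced above):

    import torch
    from torchtune.models.clip import clip_vision_encoder

    encoder = clip_vision_encoder(
        tile_size=224,
        patch_size=14,
        embed_dim=512,       # must be divisible by num_heads
        num_layers=4,        # kept small for illustration
        num_heads=8,
        out_indices=[0, 3],  # also capture tokens entering the 1st and 4th layers
    )

    # One untiled 224x224 RGB image, treated as a single tile.
    images = torch.randn(1, 1, 1, 3, 224, 224)
    tokens, hidden_states = encoder(images)
    # 16x16 = 256 patch tokens per tile, plus one CLS token -> 257 tokens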
