clip_vision_encoder
- torchtune.models.clip.clip_vision_encoder(tile_size: int, patch_size: int, embed_dim: int, num_layers: int, num_heads: int, activation: Callable = torch.nn.SiLU, cls_output_dim: int = 512, attn_bias: bool = True, out_indices: Optional[List[int]] = None, output_cls_projection: bool = False, max_num_tiles: int = 4, in_channels: int = 3, intermediate_act: torch.nn.Module = SiLU()) → VisionTransformer [source]
Builds the vision encoder associated with the CLIP model. This includes:
- TransformerEncoderLayer
- positional embeddings
- CLS projection (optional)
For details, please check the documentation of torchtune.modules.vision_transformer.VisionTransformer.
- Parameters:
tile_size (int) – The size of your image tiles, if the image was tile-cropped in advance. Otherwise, the size of the input image. In this case, the function will consider your image as a single tile.
patch_size (int) – The size of each patch. Used to divide the tiles into patches. E.g. for patch_size=40, a tile of shape (400, 400) will have a 10x10 grid of patches with shape (40, 40) each.
embed_dim (int) – The dimensionality of each patch embedding (token).
num_layers (int) – The number of transformer layers.
num_heads (int) – The number of attention heads in each transformer layer.
activation (Callable) – The activation function to use in the MLP layer.
cls_output_dim (int) – The dimensionality of the output tensor from the CLS projection module.
attn_bias (bool) – Whether to use bias in the attention module. Default: True.
out_indices (Optional[List[int]]) – The indices of hidden layers to return. If provided, the intermediate results of the transformer layers are returned before they go through the next layer. For example, out_indices=[0, 3] will return the tokens before they go through the first and fourth layers.
output_cls_projection (bool) – If True, only the CLS token projection will be outputted, instead of all tokens. Defaults to False.
max_num_tiles (int) – The maximum number of tiles that can be processed. This is used to determine the size of the positional embeddings.
in_channels (int) – The number of image input channels.
intermediate_act (torch.nn.Module) – The activation function used in the intermediate layers in the transformer encoder.
- Returns:
A VisionTransformer object.
- Raises:
AssertionError – If embed_dim is not divisible by num_heads.
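A minimal usage sketch. The hyperparameters below are illustrative (roughly ViT-B/16-sized), not defaults of this builder, and the input layout (bsz, n_imgs, n_tiles, in_channels, tile_size, tile_size) is assumed from torchtune.modules.vision_transformer.VisionTransformer:
- Example:
>>> import torch
>>> from torchtune.models.clip import clip_vision_encoder
>>> encoder = clip_vision_encoder(
...     tile_size=224,       # image is not tile-cropped, so this is the full image size
...     patch_size=16,       # (224 / 16) ** 2 = 196 patches per tile
...     embed_dim=768,
...     num_layers=12,
...     num_heads=12,        # embed_dim must be divisible by num_heads
...     out_indices=[0, 3],  # also return hidden states before the first and fourth layers
... )
>>> # assumed layout: (bsz, n_imgs, n_tiles, in_channels, tile_size, tile_size);
>>> # here a single untiled image is treated as one tile
>>> images = torch.randn(1, 1, 1, 3, 224, 224)
>>> out = encoder(images)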