clip_vision_encoder

torchtune.models.clip.clip_vision_encoder(tile_size: int, patch_size: int, embed_dim: int, num_layers: int, num_heads: int, activation: Callable = torch.nn.SiLU, cls_output_dim: int = 512, attn_bias: bool = True, out_indices: Optional[List[int]] = None, output_cls_projection: bool = False, max_num_tiles: int = 4, in_channels: int = 3, intermediate_act: torch.nn.Module = SiLU()) → VisionTransformer

Builds the vision encoder associated with the CLIP model. This includes:

  • TransformerEncoderLayer

  • positional embeddings

  • CLS projection (optional)

For details, please check the documentation of torchtune.modules.vision_transformer.VisionTransformer.

Parameters:
  • tile_size (int) – The size of your image tiles if the image was tile-cropped in advance; otherwise, the size of the input image, which is then treated as a single tile.

  • patch_size (int) – The size of each patch, used to divide the tiles into patches. E.g. for patch_size=40, a tile of shape (400, 400) will have a 10x10 grid of patches, each of shape (40, 40) (see the sketch after this parameter list).

  • embed_dim (int) – The dimensionality of each patch embedding (token).

  • num_layers (int) – The number of transformer layers.

  • num_heads (int) – The number of attention heads in each transformer layer.

  • activation (Callable) – The activation function to use in the MLP layer.

  • cls_output_dim (int) – The dimensionality of the output tensor from the CLS projection module.

  • attn_bias (bool) – Whether to use a bias term in the attention module. Default: True.

  • out_indices (Optional[List[int]]) – The indices of hidden layers whose outputs should be returned. If provided, the intermediate token states are returned as captured before the corresponding layers are applied. For example, out_indices=[0,3] returns the tokens before they pass through the first and fourth layers.

  • output_cls_projection (bool) – If True, only the CLS token projection is output instead of all tokens. Default: False.

  • max_num_tiles (int) – The maximum number of tiles that can be processed. This is used to determine the size of the positional embeddings.

  • in_channels (int) – The number of image input channels.

  • intermediate_act (torch.nn.Module) – The activation function used in the intermediate layers in the transformer encoder.
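
As referenced in the patch_size description, here is a minimal instantiation sketch. The hyperparameter values are illustrative only, not the defaults used by torchtune's CLIP builders:

    from torchtune.models.clip import clip_vision_encoder

    # Illustrative values: a 400x400 tile split into 40x40 patches gives a
    # (400 // 40) x (400 // 40) = 10x10 grid, i.e. 100 patch tokens per tile.
    encoder = clip_vision_encoder(
        tile_size=400,
        patch_size=40,
        embed_dim=512,       # must be divisible by num_heads (see Raises below)
        num_layers=4,
        num_heads=8,
        max_num_tiles=4,
        out_indices=[0, 3],  # also capture tokens entering the 1st and 4th layers
    )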

Returns:

A VisionTransformer object.

Raises:

AssertionError – If embed_dim is not divisible by num_heads.
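
A forward-pass sketch, hedged: it assumes the returned VisionTransformer accepts pre-tiled images of shape (batch, n_tiles, in_channels, tile_size, tile_size) and returns the patch-token embeddings together with the hidden states requested via out_indices. Check torchtune.modules.vision_transformer.VisionTransformer for the exact forward signature before relying on these shapes.

    import torch

    # Hypothetical input: a batch of 2 images, each pre-cropped into 4 RGB tiles
    # of 400x400, matching the encoder built above. The shapes here are assumptions.
    images = torch.randn(2, 4, 3, 400, 400)
    tokens, hidden_states = encoder(images)
    # tokens: patch-token embeddings of dimensionality embed_dim
    # hidden_states: intermediate token states for out_indices=[0, 3]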
