Llama3VisionEncoder¶
- class torchtune.models.llama3_2_vision.Llama3VisionEncoder(clip: Module, projection_head: Module)[source]¶
Vision encoder model for Llama 3.2 Vision. This combines a pretrained vision encoder with a learnable projection head. The projection head is converted to a fusion module and supports fusion utils.
- Parameters:
clip (nn.Module) – CLIP vision encoder model
projection_head (nn.Module) – projection head that takes embeddings of dimension encoder_dim as input and outputs embeddings of dimension decoder_dim.
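A minimal sketch of how the two constructor arguments are wired together. `ToyClip`, `ToyProjectionHead`, and the dimension values are hypothetical stand-ins used only to illustrate the documented contract; in practice both modules come from torchtune's Llama 3.2 Vision component builders (e.g. `llama3_2_vision_encoder`), which also guarantee that the CLIP output (including any intermediate hidden states) matches what the real projection head expects. The toys below are therefore only constructed, not run through `forward`.

```python
import torch
from torch import nn

from torchtune.models.llama3_2_vision import Llama3VisionEncoder

encoder_dim, decoder_dim = 1280, 4096  # illustrative sizes, not a real config


class ToyClip(nn.Module):
    """Hypothetical stand-in for the pretrained CLIP vision encoder."""

    def forward(self, images, aspect_ratio=None):
        b, i, t, c, w, h = images.shape
        embeds_per_tile = 4  # placeholder; the real encoder emits one embedding per patch (+ CLS) per tile
        return torch.randn(b, i * t * embeds_per_tile, encoder_dim)


class ToyProjectionHead(nn.Module):
    """Hypothetical stand-in for the learnable projection head (encoder_dim -> decoder_dim)."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, decoder_dim)

    def forward(self, x):
        return self.proj(x)


# Construction only: per the class docstring, the projection head is converted
# to a fusion module so it works with torchtune's fusion utils.
encoder = Llama3VisionEncoder(clip=ToyClip(), projection_head=ToyProjectionHead())
```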
- forward(images: Tensor, aspect_ratio: Optional[Tensor] = None) → Tensor [source]¶
- Parameters:
images (torch.Tensor) – Image tensor with shape [b x i x t x c x w x h]
aspect_ratio (Optional[torch.Tensor]) – Tensor with shape [b x i x 2]. If all images have a single tile, i.e. they were not tile-cropped, it should be None. Used to calculate the positional embeddings for the tiles.
- Returns:
- output tensor of a sequence of embeddings [b x s x d]
where the sequence length s is i * t * clip_embeds_per_tile (see the notation below)
- Return type:
Tensor
- Notation used for tensor shapes:
b: batch size
i: number of images
t: number of tiles (where a single image is broken into multiple tiles)
c: number of image channels (e.g. rgb = 3)
w: image width
h: image height
s: sequence length, computed as i * t * clip_embeds_per_tile
d: embed dim
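To make the notation concrete, here is a hedged, self-contained shape walkthrough. The dimension values, patch size, and the random stand-ins for the CLIP output and the projection head are illustrative assumptions, not the real Llama 3.2 Vision configuration or implementation; the sketch only traces how [b x i x t x c x w x h] inputs become a [b x s x d] sequence of embeddings.

```python
import torch
from torch import nn

# Illustrative values only (assumed for this sketch).
b, i, t, c, w, h = 2, 1, 4, 3, 448, 448            # batch, images, tiles, channels, width, height
patch_size = 14
clip_embeds_per_tile = (w // patch_size) * (h // patch_size) + 1  # patches per tile + CLS token
encoder_dim, decoder_dim = 1280, 4096

images = torch.randn(b, i, t, c, w, h)             # [b x i x t x c x w x h]
aspect_ratio = torch.tensor([[[2, 2]], [[2, 2]]])  # [b x i x 2]; pass None if images were not tile-cropped

# Stand-in for the CLIP output: one embedding per patch (+ CLS) per tile.
clip_out = torch.randn(b, i * t * clip_embeds_per_tile, encoder_dim)

# Stand-in for the projection head: maps encoder_dim embeddings to decoder_dim.
out = nn.Linear(encoder_dim, decoder_dim)(clip_out)

s = i * t * clip_embeds_per_tile
assert out.shape == (b, s, decoder_dim)            # [b x s x d]
```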