Llama3VisionEncoder

class torchtune.models.llama3_2_vision.Llama3VisionEncoder(clip: Module, projection_head: Module)[source]

Vision encoder model for Llama 3.2 Vision. It combines a pretrained vision encoder with a learnable projection head; the projection head is converted to a fusion module so that it supports the fusion utilities.

Parameters:
  • clip (nn.Module) – CLIP encoder vision model

  • projection_head (nn.Module) – Projection head that takes embeddings of dimension encoder_dim as input and outputs embeddings of size decoder_dim.

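Example: a minimal construction sketch. In practice this class is usually built through the companion builder llama3_2_vision_encoder documented in this module, rather than instantiated directly; all hyperparameter values below are illustrative (chosen small for demonstration), not the Llama 3.2 11B defaults.

    import torch
    from torchtune.models.llama3_2_vision import llama3_2_vision_encoder

    # Build a small Llama3VisionEncoder (CLIP backbone + learnable
    # projection head) via the companion builder.
    # All hyperparameter values are illustrative, not model defaults.
    encoder = llama3_2_vision_encoder(
        patch_size=14,              # each tile is split into 14x14 patches
        num_heads=8,                # attention heads in the CLIP backbone
        clip_embed_dim=512,         # encoder_dim: CLIP embedding size
        clip_num_layers=6,          # transformer layers in the CLIP backbone
        clip_hidden_states=[2, 4],  # hidden layers fed to the projection head
        num_layers_projection=2,    # layers in the projection head
        decoder_embed_dim=1024,     # decoder_dim: output embedding size
        tile_size=224,              # height/width of each image tile
        max_num_tiles=4,            # upper bound on tiles per image
    )
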
forward(images: Tensor, aspect_ratio: Optional[Tensor] = None) → Tensor[source]
Parameters:
  • images (torch.Tensor) – Image tensor with shape [b x i x t x c x w x h]

  • aspect_ratio (Optional[torch.Tensor]) – Tensor with shape [b x i x 2], used to calculate the positional embeddings for the tiles. If all images have a single tile (i.e. they were not tile-cropped), this should be None.

Returns:

output tensor of a sequence of embeddings [b x s x d], where the sequence length is num_imgs * num_tiles + num_embeds

Return type:

Tensor

Notation used for tensor shapes:
  • b: batch size

  • i: number of images

  • t: number of tiles (where a single image is broken into multiple tiles)

  • c: number of image channels (e.g. rgb = 3)

  • w: image width

  • h: image height

  • s: sequence length, computed as i * t * clip_embeds_per_tile

  • d: embed dim
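
Example: a forward-pass sketch using the notation above. The encoder and its hyperparameters repeat the illustrative construction example; the 2x2 tile grid and tensor shapes are likewise assumed for demonstration.

    import torch
    from torchtune.models.llama3_2_vision import llama3_2_vision_encoder

    # Same small illustrative encoder as in the construction example.
    encoder = llama3_2_vision_encoder(
        patch_size=14, num_heads=8, clip_embed_dim=512, clip_num_layers=6,
        clip_hidden_states=[2, 4], num_layers_projection=2,
        decoder_embed_dim=1024, tile_size=224, max_num_tiles=4,
    )

    # Shapes follow the notation: b=2, i=1, t=4, c=3, w=h=224 (tile_size).
    images = torch.randn(2, 1, 4, 3, 224, 224)

    # [b x i x 2]: one (tiles_h, tiles_w) pair per image; a 2x2 grid -> 4 tiles.
    aspect_ratio = torch.tensor([[[2, 2]], [[2, 2]]])

    out = encoder(images, aspect_ratio=aspect_ratio)
    print(out.shape)  # [b x s x d], with d = decoder_embed_dim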
