lora_llama3_2_vision_encoder

torchtune.models.llama3_2_vision.lora_llama3_2_vision_encoder(encoder_lora: bool, fusion_lora: bool, lora_attn_modules: List[Literal['q_proj', 'k_proj', 'v_proj', 'output_proj']], apply_lora_to_mlp: bool = False, apply_lora_to_output: bool = False, *, patch_size: int, num_heads: int, clip_embed_dim: int, clip_num_layers: int, clip_hidden_states: Optional[List[int]], num_layers_projection: int, decoder_embed_dim: int, tile_size: int, max_num_tiles: int = 4, in_channels: int = 3, lora_rank: int = 8, lora_alpha: float = 16, lora_dropout: float = 0.0, use_dora: bool = False, quantize_base: bool = False, **quantization_kwargs) → Llama3VisionEncoder

Build the Llama 3.2 vision encoder by combining the CLIP image model with an additional projection head fusion module. This includes:

- Spatial positional encodings
- CLIP model backbone
- Projection head on top of CLIP
- Final projection into token embedding dimension

Parameters:

encoder_lora (bool) – whether to apply LoRA to the CLIP encoder
fusion_lora (bool) – whether to apply LoRA to the projection head
lora_attn_modules (List[LORA_ATTN_MODULES]) – list of which linear layers LoRA should be applied to in each self-attention block. Options are {"q_proj", "k_proj", "v_proj", "output_proj"}.
apply_lora_to_mlp (bool) – whether to apply LoRA to the MLP in each transformer layer. Default: False
apply_lora_to_output (bool) – whether to apply LoRA to the model’s decoder and encoder output projection. Default: False
patch_size (int) – The size of each patch. Used to divide the tiles into patches. E.g. for patch_size=40, a tile of shape (400, 400) will have 10x10 grid of patches with shape (40, 40) each.
num_heads (int) – The number of attention heads in each transformer layer.
clip_embed_dim (int) – The dimensionality of each patch embedding in CLIP.
clip_num_layers (int) – The number of transformer layers in the CLIP model.
clip_hidden_states (Optional[List[int]]) – The indices of CLIP hidden layers to return to the encoder projection head. The intermediate results of these vision transformer layers are concatenated with the CLIP output and passed into the projection head. For example, clip_hidden_states=[0,3] will return the embeddings before they go through the first and fourth layers.
num_layers_projection (int) – The number of transformer layers in the projection head.
decoder_embed_dim (int) – The dimensionality of the final output embeddings for the decoder.
tile_size (int) – The size of your image tiles, if the image was tile-cropped in advance. Otherwise, the size of the input image. In this case, the function will consider your image as a single tile.
max_num_tiles (int) – The maximum number of tiles that can be processed. This is used to determine the size of the positional embeddings.
in_channels (int) – The number of image input channels.
lora_rank (int) – rank of each low-rank approximation. Default: 8
lora_alpha (float) – scaling factor for the low-rank approximation. Default: 16
lora_dropout (float) – LoRA dropout probability. Default: 0.0
use_dora (bool) – Whether to use DoRA layers instead of LoRA layers. Default is False.
quantize_base (bool) – Whether to quantize base model weights or not. Only applied to base weights within the linear layers that LoRA is applied to. Quantizing the final output linear projection is not currently supported. Default: False
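
Some of the parameters above imply simple arithmetic that is easy to check by hand. The sketch below is illustrative only: `patch_grid` and `lora_scale` are hypothetical helpers, not part of torchtune. It assumes square tiles whose side is an exact multiple of `patch_size`, and it uses the conventional LoRA scaling of `lora_alpha / lora_rank`.

```python
def patch_grid(tile_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, patches per tile) for a square tile.

    Hypothetical helper: assumes tile_size is an exact multiple of patch_size.
    """
    if tile_size % patch_size != 0:
        raise ValueError("tile_size must be divisible by patch_size")
    per_side = tile_size // patch_size
    return per_side, per_side * per_side


def lora_scale(lora_alpha: float, lora_rank: int) -> float:
    """Conventional LoRA scaling applied to the low-rank update: alpha / rank."""
    return lora_alpha / lora_rank


# The example from the patch_size description: a (400, 400) tile with patch_size=40.
print(patch_grid(tile_size=400, patch_size=40))  # (10, 100)

# The defaults from the signature: lora_alpha=16, lora_rank=8.
print(lora_scale(16, 8))  # 2.0
```

With the defaults in the signature, the low-rank update is therefore scaled by 2.0 under this convention.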

Returns:

Instantiation of Llama 3.2 vision encoder.

Return type:

Llama3VisionEncoder
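
A call to this builder might look like the following minimal sketch. The keyword names come from the signature above, but the values in `toy_config` are small, hypothetical numbers chosen only for illustration and are not the settings of any released Llama 3.2 checkpoint. `build_toy_encoder` is likewise a hypothetical wrapper, and it imports torchtune lazily so the configuration can be inspected without the library installed.

```python
# Hypothetical toy configuration: illustrative values, not a released model's settings.
toy_config = dict(
    encoder_lora=True,                      # apply LoRA to the CLIP encoder
    fusion_lora=True,                       # apply LoRA to the projection head
    lora_attn_modules=["q_proj", "v_proj"],
    apply_lora_to_mlp=False,
    apply_lora_to_output=False,
    patch_size=14,
    num_heads=4,
    clip_embed_dim=64,
    clip_num_layers=2,
    clip_hidden_states=[0],                 # also pass the input to the first layer onward
    num_layers_projection=1,
    decoder_embed_dim=128,
    tile_size=28,                           # 2x2 grid of 14x14 patches per tile
    max_num_tiles=4,
    in_channels=3,
    lora_rank=8,
    lora_alpha=16,
    lora_dropout=0.0,
)


def build_toy_encoder(config=toy_config):
    """Build the encoder from a keyword dictionary (requires torchtune)."""
    # Imported here so the module loads even when torchtune is absent.
    from torchtune.models.llama3_2_vision import lora_llama3_2_vision_encoder

    return lora_llama3_2_vision_encoder(**config)
```

Calling `build_toy_encoder()` should return a `Llama3VisionEncoder`, per the return type documented above. Keeping the arguments in a dictionary makes it easy to toggle `encoder_lora` and `fusion_lora` independently when comparing fine-tuning setups.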
