llama3_2_vision_decoder

torchtune.models.llama3_2_vision.llama3_2_vision_decoder(*, vocab_size: int, num_layers: int, fusion_interval: int, num_special_tokens: int, num_heads: int, num_kv_heads: int, embed_dim: int, max_seq_len: int, encoder_max_seq_len: int, rope_base: int = 500000.0, intermediate_dim: Optional[int] = None) → TransformerDecoder

Build the decoder associated with the Llama3 model with additional fused cross attention layers. This includes:

- Token embeddings
- num_layers number of CausalSelfAttention blocks
- Fused cross attention layers every fusion_interval number of layers
- RMS Norm layer applied to the output of the transformer
- Final projection into token space
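The placement of the fused cross attention layers can be illustrated with a small standalone sketch. It does not call torchtune. It assumes layers are counted from 1 and that every fusion_interval-th layer carries a fused cross attention layer, which is one reading of the description above. The helper name fusion_layer_positions is hypothetical and is not part of the library.

```python
from typing import List


def fusion_layer_positions(num_layers: int, fusion_interval: int) -> List[int]:
    """Return the 1-based layer indices assumed to carry a fused cross attention layer.

    Assumption (not confirmed by this page): layers are numbered from 1 and a
    fusion layer sits at every index divisible by fusion_interval.
    """
    if num_layers <= 0 or fusion_interval <= 0:
        raise ValueError("num_layers and fusion_interval must be positive")
    return [idx for idx in range(1, num_layers + 1) if idx % fusion_interval == 0]


# Illustrative values only: an 8-layer decoder with a fusion layer every 4 layers.
positions = fusion_layer_positions(num_layers=8, fusion_interval=4)
print(positions)  # [4, 8]
```

Under this assumption, a fusion_interval larger than num_layers yields no fused layers at all, so the two arguments should be chosen together.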

Parameters:

vocab_size (int) – number of tokens in vocabulary.
num_layers (int) – number of layers in the transformer decoder.
fusion_interval (int) – interval, in number of layers, between fusion layers.
num_special_tokens (int) – number of special tokens added for the fusion model.
num_heads (int) – number of query heads. For MHA this is also the number of heads for key and value.
num_kv_heads (int) – number of key and value heads. User should ensure num_heads % num_kv_heads == 0. For standard MHA set num_kv_heads == num_heads, for GQA num_kv_heads < num_heads, and for MQA set num_kv_heads == 1.
embed_dim (int) – embedding dimension for self-attention.
max_seq_len (int) – maximum sequence length the model will be run with, as used by KVCache().
encoder_max_seq_len (int) – maximum sequence length the encoder will be run with, as used by KVCache().
rope_base (int) – base for the rotary positional embeddings. Default: 500000.0
intermediate_dim (Optional[int]) – intermediate dimension for MLP. If not specified, this is computed using scale_hidden_dim_for_mlp().
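
Because the builder leaves the num_heads % num_kv_heads == 0 check to the caller, it can help to validate the head configuration before constructing the model. The sketch below is a standalone helper, not a torchtune function. The name attention_variant is hypothetical, and the MHA/GQA/MQA labels follow the num_kv_heads description above.

```python
def attention_variant(num_heads: int, num_kv_heads: int) -> str:
    """Validate a head configuration and name the attention variant it implies."""
    if num_heads <= 0 or num_kv_heads <= 0:
        raise ValueError("num_heads and num_kv_heads must be positive")
    if num_heads % num_kv_heads != 0:
        raise ValueError(
            f"num_heads ({num_heads}) must be divisible by num_kv_heads ({num_kv_heads})"
        )
    if num_kv_heads == num_heads:
        return "MHA"  # every query head has its own key/value head
    if num_kv_heads == 1:
        return "MQA"  # all query heads share a single key/value head
    return "GQA"  # groups of query heads share a key/value head


# Illustrative values only: 32 query heads sharing 8 key/value heads.
print(attention_variant(num_heads=32, num_kv_heads=8))  # GQA
```

In the GQA case, each key/value head serves num_heads // num_kv_heads query heads, so the example above has 4 query heads per key/value head.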

Returns:

Instantiation of Llama 3.2 vision decoder.

Return type:

TransformerDecoder
