gemma
- torchtune.models.gemma.gemma(vocab_size: int, num_layers: int, num_heads: int, head_dim: int, num_kv_heads: int, embed_dim: int, intermediate_dim: int, max_seq_len: int, attn_dropout: float = 0.0, norm_eps: float = 1e-06, rope_base: int = 10000, norm_embeddings: bool = True) -> GemmaTransformerDecoder
Build the decoder associated with the Gemma model. This includes:
- Token embeddings
- num_layers TransformerDecoderLayer blocks
- An RMS norm layer applied to the output of the transformer
- A final projection into token space
This does NOT currently include inference-time optimizations such as sliding-window attention.
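A minimal sketch of calling the builder, assuming torchtune is installed. The sizes below are illustrative values chosen so the model is cheap to construct, not a released Gemma configuration. `TINY_GEMMA_KWARGS` and `build_tiny_gemma` are hypothetical names introduced for this example.

```python
# Illustrative (hypothetical) sizes, not a released Gemma configuration.
TINY_GEMMA_KWARGS = dict(
    vocab_size=1_000,
    num_layers=2,
    num_heads=4,
    head_dim=16,
    num_kv_heads=1,       # fewer key/value heads than query heads
    embed_dim=64,
    intermediate_dim=128,
    max_seq_len=128,
)


def build_tiny_gemma():
    """Construct a small decoder; torchtune is imported lazily here."""
    from torchtune.models.gemma import gemma

    # Optional arguments (attn_dropout, norm_eps, rope_base,
    # norm_embeddings) are left at their documented defaults.
    return gemma(**TINY_GEMMA_KWARGS)
```

Calling `build_tiny_gemma()` should return a GemmaTransformerDecoder, as stated in the signature above.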
- Parameters:
vocab_size (int) – number of tokens in vocabulary.
num_layers (int) – number of layers in the transformer decoder.
num_heads (int) – number of query heads. For MHA this is also the number of heads for key and value.
head_dim (int) – dimension of each attention head.
num_kv_heads (int) – number of key and value heads.
embed_dim (int) – embedding dimension for self-attention.
intermediate_dim (int) – intermediate dimension for MLP.
max_seq_len (int) – maximum sequence length the model will be run with.
attn_dropout (float) – dropout value passed to scaled_dot_product_attention. Default: 0.0
norm_eps (float) – epsilon in RMS norms. Default: 1e-6
rope_base (int) – base for the rotary positional embeddings. Default: 10_000
norm_embeddings (bool) – whether to normalize the token embeddings before they are passed to the decoder layers. Default: True
- Returns:
Instantiation of gemma model.
- Return type:
GemmaTransformerDecoder
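To make the relationship between num_heads, num_kv_heads, and head_dim concrete, here is a small pure-Python sketch. `attention_layout` is a hypothetical helper, not part of torchtune. Its divisibility check reflects the usual grouped-query convention that each key/value head serves an equal number of query heads; the builder's own validation may differ.

```python
def attention_layout(num_heads: int, num_kv_heads: int, head_dim: int) -> dict:
    """Summarize projection widths implied by the head arguments."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")

    if num_kv_heads == num_heads:
        kind = "MHA"   # every query head has its own key/value head
    elif num_kv_heads == 1:
        kind = "MQA"   # all query heads share one key/value head
    else:
        kind = "GQA"   # query heads share key/value heads in groups

    return {
        "kind": kind,
        "q_dim": num_heads * head_dim,       # width of the query projection
        "kv_dim": num_kv_heads * head_dim,   # width of each key/value projection
        "queries_per_kv_head": num_heads // num_kv_heads,
    }


# Example values chosen for illustration only.
print(attention_layout(num_heads=8, num_kv_heads=1, head_dim=256))
```

Note that the query projection width is num_heads * head_dim, which is passed separately from embed_dim in this builder's signature.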