mistral_classifier

torchtune.models.mistral.mistral_classifier(num_classes: int, *, vocab_size: int, num_layers: int, num_heads: int, num_kv_heads: int, embed_dim: int, intermediate_dim: int, max_seq_len: int, attn_dropout: float = 0.0, norm_eps: float = 1e-05, rope_base: int = 10000) → TransformerDecoder

Build a base Mistral model with an added classification layer. See mistral() for details on the base Mistral model.

Parameters:

num_classes (int) – number of classes for the classification layer.
vocab_size (int) – number of tokens in vocabulary.
num_layers (int) – number of layers in the transformer decoder.
num_heads (int) – number of query heads. For MHA this is also the number of heads for key and value.
num_kv_heads (int) – number of key and value heads. User should ensure num_heads % num_kv_heads == 0. For standard MHA set num_kv_heads == num_heads, for GQA num_kv_heads < num_heads, and for MQA set num_kv_heads == 1.
embed_dim (int) – embedding dimension for self-attention.
intermediate_dim (int) – intermediate dimension for the MLP.
max_seq_len (int) – maximum sequence length the model will be run with.
attn_dropout (float) – dropout value passed on to scaled_dot_product_attention. Default: 0.0
norm_eps (float) – epsilon in RMS norms. Default: 1e-5
rope_base (int) – base for the rotary positional embeddings. Default: 10_000

Returns:

Instantiation of mistral classification model.

Return type:

TransformerDecoder
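
Example:

The relationship between num_heads and num_kv_heads described above can be checked before building the model. The following is a minimal sketch, not part of torchtune: the helper attention_variant and the values in config are hypothetical and chosen only for illustration. The commented-out call at the end assumes torchtune is installed and uses only the keyword arguments listed in the signature above.

```python
def attention_variant(num_heads: int, num_kv_heads: int) -> str:
    """Classify a head configuration as MHA, GQA, or MQA.

    Hypothetical helper (not a torchtune function) that applies the
    rules stated for the num_kv_heads parameter.
    """
    if num_heads <= 0 or num_kv_heads <= 0:
        raise ValueError("num_heads and num_kv_heads must be positive")
    if num_heads % num_kv_heads != 0:
        raise ValueError(
            f"num_heads ({num_heads}) must be divisible by "
            f"num_kv_heads ({num_kv_heads})"
        )
    if num_kv_heads == num_heads:
        # Also covers the degenerate single-head case (1, 1).
        return "MHA"
    if num_kv_heads == 1:
        return "MQA"
    return "GQA"


# Illustrative values only; they are deliberately small and do not
# correspond to any released checkpoint.
config = dict(
    num_classes=2,
    vocab_size=1000,
    num_layers=2,
    num_heads=8,
    num_kv_heads=2,
    embed_dim=64,
    intermediate_dim=128,
    max_seq_len=256,
)

variant = attention_variant(config["num_heads"], config["num_kv_heads"])
print(variant)

# With torchtune installed, the validated config could then be passed on:
# from torchtune.models.mistral import mistral_classifier
# model = mistral_classifier(**config)
```

Validating the head counts up front surfaces a bad num_heads / num_kv_heads combination as a clear ValueError instead of a shape error later during the forward pass.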
