lora_mistral_classifier
- torchtune.models.mistral.lora_mistral_classifier(lora_attn_modules: List[Literal['q_proj', 'k_proj', 'v_proj', 'output_proj']], apply_lora_to_mlp: bool = False, apply_lora_to_output: bool = False, *, num_classes: int, vocab_size: int, num_layers: int, num_heads: int, num_kv_heads: int, embed_dim: int, max_seq_len: int, intermediate_dim: int, attn_dropout: float = 0.0, norm_eps: float = 1e-05, rope_base: int = 10000, lora_rank: int, lora_alpha: float, lora_dropout: float = 0.0, use_dora: bool = False, quantize_base: bool = False) → TransformerDecoder
Return a version of the Mistral classifier (an instance of TransformerDecoder()) with LoRA applied to some of the linear layers in its self-attention modules.
- Parameters:
lora_attn_modules (List[LORA_ATTN_MODULES]) – list of which linear layers LoRA should be applied to in each self-attention block. Options are {"q_proj", "k_proj", "v_proj", "output_proj"}.
apply_lora_to_mlp (bool) – whether to apply LoRA to the MLP in each transformer layer. Default: False
apply_lora_to_output (bool) – whether to apply LoRA to the model’s final output projection. Default: False
num_classes (int) – number of classes for the classification layer.
vocab_size (int) – number of tokens in vocabulary.
num_layers (int) – number of layers in the transformer decoder.
num_heads (int) – number of query heads. For MHA this is also the number of heads for key and value.
num_kv_heads (int) – number of key and value heads. The user should ensure num_heads % num_kv_heads == 0. For standard MHA set num_kv_heads == num_heads, for GQA set num_kv_heads < num_heads, and for MQA set num_kv_heads == 1.
embed_dim (int) – embedding dimension for self-attention.
max_seq_len (int) – maximum sequence length the model will be run with.
intermediate_dim (int) – intermediate dimension for MLP.
attn_dropout (float) – dropout value passed onto scaled_dot_product_attention. Default: 0.0
norm_eps (float) – epsilon in RMS norms.
rope_base (int) – base for the rotary positional embeddings. Default: 10_000
lora_rank (int) – rank of each low-rank approximation.
lora_alpha (float) – scaling factor for the low-rank approximation.
lora_dropout (float) – LoRA dropout probability. Default: 0.0
use_dora (bool) – whether to decompose the LoRA weight into magnitude and direction, as introduced in “DoRA: Weight-Decomposed Low-Rank Adaptation” (https://arxiv.org/abs/2402.09353). Default: False
quantize_base (bool) – whether to quantize base model weights. Only applied to base weights within linear layers that LoRA is applied to. Quantization of the final output linear projection is not currently supported. Default: False
- Returns:
Instantiation of Mistral classifier model with LoRA applied to a subset of the attention projections in each layer.
- Return type:
TransformerDecoder
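Example (a minimal sketch; the hyperparameter values below loosely follow Mistral-7B sizes and are illustrative, not an official torchtune configuration):

>>> from torchtune.models.mistral import lora_mistral_classifier
>>> # Build a classifier with LoRA applied to the query and value projections only.
>>> # num_heads=32 with num_kv_heads=8 satisfies num_heads % num_kv_heads == 0 (GQA).
>>> model = lora_mistral_classifier(
...     lora_attn_modules=["q_proj", "v_proj"],
...     num_classes=2,
...     vocab_size=32_000,
...     num_layers=32,
...     num_heads=32,
...     num_kv_heads=8,
...     embed_dim=4096,
...     max_seq_len=32_768,
...     intermediate_dim=14_336,
...     lora_rank=8,
...     lora_alpha=16.0,
... )
>>> # Rough sanity check on adapter size, assuming adapter parameter names
>>> # contain "lora" (as in torchtune's LoRALinear modules).
>>> n_lora = sum(p.numel() for n, p in model.named_parameters() if "lora" in n)
>>> n_total = sum(p.numel() for p in model.parameters())

In a typical LoRA fine-tuning setup only the adapter parameters (and the classification head) are trained, while the base weights are loaded from a pretrained Mistral checkpoint and kept frozen.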