

class torchtune.modules.TransformerDecoderLayer(attn: CausalSelfAttention, mlp: Module, sa_norm: Module, mlp_norm: Module)

Transformer layer derived from the Llama2 model. Normalization is applied before both the attention layer and the feed-forward (FF) layer.

Parameters:

  • attn (CausalSelfAttention) – Attention module.

  • mlp (nn.Module) – Feed-forward module.

  • sa_norm (nn.Module) – Normalization to be applied before self-attention.

  • mlp_norm (nn.Module) – Normalization to be applied before the feed-forward layer.
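The pre-normalization layout described above can be sketched in plain PyTorch. This is a minimal illustration, not torchtune's implementation: `ToyAttention` and `PreNormLayerSketch` are hypothetical stand-ins, `nn.LayerNorm` substitutes for whatever norm module is actually passed in, and the residual connections around each sub-block are an assumption based on the usual Llama2-style layer.

```python
import torch
import torch.nn.functional as F
from torch import nn


class ToyAttention(nn.Module):
    """Hypothetical single-head attention standing in for CausalSelfAttention."""

    def __init__(self, embed_dim: int) -> None:
        super().__init__()
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
        self.out = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x, *, mask=None, input_pos=None):
        # input_pos is accepted only for signature parity; this toy module
        # has no positional embeddings, so it is ignored.
        q, k, v = (t.unsqueeze(1) for t in self.qkv(x).chunk(3, dim=-1))
        # Boolean mask: True means "attend". Add a head dimension for broadcasting.
        attn_mask = None if mask is None else mask.unsqueeze(1)
        # Fall back to a causal mask when none is given.
        y = F.scaled_dot_product_attention(
            q, k, v, attn_mask=attn_mask, is_causal=mask is None
        )
        return self.out(y.squeeze(1))


class PreNormLayerSketch(nn.Module):
    """Hypothetical pre-norm layer taking the same four modules as arguments."""

    def __init__(self, attn, mlp, sa_norm, mlp_norm) -> None:
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.sa_norm, self.mlp_norm = sa_norm, mlp_norm

    def forward(self, x, *, mask=None, input_pos=None):
        # Normalize first, then attend, then add the residual (assumed).
        h = x + self.attn(self.sa_norm(x), mask=mask, input_pos=input_pos)
        # Normalize first, then apply the feed-forward module, then add the residual.
        return h + self.mlp(self.mlp_norm(h))


embed_dim = 16
layer = PreNormLayerSketch(
    attn=ToyAttention(embed_dim),
    mlp=nn.Sequential(
        nn.Linear(embed_dim, 4 * embed_dim),
        nn.SiLU(),
        nn.Linear(4 * embed_dim, embed_dim),
    ),
    sa_norm=nn.LayerNorm(embed_dim),
    mlp_norm=nn.LayerNorm(embed_dim),
)
x = torch.randn(2, 5, embed_dim)  # [batch_size x seq_length x embed_dim]
out = layer(x)                    # same shape as x
```

Passing the norms in as separate modules keeps the sketch agnostic to the normalization type, which mirrors how the constructor above takes `sa_norm` and `mlp_norm` as generic `nn.Module` arguments.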

forward(x: Tensor, *, mask: Optional[Tensor] = None, input_pos: Optional[Tensor] = None) → Tensor

Parameters:

  • x (Tensor) – input tensor with shape [batch_size x seq_length x embed_dim]

  • mask (Optional[Tensor]) – Optional boolean tensor which contains the attention mask with shape [batch_size x seq_length x seq_length]. This is applied after the query-key multiplication and before the softmax. A value of True in row i and column j means token i attends to token j. A value of False means token i does not attend to token j. If no mask is specified, a causal mask is used by default. Default is None.

  • input_pos (Optional[Tensor]) – Optional tensor containing the position id of each token. During training, it gives each token's position relative to its own sample when samples are packed, with shape [b x s]. During inference, it gives the position of the current token. If None, each token's index is assumed to be its position id. Default is None.
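The mask and position-id conventions above can be illustrated with a small sketch. It builds a boolean causal mask and, for a hypothetical packed sequence holding two samples of lengths 3 and 2, the matching `input_pos` tensor. Restricting attention to tokens of the same sample is an assumption about how packed batches are typically masked, not something this page specifies; `sample_lengths` and the other variable names are illustrative.

```python
import torch

seq_length = 5

# Causal mask: row i is True for every column j <= i, so token i attends
# only to itself and to earlier tokens.
causal = torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool))

# Hypothetical packed sequence: two samples of lengths 3 and 2.
sample_lengths = [3, 2]

# Position ids restart at 0 for each packed sample; shape [b x s] with b = 1.
input_pos = torch.cat([torch.arange(n) for n in sample_lengths]).unsqueeze(0)

# Sample id of every token, used to keep attention within a sample (assumed).
sample_ids = torch.repeat_interleave(
    torch.arange(len(sample_lengths)), torch.tensor(sample_lengths)
)
same_sample = sample_ids.unsqueeze(0) == sample_ids.unsqueeze(1)

# Final mask with shape [batch_size x seq_length x seq_length].
mask = (causal & same_sample).unsqueeze(0)
```

Here `mask[0, 4, 3]` is True because both tokens belong to the second sample and token 3 precedes token 4, while `mask[0, 3, 2]` is False because the two tokens come from different samples.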

Returns:

output tensor with the same shape as the input, [batch_size x seq_length x embed_dim]

Return type:

Tensor


Todo:

  • Make position of norm configurable

