
TransformerDecoderLayer

class torchtune.modules.TransformerDecoderLayer(attn: CausalSelfAttention, mlp: Module, sa_norm: Module, mlp_norm: Module)

Transformer decoder layer derived from the Llama2 model. Normalization is applied before both the self-attention and the feed-forward (FF) layers.

Parameters:
  • attn (CausalSelfAttention) – Attention module.

  • mlp (nn.Module) – Feed-forward module.

  • sa_norm (nn.Module) – Normalization to be applied before self-attention.

  • mlp_norm (nn.Module) – Normalization to be applied before the feed-forward layer.
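The four constructor arguments map onto a pre-normalization block: each sub-layer sees a normalized copy of its input. The following is a minimal sketch in plain PyTorch, not torchtune's implementation. The names ToySelfAttention and PreNormBlock are hypothetical, the residual connections are an assumption based on the Llama2 architecture, nn.LayerNorm stands in for whichever normalization module is actually passed, and F.scaled_dot_product_attention requires PyTorch 2.0 or later.

```python
from typing import Optional

import torch
import torch.nn.functional as F
from torch import nn


class ToySelfAttention(nn.Module):
    """Hypothetical single-head stand-in for an attention module."""

    def __init__(self, embed_dim: int) -> None:
        super().__init__()
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
        self.out = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x: torch.Tensor, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Boolean mask: True means "attend". Fall back to a causal mask if none is given.
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, is_causal=mask is None)
        return self.out(y)


class PreNormBlock(nn.Module):
    """Hypothetical pre-norm layer: normalize, apply sub-layer, add residual (assumed)."""

    def __init__(self, attn: nn.Module, mlp: nn.Module, sa_norm: nn.Module, mlp_norm: nn.Module) -> None:
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.sa_norm, self.mlp_norm = sa_norm, mlp_norm

    def forward(self, x: torch.Tensor, *, mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Norm is applied before self-attention, then the input is added back.
        h = x + self.attn(self.sa_norm(x), mask=mask)
        # Norm is applied before the feed-forward module, then h is added back.
        return h + self.mlp(self.mlp_norm(h))


torch.manual_seed(0)
embed_dim = 16
block = PreNormBlock(
    attn=ToySelfAttention(embed_dim),
    mlp=nn.Sequential(
        nn.Linear(embed_dim, 4 * embed_dim), nn.SiLU(), nn.Linear(4 * embed_dim, embed_dim)
    ),
    sa_norm=nn.LayerNorm(embed_dim),   # stand-in normalization
    mlp_norm=nn.LayerNorm(embed_dim),  # stand-in normalization
)
x = torch.randn(2, 5, embed_dim)  # [batch_size x seq_length x embed_dim]
out = block(x)                    # same shape as x
```

Because the sketch defaults to a causal mask when none is passed, perturbing the last token of x leaves the outputs at earlier positions unchanged.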

forward(x: Tensor, *, mask: Optional[Tensor] = None, input_pos: Optional[Tensor] = None) → Tensor
Parameters:
  • x (Tensor) – input tensor with shape [batch_size x seq_length x embed_dim]

  • mask (Optional[Tensor]) – Optional boolean tensor which contains the attention mask with shape [batch_size x seq_length x seq_length]. This is applied after the query-key multiplication and before the softmax. A value of True in row i and column j means token i attends to token j. A value of False means token i does not attend to token j. If no mask is specified, a causal mask is used by default. Default is None.

  • input_pos (Optional[Tensor]) – Optional tensor which contains the position ids of each token. During training, this indicates the position of each token relative to its sample when samples are packed, with shape [batch_size x seq_length]. During inference, this indicates the position of the current token. If None, the index of each token is assumed to be its position id. Default is None.
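To make the mask and input_pos conventions concrete, the sketch below builds both for one packed row holding two samples of lengths 3 and 2. It only illustrates the conventions described above (True in row i, column j means token i attends to token j, and positions restart at each sample). The sample lengths and variable names are assumptions chosen for illustration, and this is not necessarily how torchtune constructs these tensors internally.

```python
import torch

sample_lens = [3, 2]          # hypothetical lengths of the samples packed into one row
seq_length = sum(sample_lens)

# Position of each token relative to its own sample: [0, 1, 2, 0, 1]
input_pos = torch.cat([torch.arange(n) for n in sample_lens])

# Sample id of each token: [0, 0, 0, 1, 1]
sample_ids = torch.repeat_interleave(
    torch.arange(len(sample_lens)), torch.tensor(sample_lens)
)

# Token i may attend to token j only if both belong to the same sample ...
same_sample = sample_ids[:, None] == sample_ids[None, :]
# ... and j is not in the future of i (lower-triangular causal constraint).
causal = torch.tril(torch.ones(seq_length, seq_length, dtype=torch.bool))

# Add the batch dimension expected by the shapes documented above.
mask = (same_sample & causal)[None]   # [batch_size x seq_length x seq_length]
input_pos = input_pos[None]           # [batch_size x seq_length]
```

Here the first token of the second sample (row 3) attends only to itself: the causal constraint hides later tokens and the sample constraint hides the first sample.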

Returns:
  Output tensor with the same shape as the input, [batch_size x seq_length x embed_dim].

Return type:
  Tensor

Todo

  • Make position of norm configurable
