Details and Options
AttentionBlock is a key component in transformer-based neural network architectures. It allows the model to attend to different parts of the input sequence simultaneously by splitting the attention mechanism into multiple heads. Each head focuses on a different subspace of the input, enabling the model to capture diverse features or relationships within the data.
The number of attention heads n must evenly divide the dimension dim, so that each head operates on a dim/n-dimensional slice of the embedding.
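The divisibility requirement reflects how multi-head attention partitions the embedding: each token vector is cut into n slices of length dim/n, one per head. The following is a minimal Wolfram Language sketch of that split using illustrative sizes, not AttentionBlock's internal code:

```wolfram
(* illustrative split of a dim-dimensional embedding across n heads; *)
(* dim = 12 and n = 3 are example values, and n must divide dim *)
dim = 12; n = 3; headDim = dim/n;
x = RandomReal[{-1, 1}, {5, dim}];          (* a sequence of 5 tokens *)
headInputs = Partition[#, headDim] & /@ x;  (* each token split into n slices of length headDim *)
Dimensions[headInputs]                      (* {5, 3, 4}: tokens × heads × per-head dimension *)
```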
Self-attention operates within a single sequence: each element attends to every element of that same sequence, including itself.
Cross-attention connects two sequences: elements of one sequence attend to the elements of another, learning relationships between them.
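The distinction can be made concrete with plain matrix operations. The following is a rough sketch of scaled dot-product attention, not AttentionBlock's own implementation; the sequences x and y and their sizes are illustrative:

```wolfram
softmax[v_] := Exp[v - Max[v]]/Total[Exp[v - Max[v]]];
attend[q_, k_, v_] := (softmax /@ (q . Transpose[k]/Sqrt[N @ Last[Dimensions[k]]])) . v;
x = RandomReal[{-1, 1}, {4, 8}];    (* a sequence of 4 tokens of dimension 8 *)
y = RandomReal[{-1, 1}, {6, 8}];    (* a second sequence of 6 tokens *)
selfOut  = attend[x, x, x];         (* self-attention: x attends to itself *)
crossOut = attend[x, y, y];         (* cross-attention: x attends to y *)
Dimensions /@ {selfOut, crossOut}   (* both {4, 8}: one output per query token *)
```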
RoPE (Rotary Position Embedding) encodes positional information by rotating query and key vectors by position-dependent angles, which can improve generalization to longer sequences.
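The rotary idea can be sketched directly: consecutive coordinate pairs of a query or key vector are rotated by angles proportional to the token position, so relative positions show up as relative rotations inside the attention dot products. This is an illustrative sketch rather than AttentionBlock's implementation; the base 10000 follows the original RoPE paper.

```wolfram
ropeRotate[vec_, pos_] := Module[{d = Length[vec]},
  Flatten[MapIndexed[
    RotationMatrix[pos*10000.^(-2 (First[#2] - 1)/d)] . #1 &,
    Partition[vec, 2]]]];
q = RandomReal[{-1, 1}, 8];
k = RandomReal[{-1, 1}, 8];
(* the rotated query-key dot product depends only on the relative offset of the positions *)
Chop[ropeRotate[q, 7] . ropeRotate[k, 5] - ropeRotate[q, 3] . ropeRotate[k, 1]]   (* 0 *)
```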
The following options are supported:

| option name | default value | description |
|---|---|---|
| "Mask" | None | prevent certain patterns of attention |
| "CrossAttention" | False | allow attention between different sequences |
| "RoPE" | False | encode positional information using Rotary Position Embedding |
Possible settings for "Mask" include:

| setting | description |
|---|---|
| None | no masking |
| "Causal" | causal masking |
| "Causal" → n | local causal masking with a window of size n |