Function Repository Resource:

AttentionBlock

Build a multi-head attention net

Contributed by: Maria Sargsyan

ResourceFunction["AttentionBlock"][dim, n]

constructs a multi-head self-attention network with n attention heads and hidden dimension dim.

Details and Options

AttentionBlock is a key component in transformer-based neural network architectures. It allows the model to attend to different parts of the input sequence simultaneously by splitting the attention mechanism into multiple heads. Each head focuses on a different subspace of the input, enabling the model to capture diverse features or relationships within the data.
The number of attention heads, n, must divide the hidden dimension dim (a quick check of this split is sketched after the option tables below).
Self-attention focuses on relationships within a single sequence, letting each element attend to all others in the same sequence.
Cross-attention connects two sequences, where one sequence attends to another, learning relationships between them.
RoPE (Rotary Position Embedding) encodes positional information by rotating query and key vectors, improving generalization to longer sequences.
The following properties are supported:
"Mask"Noneprevent certain patterns of attention
"CrossAttention"Falseallow attention between different sequences
RoPEFalseencode positional information using Rotary Position Embedding
Possible settings for "Mask" are the same as for AttentionLayer:
  None             no masking
  "Causal"         causal masking
  "Causal" -> n    local causal masking with a window of size n

Examples

Basic Examples (2) 

Construct a multi-head self-attention block with 8 attention heads and a hidden dimension of 64:

In[1]:=
ResourceFunction["AttentionBlock"][64, 8]
Out[1]=

Create a randomly initialized self-attention block and run it on a random input sequence of length 5:

In[2]:=
net = NetInitialize@
  NetReplacePart[ResourceFunction["AttentionBlock"][64, 8], "Input" -> {"Varying", 64}]
Out[2]=
In[3]:=
net[RandomReal[1, {5, 64}]] // Short
Out[3]=
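
A self-attention block normally preserves the shape of its input, producing one dim-dimensional vector per position. Assuming that holds for this block, the output dimensions can be checked directly:

Dimensions[net[RandomReal[1, {5, 64}]]]  (* expected {5, 64}: sequence length 5, hidden dimension 64 *)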

Scope (5) 

Attention Weights (5) 

Attention blocks are the building blocks of transformer architectures.

Take a model based on AttentionLayer:

In[4]:=
net = NetModel["BERT Trained on BookCorpus and Wikipedia Data"]
Out[4]=

This net contains several multi-head self-attention blocks with 12 heads each; for instance:

In[5]:=
NetExtract[net, {"encoder", 1, 1, "attention"}]
Out[5]=

The block shown above can be constructed as follows:

In[6]:=
ResourceFunction["AttentionBlock"][768, 12, "Mask" -> None]
Out[6]=

Extract the attention weights of this layer for a given input:

In[7]:=
input = "The cat liked playing with a ball of yarn";
weights = net[input, NetPort[{"encoder", 1, 1, "attention", -2, "AttentionWeights"}]];
tokens = {"[START]", Splice[StringSplit[input]], "[END]"}
Out[7]=

Visualize attention weights across different heads. Hover over any token on either side to filter attention to or from that token. Line opacity represents the average strength across heads, and colored blocks show head-specific weights:

In[8]:=
Graphics[
 {s1, heads, s2} = Dimensions[weights];
 (* average attention weight over the heads *)
 avgWeights = Mean[Transpose[weights]];
 headColors = Table[Hue[h/Max[heads - 1, 1]], {h, heads}];
 (* connection lines between the two token columns, opacity set by the average weight *)
 lines = Table[{Opacity[3 avgWeights[[i, j]]], Line[{{0, -i}, {4, -j}}]}, {i, s1}, {j, s2}];
 lines1 = Flatten[lines, {{1}, {2, 3}}];
 lines2 = Flatten[lines, {{2}, {1, 3}}];
 (* one colored block per head, opacity set by that head's weight *)
 attentionColors = Table[
   Graphics[{Opacity[weights[[i, h, j]], headColors[[h]]], Rectangle[{0, 0}, {1, 1}/8]},
    ImageSize -> Scaled[0.1^14.5]],
   {i, s1}, {j, s2}, {h, heads}];
 g = Table[Row[attentionColors[[i, j, All]]], {i, s1}, {j, s2}];
 attendTo = Table[Inset[g[[i, j]], {8, -j}], {i, s1}, {j, s2}];
 toAttend = Table[Inset[g[[i, j]], {-4, -i}], {j, s2}, {i, s1}];
 (* token labels with mouseover highlighting of their incoming and outgoing attention *)
 Table[
  t1 = Text[tokens[[i]], {-0.5, -i}];
  t2 = Text[tokens[[j]], {4.5, -j}];
  {Mouseover[t2, Style[Join[{Lighter[Red], t2}, Riffle[lines2[[j]], Red, {2, -1, 3}], toAttend[[j]]]]],
   Mouseover[t1, Style[Join[{Lighter[Blue], t1}, Riffle[lines1[[i]], Blue, {2, -1, 3}], attendTo[[i]]]]],
   Opacity[avgWeights[[i, j]]], Line[{{0, -i}, {4, -j}}]},
  {i, s1}, {j, s2}],
 ImageSize -> 600, FrameTicks -> None, PlotRange -> {{-7, 11}, {-s1 - 1, 0}}]
Out[8]=

Options (4) 

Enable causal masking:

In[9]:=
ResourceFunction["AttentionBlock"][64, 8, "Mask" -> "Causal"]
Out[9]=
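
For intuition, causal masking lets position i attend only to positions j <= i. The lower-triangular pattern below is just a sketch of that allowed-attention pattern for a length-5 sequence; it is not extracted from the constructed net:

MatrixForm[LowerTriangularize[ConstantArray[1, {5, 5}]]]  (* 1 where attention is allowed, 0 where it is masked *)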

Enable cross-attention for attending to external sequences:

In[10]:=
ResourceFunction["AttentionBlock"][64, 8, "CrossAttention" -> True]
Out[10]=

Enable Rotary Position Embedding for positional encoding:

In[11]:=
ResourceFunction["AttentionBlock"][64, 8, "RoPE" -> True]
Out[11]=
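
The rotary idea can be sketched in a few lines: each pair of query/key components at position m is rotated by an angle that scales with m. The helper name and the base frequency 10000 below are assumptions made for this sketch, not details of the resource function:

(* rotate consecutive component pairs of a vector by position-dependent angles; d must be even *)
ropeRotate[x_List, m_Integer] := Module[{d = Length[x], pairs, angles},
  pairs = Partition[x, 2];
  angles = m*10000.^(-2 Range[0, d/2 - 1]/d);
  Flatten[MapThread[
    {#1[[1]] Cos[#2] - #1[[2]] Sin[#2],
     #1[[1]] Sin[#2] + #1[[2]] Cos[#2]} &, {pairs, angles}]]]
ropeRotate[RandomReal[1, 8], 3]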

Enable Rotary Position Embedding with causal masking:

In[12]:=
ResourceFunction["AttentionBlock"][64, 8, "RoPE" -> True, "Mask" -> "Causal"]
Out[12]=

Possible Issues (2) 

The number of attention heads must divide the hidden dimension:

In[13]:=
ResourceFunction["AttentionBlock"][64, 7]
Out[13]=

"CrossAttention" and "RoPE" cannot be simultaneously enabled:

In[14]:=
ResourceFunction["AttentionBlock"][64, 8, "CrossAttention" -> True, "RoPE" -> True]
Out[14]=

Requirements

Wolfram Language 13.0 (December 2021) or above

Version History

  • 1.0.0 – 11 December 2024

License Information