Function Repository Resource:

AttentionBlock

Build a multi-head attention net

Contributed by: Maria Sargsyan

ResourceFunction["AttentionBlock"][dim, n]

constructs a multi-head self-attention network with n attention heads and hidden dimension dim.

Details and Options

AttentionBlock is a key component in transformer-based neural network architectures. It allows the model to attend to different parts of the input sequence simultaneously by splitting the attention mechanism into multiple heads. Each head focuses on a different subspace of the input, enabling the model to capture diverse features or relationships within the data.
The number of attention heads, n, must divide the hidden dimension dim (a quick check of this split is sketched after the option tables below).
Self-attention focuses on relationships within a single sequence, letting each element attend to all others in the same sequence.
Cross-attention connects two sequences, where one sequence attends to another, learning relationships between them.
RoPE (Rotary Position Embedding) encodes positional information by rotating query and key vectors, improving generalization to longer sequences.
The following properties are supported:
"Mask"Noneprevent certain patterns of attention
"CrossAttention"Falseallow attention between different sequences
RoPEFalseencode positional information using Rotary Position Embedding
Possible settings for "Mask" are the same as for AttentionLayer:
  None             no masking
  "Causal"         causal masking
  "Causal" -> n    local causal masking with a window of size n

Examples

Basic Examples (2) 

Construct a multi-head self-attention block with 8 attention heads and a hidden dimension of 64:

In[1]:=
ResourceFunction["AttentionBlock"][64, 8]
Out[1]=

Create a randomly initialized self-attention block and run it on a random input sequence of length 5:

In[2]:=
net = NetInitialize@
  NetReplacePart[ResourceFunction["AttentionBlock"][64, 8], "Input" -> {"Varying", 64}]
Out[2]=
In[3]:=
net[RandomReal[1, {5, 64}]] // Short
Out[3]=
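
A self-attention block normally preserves the shape of its input, producing one dim-dimensional vector per position. Assuming that holds for this block, the output dimensions can be checked directly:

Dimensions[net[RandomReal[1, {5, 64}]]]  (* expected {5, 64}: sequence length 5, hidden dimension 64 *)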

Scope (5) 

Attention Weights (5) 

Attention blocks are the building blocks of transformer architectures.

Take a model based on AttentionLayer:

In[4]:=
net = NetModel["BERT Trained on BookCorpus and Wikipedia Data"]
Out[4]=

This net contains several multi-head self-attention blocks with 12 heads each; for instance:

In[5]:=
NetExtract[net, {"encoder", 1, 1, "attention"}]
Out[5]=

The block shown above can be constructed as follows:

In[6]:=
ResourceFunction["AttentionBlock"][768, 12, "Mask" -> None]
Out[6]=

Extract the attention weights of this layer for a given input:

In[7]:=
input = "The cat liked playing with a ball of yarn";
weights = net[input, NetPort[{"encoder", 1, 1, "attention", -2, "AttentionWeights"}]];
tokens = {"[START]", Splice[StringSplit[input]], "[END]"}
Out[7]=

Visualize attention weights across different heads. Hover over any token on either side to filter attention to or from that token. Line opacity represents the average strength across heads, and colored blocks show head-specific weights:

In[8]:=
Graphics[
 {s1, heads, s2} = Dimensions[weights];
 (* average attention weight over the heads *)
 avgWeights = Mean[Transpose[weights]];
 headColors = Table[Hue[h/Max[heads - 1, 1]], {h, heads}];
 (* connection lines between the two token columns, opacity set by the average weight *)
 lines = Table[{Opacity[3 avgWeights[[i, j]]], Line[{{0, -i}, {4, -j}}]}, {i, s1}, {j, s2}];
 lines1 = Flatten[lines, {{1}, {2, 3}}];
 lines2 = Flatten[lines, {{2}, {1, 3}}];
 (* one colored block per head, opacity set by that head's weight *)
 attentionColors = Table[
   Graphics[{Opacity[weights[[i, h, j]], headColors[[h]]], Rectangle[{0, 0}, {1, 1}/8]},
    ImageSize -> Scaled[0.1^14.5]],
   {i, s1}, {j, s2}, {h, heads}];
 g = Table[Row[attentionColors[[i, j, All]]], {i, s1}, {j, s2}];
 attendTo = Table[Inset[g[[i, j]], {8, -j}], {i, s1}, {j, s2}];
 toAttend = Table[Inset[g[[i, j]], {-4, -i}], {j, s2}, {i, s1}];
 (* token labels with mouseover highlighting of their incoming and outgoing attention *)
 Table[
  t1 = Text[tokens[[i]], {-0.5, -i}];
  t2 = Text[tokens[[j]], {4.5, -j}];
  {Mouseover[t2, Style[Join[{Lighter[Red], t2}, Riffle[lines2[[j]], Red, {2, -1, 3}], toAttend[[j]]]]],
   Mouseover[t1, Style[Join[{Lighter[Blue], t1}, Riffle[lines1[[i]], Blue, {2, -1, 3}], attendTo[[i]]]]],
   Opacity[avgWeights[[i, j]]], Line[{{0, -i}, {4, -j}}]},
  {i, s1}, {j, s2}],
 ImageSize -> 600, FrameTicks -> None, PlotRange -> {{-7, 11}, {-s1 - 1, 0}}]
Out[8]=

Options (4) 

Enable causal masking:

In[9]:=
ResourceFunction["AttentionBlock"][64, 8, "Mask" -> "Causal"]
Out[9]=
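
For intuition, causal masking lets position i attend only to positions j <= i. The lower-triangular pattern below is just a sketch of that allowed-attention pattern for a length-5 sequence; it is not extracted from the constructed net:

MatrixForm[LowerTriangularize[ConstantArray[1, {5, 5}]]]  (* 1 where attention is allowed, 0 where it is masked *)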

Enable cross-attention for attending to external sequences:

In[10]:=
ResourceFunction["AttentionBlock"][64, 8, "CrossAttention" -> True]
Out[10]=

Enable Rotary Position Embedding for positional encoding:

In[11]:=
ResourceFunction["AttentionBlock"][64, 8, "RoPE" -> True]
Out[11]=
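
The rotary idea can be sketched in a few lines: each pair of query/key components at position m is rotated by an angle that scales with m. The helper name and the base frequency 10000 below are assumptions made for this sketch, not details of the resource function:

(* rotate consecutive component pairs of a vector by position-dependent angles; d must be even *)
ropeRotate[x_List, m_Integer] := Module[{d = Length[x], pairs, angles},
  pairs = Partition[x, 2];
  angles = m*10000.^(-2 Range[0, d/2 - 1]/d);
  Flatten[MapThread[
    {#1[[1]] Cos[#2] - #1[[2]] Sin[#2],
     #1[[1]] Sin[#2] + #1[[2]] Cos[#2]} &, {pairs, angles}]]]
ropeRotate[RandomReal[1, 8], 3]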

Enable Rotary Position Embedding with causal masking:

In[12]:=
ResourceFunction["AttentionBlock"][64, 8, "RoPE" -> True, "Mask" -> "Causal"]
Out[12]=

Possible Issues (2) 

The number of attention heads must divide the hidden dimension:

In[13]:=
ResourceFunction["AttentionBlock"][64, 7]
Out[13]=

"CrossAttention" and "RoPE" cannot be simultaneously enabled:

In[14]:=
ResourceFunction["AttentionBlock"][64, 8, "CrossAttention" -> True, "RoPE" -> True]
Out[14]=

Requirements

Wolfram Language 13.0 (December 2021) or above

Version History

  • 1.0.0 – 11 December 2024

License Information