Translation procedure
The translation pipeline makes use of two separate transformer nets, encoder and decoder:
The encoder net features a "Function" NetEncoder that combines two net encoders: a "Class" NetEncoder encodes the source language into an integer code, while a "SubwordTokens" NetEncoder performs the BPE segmentation of the input text, also producing integer codes:
The source language (which has to be wrapped in underscores) is encoded into a single integer between 128,005 and 128,104, while the source text is encoded into a variable number of integers between 1 and 128,000. The special code 3 is appended at the end, acting as a control code signaling the end of the sentence:
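This encoding can be sketched by extracting the NetEncoder and applying it to a sample input (assuming the encoder net is stored in `encoder`; the language tag `"__eng_Latn__"` and the sample text are illustrative placeholders):

```
(* extract the combined "Function" NetEncoder attached to the encoder's input *)
netEnc = NetExtract[encoder, "Input"];

(* apply it to a {language, text} pair; the exact input format depends on the net *)
codes = netEnc[{"__eng_Latn__", "Hello world!"}]

(* expect: one code in [128005, 128104] for the language, codes in [1, 128000]
   for the subword tokens, and a trailing control code 3 *)
```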
The encoder net is run once, producing a length-1024 semantic vector for each input code:
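A sketch of this step (again assuming `encoder` holds the encoder net and using the same sample input):

```
features = encoder[{"__eng_Latn__", "Hello world!"}];

(* one length-1024 vector per input code *)
Dimensions[features]
```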
The decoding step involves running the decoder net several times in a recursive fashion, where each evaluation produces a subword token of the translated sentence. The decoder net has several inputs:
• Port "Input" takes the encoded features produced by the encoder. The data fed to this input is the same for every evaluation of the decoder.
• Port "Prev" takes the subword token generated by the previous evaluation of the decoder. Tokens are converted to integer codes by a "Class" NetEncoder.
• Port "Index" takes an integer keeping count of how many times the decoder was evaluated (positional encoding).
• Ports "State1", "State2" ... take the self-attention key and value arrays for all the past tokens. Their size grows by one at each evaluation. The default ("Size" -> "Small") decoder has 12 attention blocks, which makes for 24 states: 12 key arrays and 12 value arrays.
For the first evaluation of the decoder, port "Prev" takes EndOfString as input (which is converted to the control code 3), port "Index" takes the index 1 and the state ports take empty sequences. Perform the initial run of the decoder with all the initial inputs:
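The initial run can be sketched as follows (assuming `decoder` holds the decoder net, `features` is the encoder output, and the state ports are named "State1" ... "State24" as described above):

```
(* empty sequences for all 24 state ports of the "Small" decoder *)
initialStates = Association @ Table["State" <> ToString[i] -> {}, {i, 24}];

out = decoder[<|
   "Input" -> features,
   "Prev" -> EndOfString,  (* converted to the control code 3 *)
   "Index" -> 1,
   initialStates|>]
```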
The "Output" key of the decoder output contains the generated token. For the first evaluation, it is a language token that has no meaning and gets ignored:
The other keys of the decoder output contain new states that will be fed back as input in the next evaluation. Each state is a sequence of key or value arrays of dimensions {16, 64}; at this point the sequence length is 1, which shows that only one evaluation was performed:
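A sketch of this check (assuming the initial decoder result is in `out`, with the new states under every key other than "Output"):

```
(* each state is a length-1 sequence of {16, 64} arrays after one evaluation *)
Dimensions /@ KeyDrop[out, "Output"]
```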
The second run is where the first subword token of the output is generated. For this step, the "Prev" input takes the target language. It will take the previous token for all subsequent evaluations:
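A sketch of the second run, assuming `out` holds the first result, the output state keys match the decoder's state port names, and Italian (`"__ita_Latn__"`) is the hypothetical target language:

```
out2 = decoder[<|
   "Input" -> features,        (* same encoded features as before *)
   "Prev" -> "__ita_Latn__",   (* target language, for this step only *)
   "Index" -> 2,
   KeyDrop[out, "Output"]|>]   (* feed back the updated states *)
```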
Check the generated token and verify that the length of the output states is now 2:
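Continuing the sketch, with `out2` holding the result of the second run:

```
out2["Output"]  (* first subword token of the translation *)

(* each state sequence should now have length 2 *)
Dimensions /@ KeyDrop[out2, "Output"]
```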
The recursion keeps going until the EndOfString token is generated:
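The full loop can be sketched with NestWhile, under the same naming assumptions (state keys feed back directly, `out` holds the initial run, and `"__ita_Latn__"` is an example target language):

```
(* one decoding step: consume {tokens, prev, index, states}, produce the next *)
step[{tokens_, prev_, index_, states_}] := Module[{res},
   res = decoder[<|"Input" -> features, "Prev" -> prev,
      "Index" -> index, states|>];
   {Append[tokens, res["Output"]], res["Output"],
    index + 1, KeyDrop[res, "Output"]}];

(* start from evaluation 2: "Prev" is the target language,
   states come from the initial run; cap at 200 iterations *)
{tokens, prev, index, states} = NestWhile[step,
   {{}, "__ita_Latn__", 2, KeyDrop[out, "Output"]},
   #[[2]] =!= EndOfString &, 1, 200];

tokens = Most[tokens]  (* drop the trailing EndOfString *)
```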
The final output is obtained by concatenating all tokens. Check the translation result alongside the starting sentence:
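A sketch of the final assembly, assuming `tokens` collects the generated subword tokens and that the vocabulary uses the common "▁" word-boundary convention:

```
(* join subword tokens and restore word boundaries *)
translation = StringTrim @ StringReplace[StringJoin[tokens], "▁" -> " "]
```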