Translation procedure
The translation pipeline makes use of two separate transformer nets, encoder and decoder:
The encoder net features a "Function" NetEncoder that combines two net encoders: a "Class" NetEncoder encodes the source language into an integer code, while a "SubwordTokens" NetEncoder performs the BPE segmentation of the input text, also producing integer codes:
The source language (which has to be wrapped in underscores) is encoded into a single integer between 128,005 and 128,104, while the source text is encoded into a variable number of integers between 1 and 128,000. The special code 3 is appended at the end, acting as a control code signaling the end of the sentence:
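This encoding can be sketched by extracting the NetEncoder and applying it to a sample input (assuming the encoder net is stored in `encoder`; the language tag `"__eng_Latn__"` and the sample text are illustrative placeholders):

```
(* extract the combined "Function" NetEncoder attached to the encoder's input *)
netEnc = NetExtract[encoder, "Input"];

(* apply it to a {language, text} pair; the exact input format depends on the net *)
codes = netEnc[{"__eng_Latn__", "Hello world!"}]

(* expect: one code in [128005, 128104] for the language, codes in [1, 128000]
   for the subword tokens, and a trailing control code 3 *)
```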
The encoder net is run once, producing a length-1024 semantic vector for each input code:
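A sketch of this step (again assuming `encoder` holds the encoder net and using the same sample input):

```
features = encoder[{"__eng_Latn__", "Hello world!"}];

(* one length-1024 vector per input code *)
Dimensions[features]
```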
The decoding step involves running the decoder net several times in a recursive fashion, where each evaluation produces a subword token of the translated sentence. The decoder net has several inputs:
• Port "Input" takes the encoded features produced by the encoder. The data fed to this input is the same for every evaluation of the decoder.
• Port "Prev" takes the subword token generated by the previous evaluation of the decoder. Tokens are converted to integer codes by a "Class" NetEncoder.
• Port "Index" takes an integer keeping count of how many times the decoder was evaluated (positional encoding).
• Ports "State1", "State2" ... take the self-attention key and value arrays for all the past tokens. Their size grows by one at each evaluation. The default ("Size" -> "Small") decoder has 12 attention blocks, which makes for 24 states: 12 key arrays and 12 value arrays.
For the first evaluation of the decoder, port "Prev" takes EndOfString as input (which is converted to the control code 3), port "Index" takes the index 1 and the state ports take empty sequences. Perform the initial run of the decoder with all the initial inputs:
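The initial run can be sketched as follows (assuming `decoder` holds the decoder net, `features` is the encoder output, and the state ports are named "State1" ... "State24" as described above):

```
(* empty sequences for all 24 state ports of the "Small" decoder *)
initialStates = Association @ Table["State" <> ToString[i] -> {}, {i, 24}];

out = decoder[<|
   "Input" -> features,
   "Prev" -> EndOfString,  (* converted to the control code 3 *)
   "Index" -> 1,
   initialStates|>]
```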
The "Output" key of the decoder output contains the generated token. For the first evaluation, it is a language token that has no meaning and gets ignored:
The other keys of the decoder output contain new states that will be fed back as input in the next evaluation. Each state is a sequence of key or value arrays of dimensions {16, 64}; at this point the sequence length is 1, which shows that only one evaluation was performed:
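A sketch of this check (assuming the initial decoder result is in `out`, with the new states under every key other than "Output"):

```
(* each state is a length-1 sequence of {16, 64} arrays after one evaluation *)
Dimensions /@ KeyDrop[out, "Output"]
```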
The second run is where the first subword token of the output is generated. For this step, the "Prev" input takes the target language. It will take the previous token for all subsequent evaluations:
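A sketch of the second run, assuming `out` holds the first result, the output state keys match the decoder's state port names, and Italian (`"__ita_Latn__"`) is the hypothetical target language:

```
out2 = decoder[<|
   "Input" -> features,        (* same encoded features as before *)
   "Prev" -> "__ita_Latn__",   (* target language, for this step only *)
   "Index" -> 2,
   KeyDrop[out, "Output"]|>]   (* feed back the updated states *)
```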
Check the generated token and verify that the length of the output states is now 2:
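Continuing the sketch, with `out2` holding the result of the second run:

```
out2["Output"]  (* first subword token of the translation *)

(* each state sequence should now have length 2 *)
Dimensions /@ KeyDrop[out2, "Output"]
```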
The recursion keeps going until the EndOfString token is generated:
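The full loop can be sketched with NestWhile, under the same naming assumptions (state keys feed back directly, `out` holds the initial run, and `"__ita_Latn__"` is an example target language):

```
(* one decoding step: consume {tokens, prev, index, states}, produce the next *)
step[{tokens_, prev_, index_, states_}] := Module[{res},
   res = decoder[<|"Input" -> features, "Prev" -> prev,
      "Index" -> index, states|>];
   {Append[tokens, res["Output"]], res["Output"],
    index + 1, KeyDrop[res, "Output"]}];

(* start from evaluation 2: "Prev" is the target language,
   states come from the initial run; cap at 200 iterations *)
{tokens, prev, index, states} = NestWhile[step,
   {{}, "__ita_Latn__", 2, KeyDrop[out, "Output"]},
   #[[2]] =!= EndOfString &, 1, 200];

tokens = Most[tokens]  (* drop the trailing EndOfString *)
```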
The final output is obtained by concatenating all tokens. Check the translation result alongside the starting sentence:
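A sketch of the final assembly, assuming `tokens` collects the generated subword tokens and that the vocabulary uses the common "▁" word-boundary convention:

```
(* join subword tokens and restore word boundaries *)
translation = StringTrim @ StringReplace[StringJoin[tokens], "▁" -> " "]
```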