Wav2Vec2 XLSR-53 Trained on Multilingual Data

Transcribe multiple-language audio recordings

These models are derived from the "Wav2Vec2 Trained on LibriSpeech Data" family. The XLSR family learns cross-lingual speech representations by pre-training a single Wav2Vec2 model on the raw waveforms of utterances in multiple languages. The resulting model is then fine-tuned on labeled data, and experiments show that cross-lingual pre-training significantly outperforms monolingual pre-training.

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Wav2Vec2 XLSR-53 Trained on Multilingual Data"]
Out[2]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter. Inspect the available parameters:

In[3]:=
NetModel["Wav2Vec2 XLSR-53 Trained on Multilingual Data", "ParametersInformation"]
Out[4]=

Pick a non-default net by specifying the parameters:

In[5]:=
NetModel[{"Wav2Vec2 XLSR-53 Trained on Multilingual Data", "Language" -> "Italian"}]
Out[6]=

Pick a non-default uninitialized net:

In[7]:=
NetModel[{"Wav2Vec2 XLSR-53 Trained on Multilingual Data", "Language" -> "German"}, "UninitializedEvaluationNet"]
Out[8]=

Evaluation function

Define an evaluation function that runs the net and produces the final transcribed text:

In[9]:=
netevaluate[audio_, language_ : "Spanish"] := Module[{chars},
  (* run the net for the requested language; it returns a list of characters in which "|" marks a word boundary *)
  chars = NetModel[{"Wav2Vec2 XLSR-53 Trained on Multilingual Data", "Language" -> language}][audio];
  (* join the characters and restore the spaces between words *)
  StringReplace[StringJoin@chars, "|" -> " "]
  ]
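
The raw net output can also be inspected directly. A minimal sketch, assuming a short synthesized Spanish utterance as input (the SpeechSynthesize sample is only illustrative; any speech recording works):

(* illustrative synthesized sample; SpeechSynthesize uses the system's default voice *)
sample = SpeechSynthesize["hola mundo"];
(* raw output of the Spanish net: a list of characters, with "|" in place of spaces *)
NetModel[{"Wav2Vec2 XLSR-53 Trained on Multilingual Data", "Language" -> "Spanish"}][sample]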

Basic usage

Record an audio sample and transcribe it:

In[10]:=
record = AudioCapture[]
Out[11]=
In[12]:=
netevaluate[record]
Out[12]=
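
AudioCapture requires a microphone. As an alternative, a recording imported from disk can be transcribed in the same way (a sketch; the file path is hypothetical and stands for any speech recording in the default language):

sample = Import["path/to/recording.wav"];
netevaluate[sample]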

Evaluation for non-default languages

Get a set of utterances in different languages:

In[13]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/e49d0738-cee8-4492-85dd-962e5aad5d1e"]

Get transcriptions:

In[14]:=
Dataset@Map[netevaluate[Keys@#, Values@#] &, audios]
Out[14]=
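
The retrieved audios pairs each utterance with the language of the net to use, which is why Keys and Values feed the audio and the language name to netevaluate. A hedged sketch of the expected structure, using hypothetical file paths:

(* hypothetical paths standing in for the downloaded utterances *)
audios = {Import["path/to/italian.wav"] -> "Italian", Import["path/to/german.wav"] -> "German"};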

Feature extraction

Take the feature extractor from the trained net and aggregate the output so that the net produces a vector representation of an audio clip:

In[15]:=
extractor = NetAppend[
  NetTake[NetModel["Wav2Vec2 XLSR-53 Trained on Multilingual Data"], "FeatureExtractor"], "Mean" -> AggregationLayer[Mean, 1]]
Out[16]=
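
Each clip is mapped to a single fixed-length vector by mean-pooling the extractor output over time. A sketch of extracting one feature vector, assuming an Audio object named sample:

(* hypothetical sample; any Audio object works *)
vec = extractor[sample];
Dimensions[vec]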

Get a set of audio clips:

In[17]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/a5ac5c20-e5f1-4ac4-9cc3-9afa937e6640"]

Visualize the features of a set of audio clips:

In[18]:=
FeatureSpacePlot[audios, FeatureExtractor -> extractor, LabelingSize -> 90, LabelingFunction -> Callout, Method -> "PrincipalComponentsAnalysis"]
Out[18]=
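
The same extractor can be passed to other machine learning functions via the FeatureExtractor option; for example, a hedged sketch of clustering the clips by their learned representations:

(* group the clips by similarity in the learned feature space *)
FindClusters[audios, FeatureExtractor -> extractor]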

Net information

Inspect the sizes of all arrays in the net:

In[19]:=
Information[
 NetModel[
  "Wav2Vec2 XLSR-53 Trained on Multilingual Data"], "ArraysElementCounts"]
Out[20]=

Obtain the total number of parameters:

In[21]:=
Information[
 NetModel[
  "Wav2Vec2 XLSR-53 Trained on Multilingual Data"], "ArraysTotalElementCount"]
Out[22]=

Obtain the layer type counts:

In[23]:=
Information[
 NetModel[
  "Wav2Vec2 XLSR-53 Trained on Multilingual Data"], "LayerTypeCounts"]
Out[24]=

Display the summary graphic:

In[25]:=
Information[
 NetModel[
  "Wav2Vec2 XLSR-53 Trained on Multilingual Data"], "SummaryGraphic"]
Out[26]=

Requirements

Wolfram Language 13.2 (December 2022) or above

Resource History

Reference