Wav2Vec2 XLSR-53 Trained on Multilingual Data

Transcribe multiple-language audio recordings

These models are derived from the "Wav2Vec2 Trained on LibriSpeech Data" family. The XLSR family learns cross-lingual speech representations by pre-training a single Wav2Vec2 model on the raw waveforms of utterances in multiple languages. The resulting model is then fine-tuned on labeled data, and experiments show that cross-lingual pre-training significantly outperforms monolingual pre-training.

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Wav2Vec2 XLSR-53 Trained on Multilingual Data"]
Out[2]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter. Inspect the available parameters:

In[3]:=
NetModel["Wav2Vec2 XLSR-53 Trained on Multilingual Data", "ParametersInformation"]
Out[4]=

Pick a non-default net by specifying the parameters:

In[5]:=
NetModel[{"Wav2Vec2 XLSR-53 Trained on Multilingual Data", "Language" -> "Italian"}]
Out[6]=

Pick a non-default uninitialized net:

In[7]:=
NetModel[{"Wav2Vec2 XLSR-53 Trained on Multilingual Data", "Language" -> "German"}, "UninitializedEvaluationNet"]
Out[8]=

Evaluation function

Define an evaluation function that runs the net and produces the final transcribed text:

In[9]:=
netevaluate[audio_, language_ : "Spanish"] := Module[{chars},
  (* run the net for the requested language; it returns a list of characters in which "|" marks a word boundary *)
  chars = NetModel[{"Wav2Vec2 XLSR-53 Trained on Multilingual Data", "Language" -> language}][audio];
  (* join the characters and restore the spaces between words *)
  StringReplace[StringJoin@chars, "|" -> " "]
  ]
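
The raw net output can also be inspected directly. A minimal sketch, assuming a short synthesized Spanish utterance as input (the SpeechSynthesize sample is only illustrative; any speech recording works):

(* illustrative synthesized sample; SpeechSynthesize uses the system's default voice *)
sample = SpeechSynthesize["hola mundo"];
(* raw output of the Spanish net: a list of characters, with "|" in place of spaces *)
NetModel[{"Wav2Vec2 XLSR-53 Trained on Multilingual Data", "Language" -> "Spanish"}][sample]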

Basic usage

Record an audio sample and transcribe it:

In[10]:=
record = AudioCapture[]
Out[11]=
In[12]:=
netevaluate[record]
Out[12]=
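
AudioCapture requires a microphone. As an alternative, a recording imported from disk can be transcribed in the same way (a sketch; the file path is hypothetical and stands for any speech recording in the default language):

sample = Import["path/to/recording.wav"];
netevaluate[sample]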

Evaluation for non-default languages

Get a set of utterances in different languages:

In[13]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/e49d0738-cee8-4492-85dd-962e5aad5d1e"]

Get transcriptions:

In[14]:=
Dataset@Map[netevaluate[Keys@#, Values@#] &, audios]
Out[14]=
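
The retrieved audios pairs each utterance with the language of the net to use, which is why Keys and Values feed the audio and the language name to netevaluate. A hedged sketch of the expected structure, using hypothetical file paths:

(* hypothetical paths standing in for the downloaded utterances *)
audios = {Import["path/to/italian.wav"] -> "Italian", Import["path/to/german.wav"] -> "German"};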

Feature extraction

Take the feature extractor from the trained net and aggregate the output so that the net produces a vector representation of an audio clip:

In[15]:=
extractor = NetAppend[
  NetTake[NetModel["Wav2Vec2 XLSR-53 Trained on Multilingual Data"], "FeatureExtractor"], "Mean" -> AggregationLayer[Mean, 1]]
Out[16]=
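
Each clip is mapped to a single fixed-length vector by mean-pooling the extractor output over time. A sketch of extracting one feature vector, assuming an Audio object named sample:

(* hypothetical sample; any Audio object works *)
vec = extractor[sample];
Dimensions[vec]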

Get a set of audio clips:

In[17]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/a5ac5c20-e5f1-4ac4-9cc3-9afa937e6640"]

Visualize the features of a set of audio clips:

In[18]:=
FeatureSpacePlot[audios, FeatureExtractor -> extractor, LabelingSize -> 90, LabelingFunction -> Callout, Method -> "PrincipalComponentsAnalysis"]
Out[18]=
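
The same extractor can be passed to other machine learning functions via the FeatureExtractor option; for example, a hedged sketch of clustering the clips by their learned representations:

(* group the clips by similarity in the learned feature space *)
FindClusters[audios, FeatureExtractor -> extractor]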

Net information

Inspect the sizes of all arrays in the net:

In[19]:=
Information[
 NetModel[
  "Wav2Vec2 XLSR-53 Trained on Multilingual Data"], "ArraysElementCounts"]
Out[20]=

Obtain the total number of parameters:

In[21]:=
Information[
 NetModel[
  "Wav2Vec2 XLSR-53 Trained on Multilingual Data"], "ArraysTotalElementCount"]
Out[22]=

Obtain the layer type counts:

In[23]:=
Information[
 NetModel[
  "Wav2Vec2 XLSR-53 Trained on Multilingual Data"], "LayerTypeCounts"]
Out[24]=

Display the summary graphic:

In[25]:=
Information[
 NetModel[
  "Wav2Vec2 XLSR-53 Trained on Multilingual Data"], "SummaryGraphic"]
Out[26]=

Requirements

Wolfram Language 13.2 (December 2022) or above

Resource History

Reference