Wav2Vec2 Trained on LibriSpeech Data

Transcribe an English audio recording

This family of models was trained using self-supervised learning to obtain powerful representations from speech audio alone, followed by fine-tuning on transcribed speech. Wav2Vec2 encodes raw speech audio into latent speech representations via a multilayer convolutional neural network. During training, spans of these latent representations are masked and fed to a transformer network that outputs contextualized representations, and the entire model is trained via a contrastive task in which the output at each masked time step is penalized for being distant from the true latent representation. Wav2Vec2 achieves state-of-the-art performance on the full LibriSpeech benchmark for noisy speech, while on the clean 100-hour LibriSpeech setup it outperforms the previous best result while using 100 times less labeled data.
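
For reference, the contrastive objective just described can be written out explicitly; the following formula is quoted from the original wav2vec 2.0 paper and is added here for clarity:

$$\mathcal{L}_m = -\log \frac{\exp\left(\operatorname{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\right)}{\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp\left(\operatorname{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\right)}$$

where $\mathbf{c}_t$ is the transformer output at masked time step $t$, $\mathbf{q}_t$ is the true quantized latent representation, $\mathbf{Q}_t$ consists of $\mathbf{q}_t$ plus $K$ distractors sampled from other masked time steps in the same utterance, $\operatorname{sim}(\mathbf{a},\mathbf{b}) = \mathbf{a}^{\top}\mathbf{b}/(\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert)$ is the cosine similarity and $\kappa$ is a temperature constant.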

Training Set Information

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Wav2Vec2 Trained on LibriSpeech Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter. Inspect the available parameters:

In[2]:=
NetModel["Wav2Vec2 Trained on LibriSpeech Data", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"Wav2Vec2 Trained on LibriSpeech Data", "Size" -> "Large"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"Wav2Vec2 Trained on LibriSpeech Data", "Size" -> "Large"}, "UninitializedEvaluationNet"]
Out[4]=

Evaluation function

Define an evaluation function that runs the net and produces the final transcribed text. The raw output of the net is a list of characters in which "|" marks a word boundary, so the function joins the characters and restores the spaces:

In[5]:=
netevaluate[audio_] := Module[{chars},
  (* run the net to get the predicted character sequence *)
  chars = NetModel["Wav2Vec2 Trained on LibriSpeech Data"][audio];
  (* join the characters and turn word boundaries into spaces *)
  StringReplace[StringJoin[chars], "|" -> " "]
  ]

Basic usage

Record an audio sample and transcribe it:

In[6]:=
record = AudioCapture[]
Out[6]=
In[7]:=
netevaluate[record]
Out[7]=

Try it on different audio samples. Notice that the output can contain spelling mistakes, especially with noisy audio, so a spellchecker is usually needed as a post-processing step (a sketch of one follows the example below):

In[8]:=
AssociationMap[netevaluate]@
 Map[ExampleData[{"Audio", #}] &, {"FemaleVoice", "MaleVoice", "NoisyTalk"}]
Out[8]=
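
A minimal sketch of such a spellchecking step, assuming the transcription is a plain string of space-separated words. The function spellfix is hypothetical and illustrative, not part of the model; it uses only the built-in DictionaryWordQ and SpellingCorrectionList:

spellfix[text_String] := StringRiffle[
  Map[
   Function[word,
    If[DictionaryWordQ[word], word,
     (* replace out-of-dictionary words with their top suggestion, if any *)
     With[{corrections = SpellingCorrectionList[word]},
      If[corrections === {}, word, First[corrections]]]]],
   StringSplit[ToLowerCase[text]]]]

For example, spellfix[netevaluate[record]] returns the transcription in lowercase with unrecognized words replaced by their closest dictionary match.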

Feature extraction

Take the feature extractor from the trained net and aggregate the output so that the net produces a vector representation of an audio clip:

In[9]:=
extractor = NetAppend[
  NetTake[NetModel["Wav2Vec2 Trained on LibriSpeech Data"], "FeatureExtractor"],
  "Mean" -> AggregationLayer[Mean, 1]]
Out[9]=
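
The aggregated extractor produces a fixed-length vector for any audio clip, so recordings can be compared directly. As a sketch, reusing the built-in example recordings from the previous section:

vec1 = extractor[ExampleData[{"Audio", "FemaleVoice"}]];
vec2 = extractor[ExampleData[{"Audio", "MaleVoice"}]];
(* smaller values indicate more similar utterances in feature space *)
CosineDistance[vec1, vec2]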

Get a set of utterances in English and Spanish:

In[10]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/7f452647-8ab1-4beb-a361-2eb460ae4984"]

Visualize the utterances in feature space:

In[11]:=
FeatureSpacePlot[audios, FeatureExtractor -> extractor, LabelingSize -> 90, LabelingFunction -> Callout]
Out[11]=
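
The same features can also be used programmatically. As a sketch, assuming audios is a plain list of Audio objects as passed to FeatureSpacePlot above, clustering the embeddings into two groups should roughly separate the English and Spanish utterances:

features = Map[extractor, audios];
(* partition the utterance indices by cosine distance between embeddings *)
FindClusters[features -> Range[Length[features]], 2, DistanceFunction -> CosineDistance]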

Net information

Inspect the sizes of all arrays in the net:

In[12]:=
Information[NetModel["Wav2Vec2 Trained on LibriSpeech Data"], "ArraysElementCounts"]
Out[12]=

Obtain the total number of parameters:

In[13]:=
Information[NetModel["Wav2Vec2 Trained on LibriSpeech Data"], "ArraysTotalElementCount"]
Out[13]=

Obtain the layer type counts:

In[14]:=
Information[NetModel["Wav2Vec2 Trained on LibriSpeech Data"], "LayerTypeCounts"]
Out[14]=

Display the summary graphic:

In[15]:=
Information[NetModel["Wav2Vec2 Trained on LibriSpeech Data"], "SummaryGraphic"]
Out[15]=

Requirements

Wolfram Language 13.2 (December 2022) or above

Resource History

Reference