Wav2Vec2 Trained on Multiple Datasets
These models are derived from the "Wav2Vec2 Trained on LibriSpeech Data" family. They explore more general setups in which the domain of the unlabeled pre-training data differs from the domain of the labeled fine-tuning data, which may in turn differ from the domain of the test data. The results show that pre-training on multiple domains improves generalization to domains not seen during training. The models use a single large Wav2Vec2 model pre-trained on four domains (Libri-Light, Switchboard, Fisher and Common Voice) and are fine-tuned on the LibriSpeech and Switchboard datasets.
Resource retrieval
Get the pre-trained net:
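A minimal sketch of the retrieval call. The resource name below is an assumption inferred from the page title, so adjust it if the repository lists the model under a different name:

```wolfram
(* Assumption: the resource name matches the page title *)
net = NetModel["Wav2Vec2 Trained on Multiple Datasets"]
```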
NetModel parameters
This model consists of a family of individual nets, each identified by a specific parameter. Inspect the available parameters:
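A sketch of the parameter query, using the standard "ParametersInformation" property of NetModel. The resource name is again assumed from the page title:

```wolfram
(* Lists each parameter with its default and allowed values *)
(* Assumption: the resource name matches the page title *)
NetModel["Wav2Vec2 Trained on Multiple Datasets", "ParametersInformation"]
```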
Pick a non-default net by specifying the parameters:
Pick a non-default uninitialized net:
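A sketch of how an uninitialized net can be requested through the "UninitializedEvaluationNet" property. It is shown for the default member of the family, because the parameter names and values of this model are not listed here. A non-default member would use the general form NetModel[{name, parameter -> value}, "UninitializedEvaluationNet"], with the parameter names taken from "ParametersInformation":

```wolfram
(* Assumption: the resource name matches the page title *)
(* Returns the architecture without trained weights *)
NetModel["Wav2Vec2 Trained on Multiple Datasets", "UninitializedEvaluationNet"]
```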
Evaluation function
Define an evaluation function that runs the net and produces the final transcribed text:
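The text post-processing in such a function can be sketched as follows. This is not the original definition. It assumes the net emits one character token per frame in the usual Wav2Vec2 CTC style, with "<pad>" as the blank token and "|" as the word separator. The helper name ctcCollapse is hypothetical:

```wolfram
(* Hypothetical helper: turns per-frame character tokens into text *)
(* Assumptions: "<pad>" is the CTC blank, "|" marks word boundaries *)
ctcCollapse[tokens_List] := StringTrim[
  StringReplace[
    StringJoin[
      (* merge consecutive repeats, then drop the blank tokens *)
      DeleteCases[First /@ Split[tokens], "<pad>"]
    ],
    "|" -> " "
  ]
]

ctcCollapse[{"h", "h", "<pad>", "e", "l", "<pad>", "l", "o", "|", "|"}]
(* "hello" *)
```

The blank between the two "l" tokens is what keeps the double letter: repeats are merged before the blanks are removed.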
Basic usage
Record an audio sample and transcribe it:
Try it on different audio samples. Note that the output can contain spelling mistakes, especially with noisy audio, so a spellchecker is usually needed as a post-processing step:
Feature extraction
Take the feature extractor from the trained net and aggregate the output so that the net produces a vector representation of an audio clip:
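The aggregation step alone can be illustrated on stand-in data. The sketch below does not touch the actual net, whose internal layer names are not given here. It only shows how AggregationLayer[Mean, 1] averages a sequence of per-frame vectors over the time dimension to give one vector per clip. The frame count and feature size are arbitrary placeholders:

```wolfram
(* Stand-in for per-frame features: 50 frames of 8-dimensional vectors *)
frames = RandomReal[1, {50, 8}];

(* Average over the first (time) dimension to get one vector per clip *)
pool = AggregationLayer[Mean, 1];
Dimensions[pool[frames]]
```

In the real extractor, a layer like this would be appended to the truncated net, for example with NetAppend, so that clips of any duration map to vectors of the same size.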
Get a set of utterances in various languages:
Visualize the features of a set of audio clips:
Net information
Inspect the sizes of all arrays in the net:
Obtain the total number of parameters:
Obtain the layer type counts:
Display the summary graphic:
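The four queries in this section correspond to standard Information properties of a net. A sketch, with the resource name assumed from the page title:

```wolfram
(* Assumption: the resource name matches the page title *)
net = NetModel["Wav2Vec2 Trained on Multiple Datasets"];

Information[net, "ArraysElementCounts"]      (* size of every array *)
Information[net, "ArraysTotalElementCount"]  (* total number of parameters *)
Information[net, "LayerTypeCounts"]          (* how many layers of each type *)
Information[net, "SummaryGraphic"]           (* summary graphic of the net *)
```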
Wolfram Language (December 2022) or above
W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, M. Auli, "Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-training," arXiv:2104.01027 (2021)
- Available from: https://github.com/facebookresearch/fairseq
MIT License