VGGish Feature Extractor Trained on YouTube Data
Released by Google in 2017, this model extracts 128-dimensional embeddings from ~1-second-long audio signals. The model was trained on a large YouTube dataset (a preliminary version of what later became YouTube-8M).
Number of layers: 25 | Parameter count: 72,141,184 | Trained size: 289 MB
Examples
Resource retrieval
Get the pre-trained net:
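A minimal sketch in Wolfram Language, assuming the resource name matches this page's title:

    net = NetModel["VGGish Feature Extractor Trained on YouTube Data"]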
Basic usage
Extract semantic features from an Audio object:
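For instance, a short synthetic tone can stand in for recorded audio (the input here is purely illustrative):

    audio = AudioGenerator[{"Sin", 440}, 2]; (* 2 s of a 440 Hz tone *)
    embeddings = net[audio]
    (* one 128-dimensional vector per ~1 s of input, so 2 vectors here *)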
The extracted features are a sequence of 128-dimensional vectors of unsigned 8-bit integers. Visualize the relationship between the sounds using the network as a feature extractor:
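One possible sketch, using a few generated tones and noise bursts in place of recorded sounds (the sound set is illustrative):

    sounds = Join[
       Table[AudioGenerator[{"Sin", f}, 1], {f, {220, 440, 880}}],
       Table[AudioGenerator["White", 1], 3]
    ];
    FeatureSpacePlot[sounds, FeatureExtractor -> net]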
The output of the network itself is passed through a PCA transformation and a quantization step before being returned. To obtain the raw output of the net, remove the NetDecoder:
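For example, with NetReplacePart:

    rawNet = NetReplacePart[net, "Output" -> None]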
Extract the raw features:
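Continuing the sketch above:

    rawFeatures = rawNet[audio]
    (* real-valued 128-dimensional vectors rather than unsigned 8-bit integers *)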
Visualize the relationship between the sounds using the raw features as a feature extractor:
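Again with the illustrative sound set from before:

    FeatureSpacePlot[sounds, FeatureExtractor -> rawNet]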
Net information
Inspect the number of parameters of all arrays in the net:
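In Wolfram Language 12.0 or above, this can be done with Information:

    Information[net, "ArraysElementCounts"]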
Obtain the total number of parameters:
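The result should agree with the parameter count quoted above:

    Information[net, "ArraysTotalElementCount"]
    (* 72141184 *)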
Obtain the layer type counts:
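Similarly:

    Information[net, "LayerTypeCounts"]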
Display the summary graphic:
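And for a visual overview:

    Information[net, "SummaryGraphic"]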
Export to MXNet
Export the net into a format that can be opened in MXNet:
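For instance (the file name is arbitrary):

    Export["net.json", net, "MXNet"]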
Export also creates a net.params file containing parameters:
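Assuming the export above ran in the current directory:

    FileExistsQ["net.params"]
    (* True *)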
Get the size of the parameter file:
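Using FileByteCount:

    FileByteCount["net.params"]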
The size is similar to the byte count of the resource object:
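For comparison, again assuming the resource name used above:

    ResourceObject["VGGish Feature Extractor Trained on YouTube Data"]["ByteCount"]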
Represent the MXNet net as a graph:
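A sketch using MXNet import; the "NodeGraphPlot" import element is assumed to be available:

    Import["net.json", {"MXNet", "NodeGraphPlot"}]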
Requirements
Wolfram Language 12.0 (April 2019) or above
Resource History
Reference
- S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, K. Wilson, "CNN Architectures for Large-Scale Audio Classification," arXiv:1609.09430 (2017)
- Available from: https://github.com/tensorflow/models/tree/master/research/audioset
- Rights: Apache 2.0 License