Wolfram Research

VGGish Feature Extractor Trained on YouTube Data

Represent sounds as a sequence of vectors

Released by Google in 2017, this model extracts 128-dimensional embeddings from audio signals approximately one second long. The model was trained on a large YouTube dataset (a preliminary version of what later became YouTube-8M).

Number of layers: 25 | Parameter count: 72,141,184 | Trained size: 289 MB

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["VGGish Feature Extractor Trained on YouTube Data"]
Out[1]=

Basic usage

Extract semantic features from an Audio object:

In[2]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/6d9aeea4-9091-40a6-b5a1-6c1697017878"]
Out[2]=
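The input cell above is only available as a cloud object, so the following is a minimal sketch of the same idea rather than the original example. It assumes that any Audio object is an acceptable input and uses the first built-in example recording; the variable names net, audio and features are illustrative.

```wolfram
(* Load the pre-trained net *)
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Assumption: the first entry of ExampleData["Audio"] stands in for the original example input *)
audio = ExampleData[First[ExampleData["Audio"]]];

(* Apply the net: the result is a sequence of feature vectors, one per audio frame *)
features = net[audio];

(* The second dimension is the embedding size; the first depends on the audio duration *)
Dimensions[features]
```

Longer recordings should produce more vectors, since each vector summarizes roughly one second of audio.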

The extracted features are a sequence of 128-dimensional vectors of unsigned 8-bit integers. Visualize the relationship between the sounds using the network as a feature extractor:

In[3]:=
FeatureSpacePlot[
 Callout[ExampleData[#], #[[2]]] & /@ ExampleData["Audio"], 
 FeatureExtractor -> 
  NetModel["VGGish Feature Extractor Trained on YouTube Data"], 
 LabelingFunction -> None]
Out[3]=

The raw output of the net is passed through a PCA transformation and a quantization step before it is returned. To obtain the raw output, remove the NetDecoder:

In[4]:=
rawFeaturesNet = 
 NetReplacePart[
  NetModel["VGGish Feature Extractor Trained on YouTube Data"], 
  "Output" -> None]
Out[4]=

Extract the raw features:

In[5]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/9907bbc4-4870-4033-857f-9bf181f7eed6"]
Out[5]=
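As a rough way to see the effect of the PCA and quantization steps, the sketch below applies both the full net and the decoder-free net to the same signal and compares the value ranges. It is an illustration under assumptions, not the original example: the input is the first built-in example recording, and the names net, rawNet, decoded and raw are hypothetical.

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Same construction as rawFeaturesNet above: drop the decoder on the output port *)
rawNet = NetReplacePart[net, "Output" -> None];

(* Assumption: any Audio object can be used as input *)
audio = ExampleData[First[ExampleData["Audio"]]];

decoded = net[audio];   (* after PCA and quantization *)
raw = rawNet[audio];    (* direct output of the last layer *)

(* Compare shapes and value ranges; Normal guards against non-list array outputs *)
{Dimensions[decoded], MinMax[Flatten[Normal[decoded]]]}
{Dimensions[raw], MinMax[Flatten[Normal[raw]]]}
```

The decoded values are expected to stay within the unsigned 8-bit range 0 to 255, while the raw values are unconstrained real numbers.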

Visualize the relationship between the sounds using the net without the decoder as a feature extractor:

In[6]:=
FeatureSpacePlot[
 Callout[ExampleData[#], #[[2]]] & /@ ExampleData["Audio"], 
 FeatureExtractor -> rawFeaturesNet, LabelingFunction -> None]
Out[6]=

Net information

Inspect the number of parameters of all arrays in the net:

In[7]:=
NetInformation[
 NetModel["VGGish Feature Extractor Trained on YouTube Data"], 
 "ArraysElementCounts"]
Out[7]=

Obtain the total number of parameters:

In[8]:=
NetInformation[
 NetModel["VGGish Feature Extractor Trained on YouTube Data"], 
 "ArraysTotalElementCount"]
Out[8]=
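The total parameter count can be related to the trained size quoted at the top of the page. The sketch below assumes that every parameter is stored as a 32-bit float (4 bytes), which is an assumption and not something the page states; the names net and count are illustrative.

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Total number of numeric parameters across all arrays *)
count = NetInformation[net, "ArraysTotalElementCount"];

(* Assumption: 4 bytes per parameter (32-bit floats) *)
N[UnitConvert[Quantity[4 count, "Bytes"], "Megabytes"]]
```

With the parameter count listed above (72,141,184), this is 288,564,736 bytes, or about 288.6 MB, which is consistent with the stated trained size of 289 MB.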

Obtain the layer type counts:

In[9]:=
NetInformation[
 NetModel["VGGish Feature Extractor Trained on YouTube Data"], 
 "LayerTypeCounts"]
Out[9]=

Display the summary graphic:

In[10]:=
NetInformation[
 NetModel["VGGish Feature Extractor Trained on YouTube Data"][[
  "Net"]], "SummaryGraphic"]
Out[10]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[11]:=
jsonPath = 
 Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], 
  NetModel["VGGish Feature Extractor Trained on YouTube Data"], 
  "MXNet"]
Out[11]=

Export also creates a net.params file containing the parameters:

In[12]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[12]=

Get the size of the parameter file:

In[13]:=
FileByteCount[paramPath]
Out[13]=

The size is similar to the byte count of the resource object:

In[14]:=
ResourceObject[
  "VGGish Feature Extractor Trained on YouTube Data"]["ByteCount"]
Out[14]=
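To make "similar" concrete, the two byte counts can be compared as a ratio. This is a small sketch that assumes paramPath from the export step is still defined in the current session; fileBytes and resourceBytes are illustrative names.

```wolfram
(* Assumes paramPath was defined by the export step above *)
fileBytes = FileByteCount[paramPath];

(* Byte count reported by the resource object *)
resourceBytes = 
  ResourceObject["VGGish Feature Extractor Trained on YouTube Data"]["ByteCount"];

(* A value close to 1 means the two sizes are nearly the same *)
N[fileBytes/resourceBytes]
```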

Represent the MXNet net as a graph:

In[15]:=
Import[jsonPath, {"MXNet", "NodeGraphPlot"}]
Out[15]=

Requirements

Wolfram Language 12.0 (April 2019) or above
