VGGish Feature Extractor Trained on YouTube Data
Released by Google in 2017, this model extracts 128-dimensional embeddings from audio segments approximately one second long. The model was trained on a large YouTube dataset (a preliminary version of what later became YouTube-8M).
Number of layers: 25
Parameter count: 72,141,184
Trained size: 289 MB
Resource retrieval
Get the pre-trained net:
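The input cell for this step is not preserved here. A minimal sketch, assuming the resource is registered under the name given in this page's title:

```wolfram
(* Assumption: the NetModel resource name matches this page's title *)
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"]
```

The first evaluation downloads the trained parameters, so it may take a while on a slow connection.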
Basic usage
Extract semantic features from an Audio object:
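A sketch of what this step might look like. The synthetic white-noise clip is an assumption used only to keep the example self-contained; any Audio object should work in its place:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Assumption: three seconds of white noise stands in for a real recording *)
audio = AudioGenerator["White", 3];

(* Apply the net to the Audio object and inspect the shape of the result *)
features = net[audio];
Dimensions[features]
```

According to the description on this page, the last dimension should be 128, with one vector per roughly one-second segment.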
The extracted features are a sequence of 128-dimensional vectors of unsigned 8-bit integers. Visualize the relationship between the sounds using the network as a feature extractor:
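One possible way to produce such a plot, assuming the net can be passed directly as the FeatureExtractor option of FeatureSpacePlot. The synthetic sounds below are placeholders, not the collection used on the original page:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Assumption: synthetic clips of equal duration stand in for a real audio collection *)
sounds = Join[
   Table[AudioGenerator["White", 2], 4],
   Table[AudioGenerator[{"Sin", f}, 2], {f, {220, 440, 880, 1760}}]
   ];

(* Place the sounds in a 2D plot according to the features computed by the net *)
FeatureSpacePlot[sounds, FeatureExtractor -> net]
```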
The raw output of the network is passed through a PCA transformation and a quantization step, both of which are performed by the NetDecoder. To obtain the raw output of the net, remove the NetDecoder:
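A minimal sketch of removing the decoder, assuming it is attached to an output port named "Output" (the usual name for single-output nets):

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Assumption: the decoder sits on the port named "Output" *)
rawNet = NetReplacePart[net, "Output" -> None]
```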
Extract the raw features:
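A sketch of this step under the same assumptions as before: the decoder is on the "Output" port, and a synthetic clip stands in for real audio:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];
rawNet = NetReplacePart[net, "Output" -> None];

(* Assumption: three seconds of white noise stands in for a real recording *)
rawFeatures = rawNet[AudioGenerator["White", 3]];
Dimensions[rawFeatures]
```

Without the decoder, the values are the real-valued activations of the net rather than quantized 8-bit integers.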
Visualize the relationship between the sounds using the net without its decoder as a feature extractor:
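A possible version of this plot, again assuming that a net can be given directly as the FeatureExtractor and using placeholder sounds:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];
rawNet = NetReplacePart[net, "Output" -> None];

(* Assumption: synthetic clips of equal duration stand in for a real audio collection *)
sounds = Join[
   Table[AudioGenerator["White", 2], 4],
   Table[AudioGenerator[{"Sin", f}, 2], {f, {220, 440, 880, 1760}}]
   ];

(* Same plot as with the full net, but driven by the undecoded features *)
FeatureSpacePlot[sounds, FeatureExtractor -> rawNet]
```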
Net information
Inspect the number of parameters of all arrays in the net:
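A sketch using the Information function, which is assumed here to be the intended way to query net properties in the Wolfram Language version listed under the requirements:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Association from each array's position in the net to its element count *)
counts = Information[net, "ArraysElementCounts"]
```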
Obtain the total number of parameters:
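A minimal sketch, under the same assumption that Information is used to query the net:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Total number of elements across all arrays of the net *)
total = Information[net, "ArraysTotalElementCount"]
```

The result should agree with the parameter count of 72,141,184 listed at the top of this page.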
Obtain the layer type counts:
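A sketch of this query, again assuming Information is the function used:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Association from each layer type to the number of times it occurs *)
layerCounts = Information[net, "LayerTypeCounts"]
```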
Display the summary graphic:
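A sketch of this step, assuming the "SummaryGraphic" property of Information:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Graphic summarizing the layers of the net and how they are connected *)
Information[net, "SummaryGraphic"]
```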
Export to MXNet
Export the net into a format that can be opened in MXNet:
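A minimal sketch of the export. Writing to the temporary directory under the file name net.json is an assumption chosen for illustration:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];

(* Assumption: write the symbol file to the temporary directory as net.json *)
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], net, "MXNet"]
```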
Export also creates a net.params file containing parameters:
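A sketch for locating the parameter file, assuming it is written next to the exported net.json in the same directory:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], net, "MXNet"];

(* Assumption: net.params is created in the same directory as net.json *)
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}];
FileExistsQ[paramPath]
```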
Get the size of the parameter file:
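A sketch of the size check, assuming the same export location as in the hypothetical paths used here:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], net, "MXNet"];
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}];

(* Size of the parameter file in bytes *)
paramBytes = FileByteCount[paramPath]
```

If the parameters are stored as 32-bit floats (an assumption), 72,141,184 parameters would occupy 288,564,736 bytes, which is in line with the 289 MB trained size listed above.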
The size is similar to the byte count of the resource object:
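A sketch of the comparison, assuming the resource object exposes a "ByteCount" property and is registered under this page's title:

```wolfram
(* Assumption: the resource name matches this page's title *)
resourceBytes = ResourceObject["VGGish Feature Extractor Trained on YouTube Data"]["ByteCount"]
```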
Represent the MXNet net as a graph:
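A possible way to draw the graph, assuming the "NodeGraphPlot" import element of the "MXNet" format and the hypothetical export path used here:

```wolfram
net = NetModel["VGGish Feature Extractor Trained on YouTube Data"];
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], net, "MXNet"];

(* Read the exported symbol file back in as a plot of its operation graph *)
Import[jsonPath, {"MXNet", "NodeGraphPlot"}]
```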
Wolfram Language (April 2019) or above
S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, K. Wilson, "CNN Architectures for Large-Scale Audio Classification," arXiv:1609.09430 (2017)
- Available from:
Apache 2.0 License