Wolfram AudioIdentify V1 Trained on AudioSet Data

Identify sounds in an audio signal

This model is also available through the built-in function AudioIdentify.

Released in 2019 by Wolfram Research, this net is part of the back end for the AudioIdentify function in Wolfram Language 12.0. It was designed to achieve a good balance between classification accuracy, size and evaluation speed.

Number of models: 2

Training Set Information

AudioSet, consisting of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology covers a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=

Out[1]=
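As a minimal sketch, retrieval amounts to a single NetModel call with the resource name used elsewhere on this page. The variable name net is an illustrative assumption:

```wolfram
(* Download (on first use) and load the default pre-trained net *)
net = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"]
```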

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=

Out[2]=
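One way to inspect the parameters is the "ParametersInformation" property of NetModel. This is a sketch of the query only; the parameter names and values it reports are not reproduced here:

```wolfram
(* List the parameters that distinguish the individual nets in this family *)
NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data", "ParametersInformation"]
```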

Pick a non-default model by specifying the parameters:

In[3]:=

Out[3]=

Pick a non-default untrained net:

In[4]:=

Out[4]=

Basic usage

Identify an Audio object:

In[5]:=

Out[5]=
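A minimal sketch of this step, assuming the built-in example recording ExampleData[{"Audio", "Bird"}] as input (any Audio object should work in its place). The names net, audio and pred are illustrative:

```wolfram
net = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"];

(* Assumed sample input; replace with any Audio object *)
audio = ExampleData[{"Audio", "Bird"}];

(* Applying the net to an Audio object returns the most likely class *)
pred = net[audio]
```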

The prediction is an Entity object, which can be queried:

In[6]:=

Out[6]=

Get a list of available properties of the predicted Entity:

In[7]:=

Out[7]=
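The two queries above can be sketched as follows. The input recording is an assumption; "Name" is the property this page also uses to list the class names:

```wolfram
(* Assumed sample input; replace with any Audio object *)
pred = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"][
   ExampleData[{"Audio", "Bird"}]];

(* Query a single property of the predicted Entity *)
EntityValue[pred, "Name"]

(* List all properties available for the predicted Entity *)
pred["Properties"]
```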

Obtain the probabilities of the ten most likely entities predicted by the net:

In[8]:=

Out[8]=
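As a sketch, the ten most likely classes can be requested through the "TopProbabilities" property of the class decoder. The input recording is again an assumption:

```wolfram
net = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"];

(* Assumed sample input; replace with any Audio object *)
audio = ExampleData[{"Audio", "Bird"}];

(* Returns a list of rules entity -> probability, most likely first *)
net[audio, {"TopProbabilities", 10}]
```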

The probabilities do not sum to 1 because the net was trained as a collection of independent binary classifiers, one per class. This reflects the possibility that a single recording contains several sound classes.
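The difference can be illustrated without the net itself. The sketch below uses three made-up scores (an assumption, not output of the model) and compares independent logistic outputs with a softmax over the same scores:

```wolfram
(* Hypothetical raw scores for three classes *)
logits = {2.0, 1.0, -1.0};

(* Independent binary classifiers: each score is squashed on its own *)
independent = LogisticSigmoid[logits];
Total[independent]  (* about 1.88, not 1 *)

(* Mutually exclusive classes: softmax normalizes across classes *)
exclusive = SoftmaxLayer[][logits];
Total[exclusive]  (* 1, up to rounding *)
```

With independent outputs, several classes can have a probability close to 1 at the same time, which suits recordings that contain more than one sound source.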

The network was trained on the AudioSet dataset, where each audio signal is annotated with the sound classes or sources present in the recording. The labels are organized in an ontology of 632 classes that span a very wide domain of sound types and sources, from musical instruments and music genres to animal, mechanical and human sounds. Obtain the list of names of all available classes:

In[9]:=

EntityValue[
  NetExtract[NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], "Output"][["Labels"]],
  "Name"
]

Out[9]=

Feature extraction

The core of the network operates on a fixed-size chunk of the mel-spectrogram of the input signal; NetMapOperator maps it over overlapping chunks of the full signal. Extract the core net:

In[10]:=

Out[10]=
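A sketch of the extraction, using the same position spec {1, "Net"} that appears in the transfer learning code further down this page. The name singleFrameNet is an illustrative assumption:

```wolfram
(* Pull the per-chunk net out of the NetMapOperator that wraps it *)
singleFrameNet = NetExtract[
  NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"],
  {1, "Net"}
]
```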

Chop off the last few layers in charge of the classification:

In[11]:=

Out[11]=
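As a sketch, the truncation can be written with NetDrop. The count of three layers follows the transfer learning code further down this page, and the name singleFrameFeatureExtractor matches the one used in the next cell:

```wolfram
(* Drop the last three layers, which perform the classification *)
singleFrameFeatureExtractor = NetDrop[
  NetExtract[
    NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"],
    {1, "Net"}],
  -3
]
```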

This net takes a single chunk of the input signal and outputs a tensor of semantically meaningful features. Reconstruct the whole variable-length net using NetMapOperator to compute the features on each chunk and AggregationLayer to aggregate them over the time dimension:

In[12]:=

extractor = NetChain[
  {NetMapOperator[singleFrameFeatureExtractor], AggregationLayer[Max, 1], FlattenLayer[]},
  "Input" -> NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"][["Input"]]
]

Out[12]=

Get a set of Audio objects:

In[13]:=

Visualize the features of a set of recordings:

In[14]:=

Out[14]=
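One possible way to carry out these two steps, assuming extractor is defined as in In[12]. The choice of recordings (the first few built-in audio examples) and the name audios are assumptions; any list of Audio objects could be substituted:

```wolfram
(* Assumed data: up to eight of the built-in example recordings *)
audios = ExampleData /@ Take[ExampleData["Audio"], UpTo[8]];

(* Project the extracted features to 2D and plot each recording *)
FeatureSpacePlot[audios, FeatureExtractor -> extractor]
```

Recordings with similar content should land close together in the plot, since the features come from the layers just before the classifier.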

Transfer learning

Use the pre-trained model to build a classifier for telling apart recordings of cows and birds. Create a test set and a training set:

In[15]:=

Remove the classification layers from the pre-trained net:

In[16]:=

featuresNet = NetChain[
  {NetMapOperator[
    NetDrop[
      NetExtract[NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], {1, "Net"}],
      -3]],
   AggregationLayer[Max, 1],
   FlattenLayer[]},
  "Input" -> NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"][["Input"]]
]

Out[16]=

Create a classifier net using a simple LinearLayer:

In[17]:=

Out[17]=
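A minimal sketch of such a classifier, assuming two class labels named "bird" and "cow" (the label strings and the name classifierNet are illustrative). The input size is left unspecified so that it can be inferred from the feature vectors at training time:

```wolfram
(* One linear layer mapping features to two class scores, then softmax *)
classifierNet = NetChain[
  {LinearLayer[2], SoftmaxLayer[]},
  "Output" -> NetDecoder[{"Class", {"bird", "cow"}}]  (* assumed labels *)
]
```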

Precompute the result of the feature net to avoid redundant evaluations. This is equivalent to freezing all the weights except for those in the new classifier net:

In[18]:=