Wolfram AudioIdentify V1 Trained on AudioSet Data

Identify sounds in an audio signal

Released in 2019 by Wolfram Research, this net is part of the back end for the AudioIdentify function in Wolfram Language 12.0. It was designed to achieve a good balance between classification accuracy, size and evaluation speed.

Number of models: 2

Training Set Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data", "ParametersInformation"]
Out[2]=

Pick a non-default model by specifying the parameters:

In[3]:=
NetModel[{"Wolfram AudioIdentify V1 Trained on AudioSet Data", "Size" -> "Small"}]
Out[3]=

Pick a non-default untrained net:

In[4]:=
NetModel[{"Wolfram AudioIdentify V1 Trained on AudioSet Data", "Size" -> "Large"}, "UninitializedEvaluationNet"]
Out[4]=
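
To see how the members of the family differ, their parameter counts can be compared side by side. This is only a sketch: it assumes "Small" and "Large" are the two values of the "Size" parameter, as the examples above suggest, and the symbols name and sizes are introduced here purely for illustration.

```wolfram
(* name and sizes are illustrative symbols, not part of the original examples *)
name = "Wolfram AudioIdentify V1 Trained on AudioSet Data";
sizes = {"Small", "Large"};

(* Map each size to the total number of parameters of the corresponding net *)
AssociationMap[
 NetInformation[NetModel[{name, "Size" -> #}], "ArraysTotalElementCount"] &,
 sizes]
```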

Basic usage

Identify an Audio object:

In[5]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/40ffce92-738c-4efe-bdbb-ca86d32c3285"]
Out[5]=
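
The input cell above is fetched from the cloud, so its contents do not appear in the text. As a rough sketch of direct application, the net can be called on any Audio object. Here a synthetic 440 Hz tone stands in for a real recording, and the names net, tone and tonePred are assumptions made for this illustration only.

```wolfram
net = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"];

(* A two-second sine tone, used only as a stand-in for a real recording *)
tone = AudioGenerator[{"Sin", 440}, 2];

(* Applying the net to an Audio object returns the most likely sound class *)
tonePred = net[tone]
```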

The prediction is an Entity object, which can be queried:

In[6]:=
pred["Description"]
Out[6]=

Get a list of available properties of the predicted Entity:

In[7]:=
pred["Properties"]
Out[7]=

Obtain the probabilities of the ten most likely entities predicted by the net:

In[8]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/349b67b3-91ba-4652-8870-1c8c09ba47a8"]
Out[8]=
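
Ranked probabilities can also be requested through the output decoder. A minimal sketch, assuming the "Output" port of the net is a "Class" decoder (which the "Labels" extraction further down suggests) and using a synthetic tone as a placeholder input:

```wolfram
net = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"];
tone = AudioGenerator[{"Sin", 440}, 2]; (* placeholder input, not a real recording *)

(* Ten most likely classes, returned as a list of class -> probability rules *)
net[tone, {"TopProbabilities", 10}]
```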

The probabilities do not sum to 1 because the net was trained as a collection of independent binary classifiers, one per class. This reflects the possibility of having multiple sound classes in a single recording.
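
The difference between independent binary classifiers and a single mutually exclusive classifier can be illustrated on a toy score vector. The scores below are made up, and the sketch assumes logistic sigmoid outputs for the binary case. It only shows that per-class sigmoid outputs need not sum to 1, whereas softmax outputs always do.

```wolfram
scores = {2.0, 0.5, -1.0}; (* hypothetical per-class scores *)

(* Independent binary classifiers: one logistic sigmoid per class *)
independent = LogisticSigmoid[scores];

(* Mutually exclusive classes: softmax normalizes across classes *)
exclusive = Exp[scores]/Total[Exp[scores]];

(* Compare the two totals *)
{Total[independent], Total[exclusive]}
```

The first total is unconstrained (for these scores it exceeds 1), while the second equals 1 up to rounding.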

The network was trained on the AudioSet dataset, where each audio signal is annotated with the sound classes or sources present in the recording. The labels are organized in an ontology of 632 classes spanning a wide range of sound types and sources, from musical instruments and music types to animal, mechanical and human sounds. Obtain the list of names of all available classes:

In[9]:=
EntityValue[
 NetExtract[
   NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], "Output"][["Labels"]], "Name"]
Out[9]=
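
A quick check on the label set is to count the classes the decoder exposes. This sketch reuses the same extraction as above; the symbols net and labels are illustrative. The count may differ from the size of the full AudioSet ontology if the net covers only a subset of its classes.

```wolfram
net = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"];

(* The "Labels" of the output decoder hold one entity per class *)
labels = NetExtract[net, "Output"][["Labels"]];
Length[labels]
```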

Feature extraction

The core of the network takes a fixed-size chunk of the mel-spectrogram of the input signal and is mapped over overlapping chunks using NetMapOperator. Extract the core net:

In[10]:=
coreNet = NetExtract[
  NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], {1, "Net"}]
Out[10]=
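
The idea of mapping over overlapping chunks can be pictured with Partition on a toy sequence. The window length 4 and offset 2 below are arbitrary illustration values, not the sizes used by the net.

```wolfram
frames = Range[10]; (* stand-in for a sequence of spectrogram frames *)

(* Windows of 4 frames, each shifted by 2 frames, so neighbors share 2 frames *)
Partition[frames, 4, 2]
(* {{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}, {7, 8, 9, 10}} *)
```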

Chop off the last few layers in charge of the classification:

In[11]:=
singleFrameFeatureExtractor = NetDrop[coreNet, -3]
Out[11]=

This net takes a single chunk of the input signal and outputs a tensor of semantically meaningful features. Reconstruct the whole variable-length net using NetMapOperator to compute the features on each chunk and AggregationLayer to aggregate them over the time dimension:

In[12]:=
extractor = NetChain[{NetMapOperator[singleFrameFeatureExtractor], AggregationLayer[Max, 1], FlattenLayer[]},
  "Input" -> NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"][["Input"]]]
Out[12]=
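
The role of AggregationLayer[Max, 1] can be seen on a small array. In this sketch each row plays the part of the feature vector of one chunk; the numbers are invented and chunkFeatures is a hypothetical name.

```wolfram
(* Three chunks (rows), two features each *)
chunkFeatures = {{1., 5.}, {3., 2.}, {0., 4.}};

(* Maximum over the first (time) dimension, leaving one value per feature *)
AggregationLayer[Max, 1][chunkFeatures]
```

The result holds the per-feature maxima, 3 and 5, and has the same size however many chunks the input contains, which is what lets the extractor handle variable-length recordings.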

Get a set of Audio objects:

In[13]:=
audios = Flatten[
   Thread[WebAudioSearch[#, "Samples", #Duration < 5 &, MaxItems -> 20] -> #] & /@ {"cow", "bird", "cat"}];

Visualize the features of a set of recordings:

In[14]:=
FeatureSpacePlot[audios, FeatureExtractor -> extractor]
Out[14]=

Transfer learning

Use the pre-trained model to build a classifier for telling apart recordings of cows and birds. Create a test set and a training set:

In[15]:=
SeedRandom[42]; {trainSet, testSet} = TakeDrop[RandomSample[
   Select[audios, MatchQ[#[[2]], "cow" | "bird"] &]], 30];

Remove the classification layers from the pre-trained net:

In[16]:=
featuresNet = NetChain[{
   NetMapOperator[
    NetDrop[NetExtract[
      NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], {1, "Net"}], -3]],
   AggregationLayer[Max, 1],
   FlattenLayer[]},
  "Input" -> NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"][["Input"]]]
Out[16]=

Create a classifier net using a simple LinearLayer:

In[17]:=
classifier = NetChain[{LinearLayer[2], SoftmaxLayer[]}, "Output" -> NetDecoder[{"Class", {"bird", "cow"}}]]
Out[17]=

Precompute the result of the feature net to avoid redundant evaluations. This is equivalent to freezing all the weights except for those in the new classifier net:

In[18]:=
trainSet[[All, 1]] = featuresNet[trainSet[[All, 1]]];

Train on the dataset (use TargetDevice -> "GPU" for training on a GPU):

In[19]:=
trainedNet = NetTrain[classifier, trainSet]
Out[19]=

Perfect accuracy is obtained on the test set:

In[20]:=
ClassifierMeasurements[
 NetJoin[featuresNet, trainedNet], testSet, "Report"]
Out[20]=
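
Individual measurements can be requested in place of the full report. A brief sketch, continuing the session above and assuming featuresNet, trainedNet and testSet are defined as in the preceding cells; fullNet is a name introduced here.

```wolfram
(* Combine the feature extractor with the trained classifier *)
fullNet = NetJoin[featuresNet, trainedNet];

(* Fraction of test recordings classified correctly *)
ClassifierMeasurements[fullNet, testSet, "Accuracy"]

(* Plot of predicted versus actual classes *)
ClassifierMeasurements[fullNet, testSet, "ConfusionMatrixPlot"]
```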

Net information

Inspect the number of parameters of all arrays in the net:

In[21]:=
NetInformation[
 NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], "ArraysElementCounts"]
Out[21]=

Obtain the total number of parameters:

In[22]:=
NetInformation[
 NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], "ArraysTotalElementCount"]
Out[22]=

Obtain the layer type counts:

In[23]:=
NetInformation[
 NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], "LayerTypeCounts"]
Out[23]=

Display the summary graphic:

In[24]:=
NetInformation[
 NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], "SummaryGraphic"]
Out[24]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[25]:=
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], "MXNet"]
Out[25]=

Export also creates a net.params file, in the same directory, containing the parameters:

In[26]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[26]=

Get the size of the parameter file:

In[27]:=
FileByteCount[paramPath]
Out[27]=
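
The byte count is easier to read in megabytes, and it can be compared with a rough estimate derived from the parameter count. The estimate assumes 4 bytes per parameter (32-bit floats) and ignores any file overhead, so the two numbers are only expected to be close, not identical. The symbol estimate is introduced here for illustration; paramPath comes from the cell above.

```wolfram
(* Size of the exported parameter file in megabytes *)
N[UnitConvert[Quantity[FileByteCount[paramPath], "Bytes"], "Megabytes"]]

(* Rough estimate: 4 bytes per parameter, assuming 32-bit floats *)
estimate = 4*NetInformation[
    NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"],
    "ArraysTotalElementCount"];
N[UnitConvert[Quantity[estimate, "Bytes"], "Megabytes"]]
```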

Requirements

Wolfram Language 12.0 (April 2019) or above
