Wolfram AudioIdentify V1 Trained on AudioSet Data

Identify sounds in an audio signal

This model is also available through the built-in function AudioIdentify.

Released in 2019 by Wolfram Research, this net is part of the back end for the AudioIdentify function in Wolfram Language 12.0. It was designed to achieve a good balance between classification accuracy, size and evaluation speed.

Number of models: 2

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data", "ParametersInformation"]
Out[2]=

Pick a non-default model by specifying the parameters:

In[3]:=
NetModel[{"Wolfram AudioIdentify V1 Trained on AudioSet Data", "Size" -> "Small"}]
Out[3]=

Pick a non-default untrained net:

In[4]:=
NetModel[{"Wolfram AudioIdentify V1 Trained on AudioSet Data", "Size" -> "Large"}, "UninitializedEvaluationNet"]
Out[4]=
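
Before choosing a member of the family, it can help to compare their sizes. The following is a minimal sketch, assuming that "Small" and "Large" (the two values used above) are the only values of the "Size" parameter; the variable names are illustrative and not part of the original example.

```wolfram
(* Assumption: "Small" and "Large" are the available values of "Size" *)
modelName = "Wolfram AudioIdentify V1 Trained on AudioSet Data";
sizes = {"Small", "Large"};

(* Total number of parameters for each member of the family *)
parameterCounts = AssociationMap[
  NetInformation[NetModel[{modelName, "Size" -> #}], "ArraysTotalElementCount"] &,
  sizes]
```

The result is an association from each size to its parameter count, which gives a rough sense of the trade-off between accuracy, size and speed mentioned in the description.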

Basic usage

Identify an Audio object:

In[5]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/81f4e557-692d-4506-af1f-faadb466b688"]
Out[5]=
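
The example input above is retrieved from the cloud. An equivalent evaluation on a locally available recording might look like the sketch below. The choice of recording is an assumption made for illustration (any Audio object should work), and the names audio, net and localPrediction are hypothetical.

```wolfram
(* Assumption: this sample recording is available through ExampleData *)
audio = ExampleData[{"Audio", "Apollo11SmallStep"}];

(* Load the default pre-trained net and apply it directly to the Audio object *)
net = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"];
localPrediction = net[audio]
```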

The prediction is an Entity object, which can be queried:

In[6]:=
pred["Description"]
Out[6]=

Get a list of available properties of the predicted Entity:

In[7]:=
pred["Properties"]
Out[7]=

Obtain the probabilities of the ten most likely entities predicted by the net:

In[8]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/4671df0a-fc73-4236-b556-78488e72ca9b"]
Out[8]=
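
For a recording that is available locally, the most likely classes can be requested through the class decoder attached to the net's output. This is a sketch under the assumption that the sample recording below is available through ExampleData; the variable names are illustrative.

```wolfram
net = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"];

(* Assumption: any Audio object can be used in place of this sample *)
audio = ExampleData[{"Audio", "Apollo11SmallStep"}];

(* Rules of the form entity -> probability for the ten most likely classes *)
topTen = net[audio, {"TopProbabilities", 10}]
```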

The probabilities do not sum to 1 because the net was trained as a collection of independent binary classifiers, one per class. This reflects the fact that a single recording can contain multiple sound classes.
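
The difference between independent binary classifiers and a single multi-class classifier can be illustrated with a few made-up scores. The sketch below assumes that each binary classifier ends in a logistic sigmoid, which is a common choice but is not confirmed by this page; the scores are hypothetical.

```wolfram
(* Hypothetical raw scores (logits) for three sound classes *)
logits = {2.0, 1.5, -1.0};

(* Independent binary classifiers: one sigmoid per class *)
independentProbs = LogisticSigmoid[logits];

(* Mutually exclusive classes: a softmax over all classes *)
softmaxProbs = Exp[logits]/Total[Exp[logits]];

(* Compare the two totals *)
{Total[independentProbs], Total[softmaxProbs]}
```

The sigmoid outputs add up to more than 1 here because two classes are judged likely at the same time, whereas the softmax outputs are forced to sum to 1.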

The network was trained on the AudioSet dataset, in which each audio signal is annotated with the sound classes or sources present in the recording. The labels are organized in an ontology of 632 classes spanning a very wide range of sound types and sources, from musical instruments and music types to animal, mechanical and human sounds.

Obtain the list of names of all available classes:

In[9]:=
EntityValue[
 NetExtract[
   NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], "Output"][["Labels"]], "Name"]
Out[9]=

Feature extraction

The core of the network takes a fixed-size chunk of the mel-spectrogram of the input signal. The full network maps this core over overlapping chunks using NetMapOperator. Extract the core net:

In[10]:=
coreNet = NetExtract[
  NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], {1, "Net"}]
Out[10]=
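
The idea of overlapping fixed-size chunks can be sketched with plain lists. The chunk length and hop used here are hypothetical and chosen only for readability; the values used by the actual net are not stated on this page.

```wolfram
(* Stand-in for the frames of a mel-spectrogram: 10 frames, labeled 1 to 10 *)
frames = Range[10];

(* Hypothetical chunk length and hop between consecutive chunks *)
chunkLength = 4;
hop = 2;

(* Each chunk shares chunkLength - hop frames with its neighbor *)
chunks = Partition[frames, chunkLength, hop]
(* {{1, 2, 3, 4}, {3, 4, 5, 6}, {5, 6, 7, 8}, {7, 8, 9, 10}} *)
```

A longer signal simply yields more chunks of the same size, which is why the core net can have a fixed input size while the full net accepts variable-length audio.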

Chop off the last few layers in charge of the classification:

In[11]:=
singleFrameFeatureExtractor = NetDrop[coreNet, -3]
Out[11]=

This net takes a single chunk of the input signal and outputs a tensor of semantically meaningful features. Reconstruct the whole variable-length net using NetMapOperator to compute the features on each chunk and AggregationLayer to aggregate them over the time dimension:

In[12]:=
extractor = NetChain[
  {NetMapOperator[singleFrameFeatureExtractor], AggregationLayer[Max, 1], FlattenLayer[]},
  "Input" -> NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"][["Input"]]]
Out[12]=
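
The effect of aggregating with Max over the time dimension can be reproduced on a small matrix. The numbers below are hypothetical per-chunk features, not outputs of the net.

```wolfram
(* Hypothetical per-chunk features: 3 chunks (rows), 2 features (columns) *)
chunkFeatures = {{0.1, 0.9}, {0.4, 0.2}, {0.3, 0.5}};

(* Maximum over the chunk (time) dimension, one value per feature *)
pooled = Max /@ Transpose[chunkFeatures]
(* {0.4, 0.9} *)
```

The pooled vector has the same length no matter how many chunks the recording produced, so recordings of different durations map to feature vectors of the same size.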

Get a set of Audio objects:

In[13]:=
audios = Flatten[
   Thread[WebAudioSearch[#, "Samples", #Duration < 5 &, MaxItems -> 20] -> #] & /@
    {"cow", "bird", "cat"}];

Visualize the features of a set of recordings:

In[14]:=
FeatureSpacePlot[audios, FeatureExtractor -> extractor]
Out[14]=

Transfer learning

Use the pre-trained model to build a classifier for telling apart recordings of cows and birds. Create a test set and a training set:

In[15]:=
SeedRandom[42];
{trainSet, testSet} = TakeDrop[
   RandomSample[Select[audios, MatchQ[#[[2]], "cow" | "bird"] &]], 30];

Remove the classification layers from the pre-trained net:

In[16]:=
featuresNet = NetChain[
  {NetMapOperator[
    NetDrop[
     NetExtract[
      NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], {1, "Net"}], -3]],
   AggregationLayer[Max, 1], FlattenLayer[]},
  "Input" -> NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"][["Input"]]]
Out[16]=

Create a classifier net using a simple LinearLayer:

In[17]:=
classifier = NetChain[{LinearLayer[2], SoftmaxLayer[]}, "Output" -> NetDecoder[{"Class", {"bird", "cow"}}]]
Out[17]=

Precompute the result of the feature net to avoid redundant evaluations. This is equivalent to freezing all the weights except for those in the new classifier net:

In[18]:=
trainSet[[All, 1]] = featuresNet[trainSet[[All, 1]]];

Train on the dataset (use TargetDevice -> "GPU" for training on a GPU):

In[19]:=
trainedNet = NetTrain[classifier, trainSet]
Out[19]=

Perfect accuracy is obtained on the test set:

In[20]:=
ClassifierMeasurements[
 NetJoin[featuresNet, trainedNet], testSet, "Report"]
Out[20]=
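
To look at individual predictions rather than the aggregate report, the combined net can be applied to a single test recording. This sketch reuses featuresNet, trainedNet and testSet as defined above; the names fullNet, example and label are illustrative.

```wolfram
(* Combine the frozen feature extractor with the trained classifier *)
fullNet = NetJoin[featuresNet, trainedNet];

(* Split the first test rule into its recording and its true label *)
{example, label} = List @@ First[testSet];

(* Predicted class, to be compared with label *)
fullNet[example]

(* Probability assigned to each of the two classes *)
fullNet[example, "Probabilities"]
```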

Net information

Inspect the number of parameters of all arrays in the net:

In[21]:=
NetInformation[
 NetModel[
  "Wolfram AudioIdentify V1 Trained on AudioSet Data"], "ArraysElementCounts"]
Out[21]=

Obtain the total number of parameters:

In[22]:=
NetInformation[
 NetModel[
  "Wolfram AudioIdentify V1 Trained on AudioSet Data"], "ArraysTotalElementCount"]
Out[22]=

Obtain the layer type counts:

In[23]:=
NetInformation[
 NetModel[
  "Wolfram AudioIdentify V1 Trained on AudioSet Data"], "LayerTypeCounts"]
Out[23]=

Display the summary graphic:

In[24]:=
NetInformation[
 NetModel[
  "Wolfram AudioIdentify V1 Trained on AudioSet Data"], "SummaryGraphic"]
Out[24]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[25]:=
jsonPath = Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"], "MXNet"]
Out[25]=

Export also creates a net.params file containing parameters:

In[26]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[26]=

Get the size of the parameter file:

In[27]:=
FileByteCount[paramPath]
Out[27]=
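
The size of the parameter file can be related to the parameter count obtained earlier. The sketch below assumes that each parameter is stored as a 32-bit float, which is not stated on this page, and reuses paramPath from the export step; the other names are illustrative.

```wolfram
net = NetModel["Wolfram AudioIdentify V1 Trained on AudioSet Data"];

(* Assumption: each parameter occupies 4 bytes (32-bit float) *)
estimatedBytes = 4*NetInformation[net, "ArraysTotalElementCount"];

(* Measured size of the exported parameter file *)
actualBytes = FileByteCount[paramPath];

(* Ratio of measured to estimated size *)
N[actualBytes/estimatedBytes]
```

A ratio close to 1 would be consistent with the 4-byte assumption, with any small excess plausibly due to array names and file headers.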

Requirements

Wolfram Language 12.0 (April 2019) or above
