Classify the ESC-50 Audio Dataset

Use transfer learning to retrain an audio classifier

Download the ESC-50 dataset, a labeled collection of 2000 environmental audio recordings:

In[1]:=

archive=URLDownload["https://github.com/karoldvl/ESC-50/archive/master.zip"];
dir=FileNameJoin[{$TemporaryDirectory,"ESC50"}];
ExtractArchive[archive,dir];

Import the metadata. The files are five-second-long recordings organized into 50 semantic classes:

In[2]:=

metaData=MapAssociationThread[rawMetadata〚1〛MapAt[File@FileNameJoin[{dir,"ESC-50-master","audio",#}]&,#,1]]&,Rest

raw metadata

;

Inspect a sample from the metadata:

In[3]:=

RandomChoice[metaData]//Dataset

Out[3]=

filename	File[/private/var/folders/x8/kdxwl4_955ndjlyd62s54gd00000gn/T/ESC50/ESC-50-master/audio/3-253081-A-2.wav]
fold	
target	
category	pig
esc10	False
src_file	253081
take	

Divide the dataset into training and testing subsets:

In[4]:=

{train,test}=TakeDrop[RandomSample[#filename->#category&/@metaData],1600];
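
As a quick illustration of the shuffle-then-split pattern, here is a sketch on a toy dataset. The ten entries, the names `toy`, `toyTrain` and `toyTest`, and the 8/2 split are all assumptions made for the example:

```wolfram
(* Toy stand-in for the metadata: 10 hypothetical clips with alternating labels *)
toy = Table[
   <|"filename" -> "clip" <> ToString[i] <> ".wav",
     "category" -> If[EvenQ[i], "dog", "rain"]|>, {i, 10}];

(* Turn each association into an input -> label rule, shuffle, then split 8/2 *)
SeedRandom[1];
{toyTrain, toyTest} = TakeDrop[RandomSample[#filename -> #category & /@ toy], 8];

Length /@ {toyTrain, toyTest}  (* {8, 2} *)

(* Check how the labels are distributed in the training part *)
Counts[Values[toyTrain]]
```

Because the split is random, the per-class counts in each part can differ from run to run unless the seed is fixed.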

Take a look at the available classes:

In[5]:=

classes=DeleteDuplicates[train〚All,2〛];classes//Short

Out[5]//Short=

{crying_baby,insects,clock_alarm,sea_waves,«42»,thunderstorm,glass_breaking,door_wood_creaks,laughing}

Start with the original AudioIdentify network:

In[6]:=

net=NetModel[{"Wolfram AudioIdentify V1 Trained on AudioSet Data","Size"->"Small"}];

Construct a feature extractor net by removing the final classifier layers and adding aggregation and flattening layers:

In[7]:=

mainNet=NetExtract[net,{1,"Net"}];
featureExtractor=NetChain[{NetMapOperator[NetDrop[mainNet,-3]],AggregationLayer[Max,1],FlattenLayer[]},"Input"->net〚"Input"〛]

Out[7]=

NetChain



Input port:	expression
Output port:	vector (size: 1280)
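
To see what the aggregation step contributes, the sketch below applies `AggregationLayer[Max,1]` to a tiny hand-written matrix. The two rows stand in for feature vectors computed on two consecutive audio chunks. The numbers are invented, and the real features have a different shape:

```wolfram
(* Two hypothetical 3-dimensional feature vectors, one per audio chunk *)
chunkFeatures = {{1., 5., 2.}, {4., 0., 3.}};

(* Aggregating over level 1 takes the maximum across chunks for each feature,
   so a variable-length sequence collapses into one fixed-size vector *)
AggregationLayer[Max, 1][chunkFeatures]
```

Pooling over the chunk dimension is what gives the extractor a fixed-size output regardless of the length of the input recording.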

Construct a simple classifier network with one hidden layer that will be attached to the feature extractor:

In[8]:=

classifier=NetChain[{DropoutLayer[.3],1024,Ramp,50,SoftmaxLayer[]},"Output"->NetDecoder[{"Class",classes}]]

Out[8]=

NetChain

uninitialized

Input port:	array
Output port:	class
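
The class decoder attached to the output can be tried in isolation. In this sketch the three labels and the probability vector are made up for illustration. The decoder maps a vector of class probabilities to the most likely label:

```wolfram
(* Hypothetical three-class decoder; the real one uses the 50 ESC-50 classes *)
dec = NetDecoder[{"Class", {"dog", "rain", "pig"}}];

(* The highest probability is in position 2, so the decoded class is "rain" *)
dec[{0.1, 0.7, 0.2}]  (* "rain" *)
```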



Instead of retraining the full net and specifying a LearningRateMultipliers option in NetTrain to train only the classification layers, you can precompute the results of the feature extractor net and train the classifier. This avoids redundant evaluation of the full net:

In[9]:=

train〚All,1〛=featureExtractor[train〚All,1〛];

Train the classifier network using NetTrain:

In[10]:=

trainedClassifier=NetTrain[classifier,train,ValidationSet->Scaled[.05],MaxTrainingRounds->60]

Out[10]=

NetChain



Input port:	vector (size: 1280)
Output port:	class

Join the feature extractor network and the trained classifier using NetJoin:

In[11]:=

finalNet=NetJoin[featureExtractor,trainedClassifier]

Out[11]=

NetChain



Input port:	expression
Output port:	class

Using ClassifierMeasurements, compute the accuracy on the test data and plot the confusion matrix of the worst four classes:

In[12]:=

cm=ClassifierMeasurements[finalNet,test];cm["Accuracy"]

Out[12]=

0.915
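
Accuracy here is the fraction of test examples whose predicted class matches the true class. The sketch below computes it by hand on four invented predictions (the lists `predicted` and `actual` are hypothetical and unrelated to the result above):

```wolfram
(* Hypothetical predictions and ground-truth labels for four clips *)
predicted = {"dog", "rain", "dog", "pig"};
actual    = {"dog", "rain", "pig", "pig"};

(* Three of the four predictions match, so the accuracy is 3/4 *)
N[Count[MapThread[SameQ, {predicted, actual}], True]/Length[actual]]  (* 0.75 *)
```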

In[13]:=

cm["ConfusionMatrixPlot"->4]

Out[13]=

Publisher Information

Contributed by: Wolfram Staff

Wolfram Language Example Repository
