Classify Spoken Digits

Use the neural net framework, which provides powerful and user-friendly network training tools, to classify audio objects.

Retrieve the Spoken Digit Commands dataset from the Wolfram Data Repository:

In[1]:=

ro=ResourceObject["Spoken Digit Commands"]

Out[1]=

ResourceObject

Name: Spoken Digit Commands »
Type: DataResource
Description: A dataset consisting of recordings of spoken digits



The dataset consists of recordings of the digits from 0 to 9. It is essentially an audio equivalent of the MNIST digit dataset:

In[2]:=

trainingData = ResourceData[ro, "TrainingData"];
testingData = ResourceData[ro, "TestData"];
RandomSample[trainingData, 3] // Dataset

Out[2]=

You can start by deciding how a recording will be transformed into something that a neural network can use. The "AudioMFCC" net encoder is used, where the signal is split into overlapping partitions and some processing is applied to each to reduce the dimension while preserving information that is important for understanding the signal:

In[3]:=

encoder = NetEncoder[{"AudioMFCC", params}];
encoder[RandomChoice[trainingData]["Input"]] // MatrixPlot

Out[3]=
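To make the shape of the encoded signal concrete, the following minimal sketch applies an "AudioMFCC" encoder to a synthetic tone. The names demoEncoder, demoAudio and demoFeatures are hypothetical, and the option values are assumptions chosen only for illustration; they are not the parameters used in this example.

```wolfram
(* Assumed option values for illustration only; not the parameters used above *)
demoEncoder = NetEncoder[{"AudioMFCC", "NumberOfCoefficients" -> 13, "SampleRate" -> 16000}];

(* One second of a 440 Hz sine tone sampled at 16 kHz *)
demoAudio = Audio[Table[Sin[2 Pi 440. t], {t, 0., 1., 1/16000.}], SampleRate -> 16000];

(* Each row is one partition of the signal; each column is one cepstral coefficient *)
demoFeatures = demoEncoder[demoAudio];
Dimensions[demoFeatures]
```

The first dimension grows with the length of the recording, while the second is fixed by the number of coefficients, which is why the network that follows has to accept sequences of varying length.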

The network will be based on a simple NetChain of GatedRecurrentLayers. Since you are interested in a single classification, the recurrent layers are followed by a SequenceLastLayer and a linear classifier:

In[4]:=

rnn = NetChain[{
   GatedRecurrentLayer[32, "Dropout" -> {"VariationalInput" -> 0.3}],
   GatedRecurrentLayer[64, "Dropout" -> {"VariationalInput" -> 0.3}],
   SequenceLastLayer[],
   LinearLayer[64],
   Ramp,
   LinearLayer[],
   SoftmaxLayer[]},
  "Input" -> encoder,
  "Output" -> NetDecoder[{"Class", Range[0, 9]}]]

Out[4]=

NetChain



uninitialized

Input port:	audio mfcc
Output port:	class
Number of layers:	7
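As a small illustration of why a SequenceLastLayer turns a sequence into a single vector suitable for classification, the sketch below applies the layer to a hand-written three-step sequence. The names demoSequence and lastStep are hypothetical and the numbers are arbitrary.

```wolfram
(* A toy sequence of three steps, each a 2-dimensional vector *)
demoSequence = {{1., 2.}, {3., 4.}, {5., 6.}};

(* The layer discards every step except the final one, here {5., 6.} *)
lastStep = SequenceLastLayer[][demoSequence]
```

In the chain above, the same operation keeps only the final state of the second recurrent layer, which is then passed to the linear layers.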



You can train the net, letting NetTrain worry about all hyperparameters:

In[5]:=

resultObjectRNN = NetTrain[rnn, trainingData, All, ValidationSet -> Scaled[.05]]

Out[5]=

Compute the performance of the net using NetMeasurements:

In[6]:=

NetMeasurements[resultObjectRNN["TrainedNet"],testingData,{"Accuracy","ConfusionMatrixPlot"}]//Column

Out[6]=

By removing the last classification layers, you can obtain a feature extractor for audio signals:

In[7]:=

featureExtractor=NetTake[resultObjectRNN["TrainedNet"],{1,6}]

Out[7]=

NetChain



Input port:	audio mfcc
Output port:	vector (size: 10)
Number of layers:	6
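The effect of NetTake can be seen on a much smaller network. This is a minimal sketch with hypothetical names (toyNet, toyTrunk) and randomly initialized weights; it only shows how dropping the final softmax layer changes what the net returns and is not the trained net from this example.

```wolfram
(* A toy classifier on 2-dimensional inputs with three output classes *)
toyNet = NetInitialize[
   NetChain[{LinearLayer[4], Ramp, LinearLayer[3], SoftmaxLayer[]}, "Input" -> 2]];

(* Keep layers 1 through 3, dropping only the final SoftmaxLayer *)
toyTrunk = NetTake[toyNet, {1, 3}];

(* The truncated net returns the 3 raw scores rather than class probabilities *)
toyFeatures = toyTrunk[{1., 2.}];
Length[toyFeatures]
```

The truncated net produces one unnormalized score per class, which is consistent with the size-10 output vector reported for the feature extractor above.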



Use FeatureSpacePlot to visualize the test dataset embedded in a feature space defined by the net you trained:

In[8]:=

styleInputs[data_]:=With[{colors=AssociationMap[ColorData[97][#]&,Range[0,9]]},Callout[Style[#Input,colors[#Output]],#Output]&/@data]

In[9]:=

FeatureSpacePlot[styleInputs[testingData], FeatureExtractor -> featureExtractor, LabelingFunction -> None]

Out[9]=

Publisher Information

Contributed by: Wolfram Staff
