
CREPE Pitch Detection Net Trained on Monophonic Signal Data

Track the pitch of a monophonic signal

Released in 2018, CREPE is a state-of-the-art pitch tracking system based on a deep convolutional neural network that operates directly on the time-domain waveform. The architecture consists of a chain of six convolution stacks followed by a classifier. The net outputs a vector giving the probability that the pitch falls in each of 360 frequency classes, which are spaced 20 cents apart and are therefore nonlinearly spaced in Hz.

Number of layers: 41 | Parameter count: 22,244,328 | Trained size: 89 MB

Training Set Information

Performance

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["CREPE Pitch Detection Net Trained on Monophonic Signal \
Data"]
Out[1]=

Evaluation function

Define a hidden Markov process that will be used to decode the output of the net:

In[2]:=
HMP = Module[
   {starting, transition, emission, xx, yy, selfEmission}, 
   starting = ConstantArray[1./360, 360];
   yy = ConstantArray[Range[360], 360]; xx = Transpose@yy; 
   transition = Map[Max[#, 0.] &, 12 - Abs[xx - yy], {2}]; 
   transition = N@#/Total[#] & /@ transition; selfEmission = 0.1; 
   emission = 
    IdentityMatrix[360]*selfEmission + 
     ConstantArray[(1. - selfEmission)/360., {360, 360}]; 
   HiddenMarkovProcess[starting, transition, emission]
   ];
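
Written out, the three arrays built in the definition above are as follows. This is only a restatement of the code; the symbols \pi, T, E and the bin indices i, j (running from 1 to 360) are notation introduced here for illustration.

```latex
\begin{aligned}
\pi_i  &= \frac{1}{360}, \\
T_{ij} &= \frac{\max\bigl(12 - |i - j|,\ 0\bigr)}{\sum_{k=1}^{360} \max\bigl(12 - |i - k|,\ 0\bigr)}, \\
E_{ij} &= 0.1\,\delta_{ij} + \frac{0.9}{360}.
\end{aligned}
```

The transition weights fall off linearly with bin distance and vanish at a distance of 12, so the decoded pitch can move by at most 11 bins (220 cents) between consecutive frames. Each row of T and of E sums to 1.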

This net takes a monophonic audio signal and outputs an estimate of its pitch on a logarithmic scale. Write an evaluation function that converts the result to TimeSeries objects containing the predicted frequency in Hz and the confidence of the prediction:

In[3]:=
findPrediction["Interpolation", salience_, center_: None] /; 
  ArrayDepth[salience] == 2 := 
 Map[Function[in, findPrediction["Interpolation", in, center]], 
  salience]
In[4]:=
findPrediction["Interpolation", salience_, center_: None] :=
 Module[
  {c, a, endpoints, cents},
  If[center === None, c = First@Ordering[salience, -1], c = center];
  (* window of up to 9 bins centered on c, clipped to the valid bin range *)
  endpoints = {Max[1, c - 4], Min[Length@salience, c + 4]};
  a = Take[salience, endpoints];
  cents = Range[0, 7180, 20] + 1997.3794084376191;
  {10.*2^(Total[a*Take[cents, endpoints]]/Total[a] / 1200.), 
   salience[[c]]}
  ]
In[5]:=
findPrediction["Viterbi", salience_] := 
 MapThread[
  findPrediction["Interpolation", ##] &, {salience, 
   FindHiddenMarkovStates[First[Ordering[#, -1]] & /@ salience, HMP]}]
In[6]:=
netevaluation[a_?AudioQ, OptionsPattern[{"Decoder" -> "Viterbi"}]] := 
 Module[
  {res, times},
  res = NetModel[
     "CREPE Pitch Detection Net Trained on Monophonic Signal Data"][
    AudioPad[a, {0.032, 0.032}]];
  res = findPrediction[
    OptionValue["Decoder"] /. Except["Viterbi"] -> "Interpolation", 
    res];
  times = {Range[0., 
     QuantityMagnitude[Duration[a], "Seconds"], .01]};
  <|"Prediction" -> TimeSeries[res[[All, 1]], times], 
   "Confidence" -> TimeSeries[res[[All, 2]], times]|>
  ]
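
The conversion performed by the "Interpolation" decoder above can be summarized as a salience-weighted average in cents followed by a mapping to Hz. In this sketch, s_k is the net output for bin k, W is the window of bins selected by endpoints around the center bin c, and the symbols themselves are notation introduced here rather than names from the code.

```latex
\begin{aligned}
c_k     &= 1997.3794084376191 + 20\,(k - 1), \qquad k = 1, \dots, 360, \\
\bar{c} &= \frac{\sum_{k \in W} s_k\, c_k}{\sum_{k \in W} s_k}, \\
f       &= 10 \cdot 2^{\bar{c}/1200}\ \text{Hz}.
\end{aligned}
```

The reported confidence is the salience s_c at the center bin. With these constants the bin centers span roughly 32 Hz to 2 kHz.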

Basic usage

Detect the pitch of a monophonic signal:

In[7]:=
pred = netevaluation[ExampleData[{"Audio", "Cello"}]]
Out[7]=

Plot the predicted frequency with the confidence mapped to the opacity:

In[8]:=
ListLinePlot[pred["Prediction"], 
 ColorFunction -> 
  Function[{x, y}, Directive@Opacity@pred["Confidence"][x]], 
 ColorFunctionScaling -> False, PlotRange -> {100, 120}]
Out[8]=

Performance evaluation

Generate a test signal using a sinusoidal oscillator whose frequency sweeps between 200 Hz and 600 Hz:

In[9]:=
f[t_] := 400 + 200 Sin[2*Pi*t];
In[10]:=
a = AudioGenerator[{"Sin", f}]
Out[10]=

Compare the frequency predicted by the net with the ground truth:

In[11]:=
ListLinePlot[<|"Ground Truth" -> Table[{t, f[t]}, {t, 0, 1, .01}], 
  "Prediction" -> netevaluation[a]["Prediction"]|>]
Out[11]=
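
The visual comparison above can be complemented with a single number. The following is a minimal sketch, assuming f, a and netevaluation are defined as in the preceding cells; the names predSine, truth, centsError and meanAbsCents are illustrative and do not appear in the original resource.

```wolfram
(* Assumes f, a and netevaluation are defined as in the cells above *)
predSine = netevaluation[a]["Prediction"];

(* Ground-truth frequency sampled at the prediction time stamps *)
truth = f /@ predSine["Times"];

(* Signed error in cents (1200 cents per octave) *)
centsError = 1200 Log2[predSine["Values"]/truth];

(* Mean absolute error in cents over the whole signal *)
meanAbsCents = Mean[Abs[centsError]]
```

Measuring the error in cents rather than Hz matches the logarithmic pitch scale on which the net makes its predictions.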

Net information

Inspect the number of parameters of all arrays in the net:

In[12]:=
NetInformation[
 NetModel["CREPE Pitch Detection Net Trained on Monophonic Signal \
Data"], "ArraysElementCounts"]
Out[12]=

Obtain the total number of parameters:

In[13]:=
NetInformation[
 NetModel["CREPE Pitch Detection Net Trained on Monophonic Signal \
Data"], "ArraysTotalElementCount"]
Out[13]=

Obtain the layer type counts:

In[14]:=
NetInformation[
 NetModel["CREPE Pitch Detection Net Trained on Monophonic Signal \
Data"], "LayerTypeCounts"]
Out[14]=

Display the summary graphic for the main net:

In[15]:=
NetInformation[
 NetModel["CREPE Pitch Detection Net Trained on Monophonic Signal \
Data"][["Net"]], "SummaryGraphic"]
Out[15]=

Export to MXNet

Export the net into a format that can be opened in MXNet:

In[16]:=
jsonPath = 
 Export[FileNameJoin[{$TemporaryDirectory, "net.json"}], 
  NetModel["CREPE Pitch Detection Net Trained on Monophonic Signal \
Data"], "MXNet"]
Out[16]=

Export also creates a net.params file containing parameters:

In[17]:=
paramPath = FileNameJoin[{DirectoryName[jsonPath], "net.params"}]
Out[17]=

Get the size of the parameter file:

In[18]:=
FileByteCount[paramPath]
Out[18]=

The size is similar to the byte count of the resource object:

In[19]:=
ResourceObject[
  "CREPE Pitch Detection Net Trained on Monophonic Signal \
Data"]["ByteCount"]
Out[19]=

Represent the MXNet net as a graph:

In[20]:=
Import[jsonPath, {"MXNet", "NodeGraphPlot"}]
Out[20]=

Resource History

Reference