CREPE Pitch Detection Net Trained on Monophonic Signal Data
Released in 2018, CREPE is a state-of-the-art pitch estimation system based on a deep convolutional neural network that operates directly on the time-domain waveform. The architecture consists of a chain of six convolutional blocks followed by a classifier. The net outputs a vector giving the probability of the pitch falling into each of 360 nonlinearly spaced frequency classes (logarithmic in frequency, 20 cents apart).
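The 360-class frequency grid can be sketched numerically. A minimal NumPy illustration follows; the bin-origin constant (cents relative to a 10 Hz reference) is taken from the open-source CREPE implementation and should be treated as an assumption here:

```python
import numpy as np

# 360 pitch classes, 20 cents apart, logarithmically spaced in frequency,
# covering roughly 32 Hz to 2 kHz. The origin constant (cents relative to
# a 10 Hz reference) follows the open-source CREPE code (assumption).
CENTS_PER_BIN = 20
F_REF = 10.0  # Hz
bin_cents = 1997.3794084376191 + CENTS_PER_BIN * np.arange(360)
bin_hz = F_REF * 2.0 ** (bin_cents / 1200)

print(len(bin_hz), bin_hz[0], bin_hz[-1])
```

Because the spacing is constant in cents, consecutive bin centers differ by a fixed frequency ratio of 2^(20/1200).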
Number of layers: 41 | Parameter count: 22,244,328 | Trained size: 89 MB
Examples
Resource retrieval
Get the pre-trained net:
Evaluation function
Define a Hidden Markov process that will be used for decoding the output of the net:
This net takes a monophonic audio signal and outputs an estimate of the pitch of the signal on a logarithmic pitch scale. Write an evaluation function to convert the result to a TimeSeries containing the predicted frequency in Hz and the confidence of the prediction:
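The conversion itself can be sketched in NumPy, following the scheme described in the CREPE paper: the confidence is the maximum activation, and the frequency comes from a local weighted average of cents around the peak. The bin-center constant is taken from the open-source CREPE implementation and is an assumption here:

```python
import numpy as np

# Bin centers in cents relative to 10 Hz (constant from the open-source
# CREPE implementation; treated as an assumption in this sketch).
CENTS_PER_BIN = 20
cents_mapping = 1997.3794084376191 + CENTS_PER_BIN * np.arange(360)

def activation_to_pitch(act):
    """act: (n_frames, 360) activations in [0, 1].
    Returns per-frame frequency in Hz and prediction confidence."""
    conf = act.max(axis=1)                     # confidence = peak activation
    freq = np.empty(len(act))
    for t, frame in enumerate(act):
        i = frame.argmax()
        lo, hi = max(0, i - 4), min(360, i + 5)
        w = frame[lo:hi]                       # local weighted average
        cents = (w * cents_mapping[lo:hi]).sum() / w.sum()
        freq[t] = 10 * 2 ** (cents / 1200)     # cents -> Hz
    return freq, conf
```

For a one-hot activation the weighted average reduces to the corresponding bin center.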
Basic usage
Detect the pitch of a monophonic signal:
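The notebook's Wolfram Language cell is not reproduced here. As a self-contained stand-in (not the CREPE net), this sketch detects the pitch of a synthetic monophonic signal with plain autocorrelation, illustrating the task's input and output:

```python
import numpy as np

# Simple autocorrelation pitch detector on a synthetic monophonic signal.
sr = 16000                                   # CREPE also expects 16 kHz audio
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 220.0 * t)       # 220 Hz sine (A3)

frame = signal[:1024]
ac = np.correlate(frame, frame, mode="full")[1023:]  # nonnegative lags
lag_min = sr // 2000                         # ignore pitches above 2 kHz
lag_max = sr // 50                           # ...and below 50 Hz
lag = lag_min + ac[lag_min:lag_max].argmax() # lag of the autocorrelation peak
estimate = sr / lag
print(estimate)
```

The peak lag corresponds to the period of the waveform, so the estimate lands close to the true 220 Hz.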
Plot the predicted frequency with the confidence mapped to the opacity:
Performance evaluation
Generate a signal using a sinusoidal oscillator:
Compare the frequency predicted by the net with the ground truth:
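Such comparisons are typically reported in cents (1/100 of a semitone). A small NumPy sketch of the metric follows; with the 20-cent class spacing, the nearest class center is at most 10 cents from any in-range ground-truth pitch. The bin constant again follows the open-source CREPE implementation (assumption):

```python
import numpy as np

def cents_error(f_pred, f_true):
    """Signed pitch error in cents (1/100 of a semitone)."""
    return 1200 * np.log2(f_pred / f_true)

# Class centers in Hz (origin constant from the open-source CREPE code).
bin_centers = 10 * 2 ** ((1997.3794084376191 + 20 * np.arange(360)) / 1200)

f_true = 440.0                                  # ground-truth sine frequency
errors = cents_error(bin_centers, f_true)
f_pred = bin_centers[np.abs(errors).argmin()]   # best single-class prediction
print(abs(cents_error(f_pred, f_true)))
```

This quantization error (at most half the 20-cent bin width) is why the evaluation function above averages over neighboring bins rather than taking the argmax alone.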
Net information
Inspect the number of parameters of all arrays in the net:
Obtain the total number of parameters:
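The per-array counts sum to the 22,244,328 parameters quoted above. As a hedged aside on where such numbers come from, a 1-D convolution layer contributes weights plus biases; the layer shape below is illustrative, not read from the net:

```python
# Parameters of a 1-D convolution layer: weights plus biases, i.e.
# out_channels * in_channels * kernel_size + out_channels.
# (The shape below is an illustrative example, not a value from the net.)
def conv1d_params(in_ch, out_ch, kernel):
    return out_ch * in_ch * kernel + out_ch

first = conv1d_params(1, 1024, 512)   # a wide first layer on mono audio
print(first)
```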
Obtain the layer type counts:
Display the summary graphic for the main net:
Export to MXNet
Export the net into a format that can be opened in MXNet:
Export also creates a net.params file containing parameters:
Get the size of the parameter file:
The size is similar to the byte count of the resource object:
Represent the MXNet net as a graph:
Resource History
Reference