Transcribe an English speech audio recording

Released in 2017, Baidu Research's Deep Speech 2 model converts speech to text end to end, mapping a normalized sound spectrogram directly to a sequence of characters. It consists of a few convolutional layers over both time and frequency, followed by gated recurrent unit (GRU) layers (modified with an additional batch normalization). At evaluation time, the decoder explores the space of possible output sequences using a beam search algorithm. The same architecture has also been shown to train successfully on Mandarin Chinese.
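To make the decoding step more concrete, here is a minimal Python sketch of a beam search over per-frame character probabilities. It assumes the network output is decoded with CTC-style rules (index 0 is a blank symbol, and repeated characters merge unless a blank separates them), which this page does not state explicitly. The function name `ctc_beam_search`, the toy alphabet and the probabilities are hypothetical and are not part of the released model or of any Wolfram Language API.

```python
from collections import defaultdict


def ctc_beam_search(frames, symbols, beam_width=8):
    """Return (best_text, probability) for a list of per-frame distributions.

    frames:  list of probability lists, one list per time step.
    symbols: string whose index 0 stands for the blank symbol (assumption).
    """
    # Each prefix tracks two probabilities: paths ending in a blank,
    # and paths ending in the prefix's last character.
    beams = {"": (1.0, 0.0)}
    for frame in frames:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_char) in beams.items():
            p_total = p_blank + p_char
            for index, p in enumerate(frame):
                if index == 0:
                    # A blank leaves the prefix unchanged.
                    candidates[prefix][0] += p_total * p
                    continue
                char = symbols[index]
                if prefix.endswith(char):
                    # Same character again: it merges with the previous one
                    # unless a blank came in between.
                    candidates[prefix][1] += p_char * p
                    candidates[prefix + char][1] += p_blank * p
                else:
                    candidates[prefix + char][1] += p_total * p
        # Keep only the most probable prefixes.
        ranked = sorted(candidates.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(p) for prefix, p in ranked[:beam_width]}
    best, probs = max(beams.items(), key=lambda kv: sum(kv[1]))
    return best, sum(probs)


# Toy input: three frames over the symbols blank, "a", "b".
frames = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
text, prob = ctc_beam_search(frames, "_ab")
print(text, f"{prob:.3f}")  # aa 0.512
```

A wider beam keeps more candidate transcripts alive at each frame, at a higher computational cost. Production decoders usually also work in log space and may combine the acoustic scores with a language model, both of which are omitted here.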

Number of layers: 42 | Parameter count: 22,244,328 | Trained size: 89 MB
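As a quick sanity check, the trained size is consistent with the parameter count if each parameter is stored as a 32-bit float. The 4-byte width is an assumption here, since the storage format is not stated on this page.

```python
PARAMETER_COUNT = 22_244_328
BYTES_PER_PARAMETER = 4  # assumption: 32-bit floating-point weights

size_bytes = PARAMETER_COUNT * BYTES_PER_PARAMETER
size_mb = size_bytes / 1_000_000  # decimal megabytes

print(f"{size_bytes:,} bytes = {size_mb:.1f} MB")  # 88,977,312 bytes = 89.0 MB
```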

Training Set Information



Wolfram Language 11.3 (March 2018) or above

Resource History