Wolfram Neural Net Repository
Immediate Computable Access to Neural Net Models
Transcribe an English speech audio recording
Released in 2015, Baidu Research's Deep Speech 2 model converts speech to text end to end, from a normalized sound spectrogram to a sequence of characters. It consists of a few convolutional layers over both time and frequency, followed by gated recurrent unit (GRU) layers (modified with an additional batch normalization). At evaluation time, the decoder explores the space of possible output sequences using a beam search algorithm. The same architecture has also been shown to train successfully on Mandarin Chinese.
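To make the decoding step concrete, here is a minimal Python sketch of a beam search over per-frame character probabilities. It assumes a CTC-style output with a dedicated blank symbol (the training criterion described in the Deep Speech 2 paper) and uses no language model. The function name `ctc_beam_search`, the toy alphabet, and the beam width are hypothetical choices for illustration, not part of the repository model or its API.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def safe_log(p):
    return math.log(p) if p > 0 else NEG_INF


def ctc_beam_search(frames, alphabet, blank=0, beam_width=8):
    """Return the most probable collapsed string (illustrative sketch only).

    frames:   list of per-time-step probability lists over the alphabet
    alphabet: list of symbols; alphabet[blank] is the CTC blank (assumption)
    """
    # Each prefix maps to (log prob of paths ending in blank,
    #                      log prob of paths ending in a non-blank symbol).
    beams = {(): (0.0, NEG_INF)}
    for frame in frames:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for s, p in enumerate(frame):
                lp = safe_log(p)
                if s == blank:
                    # A blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                    continue
                ext = prefix + (s,)
                b, nb = nxt[ext]
                if prefix and prefix[-1] == s:
                    # A repeated symbol extends the prefix only after a blank...
                    nxt[ext] = (b, logsumexp(nb, p_b + lp))
                    # ...otherwise it collapses into the same prefix.
                    b0, nb0 = nxt[prefix]
                    nxt[prefix] = (b0, logsumexp(nb0, p_nb + lp))
                else:
                    nxt[ext] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        # Prune to the beam_width most probable prefixes.
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    best = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]
    return "".join(alphabet[i] for i in best)


# Toy example: index 0 is the blank; frames peak at a, a, blank, b.
alphabet = ["_", "a", "b"]
frames = [
    [0.05, 0.90, 0.05],
    [0.05, 0.90, 0.05],
    [0.90, 0.05, 0.05],
    [0.05, 0.05, 0.90],
]
print(ctc_beam_search(frames, alphabet))
```

A larger beam width explores more candidate transcriptions at a higher computational cost; a beam width of 1 tracks only the single most probable prefix at each step.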
Number of layers: 42 | Parameter count: 22,244,328 | Trained size: 89 MB
Wolfram Language 11.3 (March 2018) or above