HiddenPitchDetectRaw

Transcribe an English speech audio recording

Released in 2015, Baidu Research's Deep Speech 2 model converts speech to text end to end, mapping a normalized sound spectrogram directly to a sequence of characters. It consists of a few convolutional layers over both time and frequency, followed by gated recurrent unit (GRU) layers (modified with an additional batch normalization). At evaluation time, the decoder explores the space of possible output sequences using a beam search algorithm. The same architecture has also been shown to train successfully on Mandarin Chinese.
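The beam search step can be illustrated with a minimal sketch. It assumes the network emits, for each time frame, log-probabilities over characters plus a CTC blank symbol (Deep Speech 2 is trained with a CTC loss), and it uses no language model. The function names, the toy alphabet, and the toy frame probabilities are hypothetical and chosen for illustration only; the decoder shipped with this model may differ.

```python
import math
from collections import defaultdict

NEG_INF = -float("inf")


def logsumexp(*values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    if all(v == NEG_INF for v in values):
        return NEG_INF
    v_max = max(values)
    return v_max + math.log(sum(math.exp(v - v_max) for v in values))


def ctc_beam_search(log_probs, alphabet, beam_size=4, blank=0):
    """CTC prefix beam search without a language model (illustrative sketch).

    log_probs: list of frames, each a list of log-probabilities per symbol.
    alphabet:  list of symbols; alphabet[blank] is the CTC blank.
    """
    # Each prefix tracks two scores: log-probability of all alignments
    # ending in blank, and of all alignments ending in a non-blank symbol.
    beam = [(tuple(), (0.0, NEG_INF))]
    for frame in log_probs:
        next_beam = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beam:
            last = prefix[-1] if prefix else None
            for s, p in enumerate(frame):
                if s == blank:
                    # A blank keeps the prefix unchanged.
                    n_b, n_nb = next_beam[prefix]
                    next_beam[prefix] = (logsumexp(n_b, p_b + p, p_nb + p), n_nb)
                    continue
                extended = prefix + (s,)
                n_b, n_nb = next_beam[extended]
                if s != last:
                    n_nb = logsumexp(n_nb, p_b + p, p_nb + p)
                else:
                    # A repeated symbol only extends the prefix after a blank.
                    n_nb = logsumexp(n_nb, p_b + p)
                next_beam[extended] = (n_b, n_nb)
                if s == last:
                    # Without a blank in between, the repeat collapses.
                    n_b, n_nb = next_beam[prefix]
                    next_beam[prefix] = (n_b, logsumexp(n_nb, p_nb + p))
        # Keep only the beam_size most probable prefixes.
        beam = sorted(
            next_beam.items(), key=lambda item: logsumexp(*item[1]), reverse=True
        )[:beam_size]
    best_prefix, best_scores = beam[0]
    return "".join(alphabet[i] for i in best_prefix), logsumexp(*best_scores)


# Toy input: three symbols (blank, "a", "b") over four frames.
alphabet = ["_", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]
log_probs = [[math.log(p) for p in frame] for frame in frames]
text, score = ctc_beam_search(log_probs, alphabet)
print(text, score)
```

The toy frames are built so that the single most likely alignment is a, a, blank, b. Tracking blank-ending and non-blank-ending scores separately is what lets the search merge all alignments that collapse to the same character sequence.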

Number of layers: 42 | Parameter count: 22,244,328 | Trained size: 89 MB
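The trained size is consistent with the parameter count if each parameter is stored as a 32-bit float. That storage format is an assumption, not something stated here; the short check below only shows the arithmetic.

```python
PARAMETER_COUNT = 22_244_328
BYTES_PER_PARAMETER = 4  # assumption: 32-bit floating-point weights

size_bytes = PARAMETER_COUNT * BYTES_PER_PARAMETER
size_mb = size_bytes / 1e6  # decimal megabytes

print(f"{size_bytes:,} bytes, about {size_mb:.1f} MB")
```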

Training Set Information

An unreleased dataset of 11,940 hours of labeled speech assembled from publicly available datasets and Baidu's internal data.

Requirements

Wolfram Language 11.3 (March 2018) or above

Resource History

Date Created: 9 May 2018

Reference

D. Amodei et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin", arXiv:1512.02595
Available from: https://github.com/PaddlePaddle/DeepSpeech
Rights: Apache 2.0 License