NetModel parameters
This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:
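One possible way is with NetModel's "ParametersInformation" property; the resource name below is illustrative and should be replaced with this model's exact name:

    name = "Whisper-V1 Multilingual Nets"; (* illustrative resource name *)
    NetModel[name, "ParametersInformation"]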
Pick a non-default net by specifying the parameters:
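For instance, assuming the family exposes the "Size" and "Part" parameters mentioned elsewhere on this page:

    NetModel[{name, "Size" -> "Medium", "Part" -> "TextDecoder"}]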
Pick a non-default uninitialized net:
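The "UninitializedEvaluationNet" property returns the architecture without its trained weights:

    NetModel[{name, "Size" -> "Medium"}, "UninitializedEvaluationNet"]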
Get the labels:
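A sketch of one way to do this, extracting the token labels from the class decoder attached to the text decoder's output (the exact port structure may differ):

    decoder = NetModel[{name, "Part" -> "TextDecoder"}];
    tokenLabels = NetExtract[decoder, {"Output", "Labels"}];
    Take[tokenLabels, 20]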
Feature extraction
Get a set of audio samples for background noise and speech:
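A minimal sketch using synthesized speech and generated noise; recorded clips from ExampleData could be used instead:

    speech = SpeechSynthesize /@ {"the quick brown fox", "jumped over the lazy dog", "how are you today"};
    noise = Table[AudioGenerator["White", 3], {3}];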
Visualize the feature space embedding performed by the audio encoder. Notice that the human speech samples and the background noise samples belong to different clusters:
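One possible visualization mean-pools the encoder output into a single vector per sample and feeds the result to FeatureSpacePlot (assuming the "Part" -> "AudioEncoder" parameter):

    encoder = NetModel[{name, "Part" -> "AudioEncoder"}];
    pooled = NetChain[{encoder, AggregationLayer[Mean, 1]}]; (* average over the time frames *)
    FeatureSpacePlot[Join[speech, noise], FeatureExtractor -> pooled]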
Advanced usage
Set the option "IncludeTimestamps" to True to add timestamps at the beginning and end of the audio sample:
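The options in this section belong to the transcription function defined in the full example notebook for this model; it is referred to here as netevaluate purely for illustration:

    audio = First[speech];
    netevaluate[audio, "IncludeTimestamps" -> True]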
Perform transcription with a different "Temperature":
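A nonzero temperature makes the decoder sample among likely tokens instead of always picking the most probable one (again using the hypothetical netevaluate helper):

    netevaluate[audio, "Temperature" -> 0.7]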
The option "SuppressSpecialTokens" removes non-speech tokens. Compare the transcription of the original audio sample with the sample after "SuppressSpecialTokens" is enabled:
Whisper can recognize actions or background sounds in an audio sample:
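For instance, transcribing a non-speech sample may yield annotations such as music or noise markers rather than words (hypothetical helper as above; results vary):

    netevaluate[First[noise]]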
Transcription
The transcription pipeline makes use of two separate transformer nets, encoder and decoder:
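A sketch of obtaining the two parts, assuming the "Part" parameter used above:

    encoder = NetModel[{name, "Part" -> "AudioEncoder"}];
    decoder = NetModel[{name, "Part" -> "TextDecoder"}];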
The encoder preprocesses input audio into a log-Mel spectrogram, capturing the signal's frequency content over time:
Get an input audio sample and compute its log-Mel spectrogram:
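One way is to apply the encoder's input NetEncoder, which performs the log-Mel preprocessing, directly to an audio object:

    audio = SpeechSynthesize["the quick brown fox jumped over the lazy dog"];
    spectrogram = NetExtract[encoder, "Input"][audio];
    Dimensions[spectrogram]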
Visualize the log-Mel spectrogram and the audio waveform:
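For instance, assuming the spectrogram rows correspond to time frames and the columns to Mel channels:

    MatrixPlot[Transpose[spectrogram]]
    AudioPlot[audio]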
The encoder processes the input once, producing a feature matrix of size 1500×512:
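For example:

    features = encoder[audio];
    Dimensions[features]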
The decoding step involves running the decoder multiple times recursively, with each iteration producing a subword token of the transcribed audio. The decoder receives several inputs:
• The port "Input1" takes the subword token generated by the previous evaluation of the decoder.
• The port "Index" takes an integer keeping count of how many times the decoder was evaluated (positional encoding).
• The port "Input2" takes the encoded features produced by the encoder. The data fed to this input is the same for every evaluation of the decoder.
• The ports "State1", "State2"... take the self-attention key and value arrays for all the past tokens. Their size grows by one at each evaluation. The default ("Size"->"Base") decoder has 12 attention blocks, which makes for 24 states: 12 key arrays and 12 value arrays.
Before starting the decoding process, initialize the decoder's inputs:
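A rough initialization sketch; the start-of-transcript token label and the empty-state representation are assumptions and should be checked against the net's actual ports (for instance with Information[decoder, "InputPorts"]):

    features = encoder[audio];
    startToken = "<|startoftranscript|>"; (* assumed label of the start token *)
    initInputs = Join[
      <|"Input1" -> startToken, "Index" -> 1, "Input2" -> features|>,
      Association @ Table["State" <> ToString[i] -> {}, {i, 24}] (* empty self-attention states; some nets may require empty arrays of a specific rank *)
    ];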
Use the decoder iteratively to transcribe the audio sample. The recursion keeps going until the EndOfString token is generated or the maximum number of iterations is reached:
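A rough sketch of such a loop; the output port names (the predicted-token port and the updated-state ports) are assumptions and may need adjusting for the actual net:

    inputs = initInputs;
    tokens = {};
    Do[
      out = decoder[inputs];
      next = out["Output"]; (* assumed name of the predicted-token port *)
      If[next === EndOfString || next === "<|endoftext|>", Break[]];
      AppendTo[tokens, next];
      inputs = Join[
        <|"Input1" -> next, "Index" -> i + 1, "Input2" -> features|>,
        KeySelect[out, StringContainsQ["State"]] (* carry the grown key/value states forward; rename if the output ports are not called "State1", "State2", ... *)
      ],
      {i, 224} (* illustrative cap on the number of iterations *)
    ];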
Display the generated tokens:
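With the loop above, the accumulated subword tokens are stored in tokens:

    tokens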
Obtain a readable representation of the tokens:
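Assuming GPT-2-style byte-level subwords in which "Ġ" marks a word boundary, a rough conversion is:

    StringReplace[StringJoin[tokens], "Ġ" -> " "]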
Net information
Inspect the number of parameters of all arrays in the net:
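Using Information on the default net (illustrative resource name as before):

    net = NetModel[name];
    Information[net, "ArraysElementCounts"]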
Obtain the total number of parameters:
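The "ArraysTotalElementCount" property gives this directly:

    Information[net, "ArraysTotalElementCount"]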
Obtain the layer type counts:
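"LayerTypeCounts" returns an association from layer types to their counts:

    Information[net, "LayerTypeCounts"]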
Display the summary graphic:
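For example:

    Information[net, "SummaryGraphic"]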