Resource retrieval
Get the pre-trained net:
NetModel parameters
This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:
Pick a non-default net by specifying the parameters:
Pick a non-default uninitialized net:
Get the labels:
Evaluation function
Write an evaluation function to combine the encoder and decoder nets into a full transcription pipeline:
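The notebook's evaluation function is Wolfram Language code (not shown here). As a language-neutral sketch of the same control flow, the following Python version uses stub nets: encode once, then decode greedily until an end token. The stub functions, vocabulary size and token ids are hypothetical placeholders, not the real model.

```python
import numpy as np

EOS = 3  # hypothetical end-of-string token id

def encode(audio: np.ndarray) -> np.ndarray:
    """Stub audio encoder: in the real pipeline this runs the transformer
    encoder over the log-Mel spectrogram, producing a feature matrix."""
    return np.tile(audio.mean(), (4, 2))  # fake (time, features) matrix

def decode_step(tokens: list[int], features: np.ndarray) -> np.ndarray:
    """Stub decoder: returns fake logits over a 4-token vocabulary,
    emitting EOS once the sequence reaches four tokens."""
    logits = np.zeros(4)
    if len(tokens) >= 4:
        logits[EOS] = 1.0
    else:
        logits[(len(tokens) % 2) + 1] = 1.0
    return logits

def transcribe(audio: np.ndarray, prompt: list[int], max_steps: int = 16) -> list[int]:
    """Greedy transcription loop: the encoder runs a single time, then the
    decoder is evaluated autoregressively until EOS or the step limit."""
    features = encode(audio)
    tokens = list(prompt)
    for _ in range(max_steps):
        next_token = int(np.argmax(decode_step(tokens, features)))
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens

print(transcribe(np.ones(8), prompt=[0]))  # → [0, 2, 1, 2]
```

The point of the structure is that the expensive encoder pass happens once per audio clip, while only the lightweight decoder step repeats.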
Basic usage
Transcribe speech in English:
Transcribe speech in Spanish:
Whisper can detect the language of the audio sample automatically, but the "Language" option can be used to specify it in advance. Transcribe speech in Japanese:
Set the option "IncludeTimestamps" to True to include the timestamps of the beginning and end of the transcribed speech:
Feature extraction
Get a set of audio samples with human speech and background noise:
Define a feature extraction using the Whisper encoder:
Visualize the feature space embedding performed by the audio encoder. Notice that the audio samples from the same class are clustered together:
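Conceptually, the feature space plot amounts to pooling each sample's (time × channels) encoder output into a single vector and projecting the vectors to 2D. A minimal NumPy sketch of that idea, using PCA via SVD on synthetic stand-in "features" (the data is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def pool(features: np.ndarray) -> np.ndarray:
    """Mean-pool a (time, channels) encoder output into one embedding."""
    return features.mean(axis=0)

# Synthetic stand-ins for encoder outputs of two classes (speech vs. noise):
# each sample is a (time=50, channels=16) matrix around a class-specific mean.
speech = [rng.normal(0.0, 1.0, (50, 16)) for _ in range(10)]
noise = [rng.normal(3.0, 1.0, (50, 16)) for _ in range(10)]

embeddings = np.stack([pool(f) for f in speech + noise])  # (20, 16)

# PCA via SVD: project the centered embeddings onto the top-2 components.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T  # (20, 2) points one would scatter-plot

print(coords.shape)  # samples of the same class land close together
```

With well-separated classes, the first principal component captures the between-class direction, which is why same-class samples cluster in the plot.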
Get a set of English and German audio samples:
Visualize the feature space embedding. Notice that the audio samples are clustered together by language:
Language identification
Whisper can transcribe speech in nearly one hundred different languages. Retrieve the list of available languages from the label set:
Obtain a collection of audio samples featuring speakers of different languages:
Define a function to detect the language of the audio sample. Whisper determines the language by selecting the most likely language token after the initial pass of the decoder (the following code needs definitions from the "Evaluation function" section):
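The detection logic reduces to restricting the decoder's first-pass logits to the language tokens and taking the argmax. A Python sketch of that step (the token ids, vocabulary size and logit values below are hypothetical placeholders for illustration):

```python
import numpy as np

# Hypothetical mapping from language-token ids to language names;
# the real Whisper vocabulary has one such token per supported language.
LANG_TOKENS = {50259: "English", 50262: "Spanish", 50266: "Japanese"}

def detect_language(first_pass_logits: np.ndarray) -> str:
    """Mask the decoder's first-pass logits down to the language tokens
    only, then return the name of the most likely language."""
    ids = np.array(sorted(LANG_TOKENS))
    best = ids[np.argmax(first_pass_logits[ids])]
    return LANG_TOKENS[int(best)]

# Fake logits over a (hypothetical) vocabulary of 51000 tokens:
logits = np.full(51000, -10.0)
logits[50262] = 2.5   # Spanish token scores highest
logits[50259] = 1.0
print(detect_language(logits))  # → Spanish
```

Masking matters: without it, an ordinary text token could outscore every language token and the argmax would not name a language at all.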
Detect the languages:
Transcribe the audio samples:
Transcription generation
The transcription pipeline makes use of two separate transformer nets, an encoder and a decoder:
The input audio is first preprocessed into a log-Mel spectrogram that captures the signal's frequency content over time; the encoder then processes this spectrogram:
Get an input audio sample and compute its log-Mel spectrogram:
Visualize the log-Mel spectrogram and the audio waveform:
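The log-Mel computation itself is a short recipe: windowed STFT, power spectrum, projection onto a mel filterbank, then a clamped log. A NumPy sketch follows; the frame settings (n_fft = 400, hop = 160, 80 mel bands at 16 kHz) match Whisper's published defaults, but the HTK-style filterbank here is an approximation of the exact filters the model uses.

```python
import numpy as np

def mel_filterbank(n_mels: int, n_fft: int, sr: int) -> np.ndarray:
    """Triangular mel filters using the HTK mel formula (an approximation;
    Whisper's own preprocessing uses a slightly different filterbank)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):   # rising slope of triangle i
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):  # falling slope of triangle i
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Windowed STFT -> power -> mel projection -> clamped log10."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # (frames, bins)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T    # (frames, n_mels)
    log_mel = np.log10(np.maximum(mel, 1e-10))
    return np.maximum(log_mel, log_mel.max() - 8.0)      # clamp dynamic range

one_second = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
print(log_mel_spectrogram(one_second).shape)  # → (98, 80)
```

One second of 16 kHz audio yields 98 frames of 80 mel bands; the final clamp limits the dynamic range to 8 orders of magnitude below the peak.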
The encoder processes the input once, producing a feature matrix of size 1500×1280:
The decoding step involves running the decoder multiple times autoregressively, with each evaluation producing a single subword token of the transcription. The decoder receives several inputs:
• The port "Input1" takes the subword token generated by the previous evaluation of the decoder.
• The port "Index" takes an integer keeping count of how many times the decoder was evaluated (positional encoding).
• The port "Input2" takes the encoded features produced by the encoder. The data fed to this input is the same for every evaluation of the decoder.
• The ports "State1", "State2"... take the self-attention key and value arrays for all the past tokens. Their size grows by one at each evaluation. The decoder has four attention blocks, which makes for eight states: four key arrays and four value arrays.
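The state ports implement a key/value cache: each decoder evaluation appends one row to every cached key and value array, so attention over past tokens is never recomputed. A minimal NumPy sketch of one cached self-attention step (single head, toy dimensions, random placeholder projection matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy model width
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention_step(x, k_cache, v_cache):
    """One autoregressive self-attention step with a KV cache.
    x: (d,) embedding of the newest token.
    k_cache, v_cache: (t, d) keys/values for all past tokens.
    Returns the attended output plus the caches grown by one row each."""
    k_cache = np.vstack([k_cache, (Wk @ x)[None, :]])
    v_cache = np.vstack([v_cache, (Wv @ x)[None, :]])
    scores = (k_cache @ (Wq @ x)) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over all cached keys
    return weights @ v_cache, k_cache, v_cache

k, v = np.zeros((0, d)), np.zeros((0, d))
for step in range(3):                        # three decoding iterations
    out, k, v = attention_step(rng.normal(size=d), k, v)
print(k.shape, v.shape)                      # caches grew one row per step
```

This is exactly the growth pattern described above: after three evaluations each state array holds three rows, one per generated token.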
The initial prompt is a sequence of context tokens that guides Whisper's decoding process by specifying the task to perform and the audio's language. These tokens can be hard-coded to explicitly control the output or left flexible, allowing the model to automatically detect the language and task. Define the initial prompt for transcribing audio in Spanish:
Retrieve the integer codes of the prompt tokens:
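The prompt has a fixed shape: a start-of-transcript token, a language token, a task token (transcribe vs. translate) and, optionally, a no-timestamps token. As a Python illustration of assembling it (the integer ids below are hypothetical placeholders, not the model's actual codes):

```python
# Hypothetical token ids for illustration only.
SOT, TRANSCRIBE, NO_TIMESTAMPS = 50258, 50359, 50363
LANG = {"en": 50259, "es": 50262, "ja": 50266}

def initial_prompt(language: str, timestamps: bool = False) -> list[int]:
    """Build the decoder's context prompt: <|startoftranscript|>,
    a language token, the task token and optionally <|notimestamps|>."""
    prompt = [SOT, LANG[language], TRANSCRIBE]
    if not timestamps:
        prompt.append(NO_TIMESTAMPS)
    return prompt

print(initial_prompt("es"))  # → [50258, 50262, 50359, 50363]
```

Hard-coding the language token forces Spanish output; omitting it and letting the decoder choose the token is what enables automatic language detection.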
Before starting the decoding process, initialize the decoder's inputs:
Use the decoder iteratively to transcribe the audio. The iteration continues until the EndOfString token is generated or the maximum number of iterations is reached:
Display the generated tokens:
Obtain a readable representation of the tokens by joining them and decoding the resulting bytes as UTF-8:
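Because the vocabulary is byte-level, a single token may hold only part of a multi-byte character, so the bytes of all tokens must be concatenated before decoding. A small Python illustration (the split into byte strings is a constructed example):

```python
# A multi-byte character can straddle tokens: the Japanese string "日本"
# is six UTF-8 bytes, here split unevenly across three byte-level tokens.
token_bytes = [b"\xe6\x97", b"\xa5\xe6\x9c", b"\xac"]

# Decoding any single token in isolation would fail, since each holds an
# incomplete UTF-8 sequence; joining first yields valid text.
text = b"".join(token_bytes).decode("utf-8")
print(text)  # → 日本
```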
Net information
Inspect the number of parameters of all arrays in the net:
Obtain the total number of parameters:
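Counting parameters reduces to summing the element counts of every weight array in the net. A NumPy sketch over a toy dictionary of arrays (names and shapes are arbitrary placeholders):

```python
import numpy as np

# Toy stand-in for the net's arrays: name -> weight array.
arrays = {
    "encoder/conv1/W": np.zeros((3, 80, 128)),
    "encoder/conv1/b": np.zeros(128),
    "decoder/embed/W": np.zeros((1000, 128)),
}

counts = {name: a.size for name, a in arrays.items()}
total = sum(counts.values())
print(total)  # → 30720 + 128 + 128000 = 158848
```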
Obtain the layer type counts:
Display the summary graphic: