Language identification
Whisper can transcribe and translate audio in 99 languages, with Whisper Large adding support for Cantonese. Retrieve the list of available languages from the label set:
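For reference, the labels are two-letter ISO 639-1 codes mapped to language names. A small illustrative subset can be sketched in Python; the mappings below are hand-copied assumptions based on the openai-whisper tokenizer's language table, not output retrieved from the net:

```python
# A small, hand-picked subset of Whisper's language labels
# (the full set has 99 entries; these pairs are assumed from the
# openai-whisper tokenizer's LANGUAGES table).
WHISPER_LANGUAGES = {
    "en": "english",
    "es": "spanish",
    "de": "german",
    "fr": "french",
    "zh": "chinese",
    "ja": "japanese",
}

def language_name(code):
    """Look up a language name by its ISO 639-1 code."""
    return WHISPER_LANGUAGES.get(code, "unknown")
```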
Obtain a collection of audio samples featuring speakers of different languages:
Define a function to detect the language of an audio sample. Whisper determines the language by selecting the most likely language token after a single initial pass of the decoder (the following code needs definitions from the "Evaluation function" section):
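The selection step amounts to a softmax over the logits of the language tokens only, followed by an argmax. A minimal numpy sketch of that step, with made-up logits standing in for the decoder's output:

```python
import numpy as np

def detect_language(logits, language_token_ids, id_to_language):
    """Pick the most likely language from one decoder pass.

    logits: 1-D array of vocabulary logits from the first decoder step.
    language_token_ids: vocabulary indices of the language tokens.
    id_to_language: maps each of those indices to a language code.
    """
    lang_logits = logits[language_token_ids]
    # Softmax restricted to the language tokens gives one probability
    # per language; everything else in the vocabulary is ignored.
    probs = np.exp(lang_logits - lang_logits.max())
    probs /= probs.sum()
    best = language_token_ids[int(np.argmax(probs))]
    return id_to_language[best], float(probs.max())
```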
Detect the languages:
Transcribe and translate the audio samples:
Transcription and Translation generation
The translation pipeline makes use of two separate transformer networks, an encoder and a decoder:
The encoder preprocesses the input audio into a log-Mel spectrogram, capturing the signal's frequency content over time:
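The spectrogram computation can be sketched from scratch in numpy with Whisper-style parameters (16 kHz audio, 400-sample window, 160-sample hop, 80 mel bins). This is a simplified stand-in for the actual preprocessing, which also pads the audio to 30 seconds and normalizes the result:

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=400, n_mels=80):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for j in range(l, c):
            fb[i, j] = (j - l) / (c - l)
        for j in range(c, r):
            fb[i, j] = (r - j) / (r - c)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Windowed power spectrum -> mel projection -> log scale."""
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log10(np.maximum(mel, 1e-10)).T  # shape: (n_mels, n_frames)
```

With a 10 ms hop, one second of audio yields about 100 mel frames, which is where the encoder's time resolution comes from.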
Get an input audio sample and compute its log-Mel spectrogram:
Visualize the log-Mel spectrogram and the audio waveform:
The encoder processes the input once, producing a feature matrix of size 1500×768:
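The shape follows from the preprocessing: 30 seconds of audio at 100 mel frames per second give 3000 frames, which a stride-2 convolution in the encoder halves to 1500 positions, each a 768-dimensional feature vector for the Small model. A quick arithmetic check:

```python
SECONDS = 30
FRAMES_PER_SECOND = 100   # 10 ms hop -> 100 mel frames per second
CONV_STRIDE = 2           # the encoder halves the time axis
EMBED_DIM = 768           # feature width of the Small model

positions = SECONDS * FRAMES_PER_SECOND // CONV_STRIDE
print(positions, EMBED_DIM)  # 1500 768
```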
The decoding step involves running the decoder multiple times recursively, with each iteration producing a subword token of the translated or transcribed audio. The decoder receives several inputs:
• The port "Input1" takes the subword token generated by the previous evaluation of the decoder.
• The port "Index" takes an integer keeping count of how many times the decoder was evaluated (positional encoding).
• The port "Input2" takes the encoded features produced by the encoder. The data fed to this input is the same for every evaluation of the decoder.
• The ports "State1", "State2"... take the self-attention key and value arrays for all the past tokens. Their size grows by one at each evaluation. The default ("Size" -> "Small") decoder has 12 attention blocks, which makes for 24 states: 12 key arrays and 12 value arrays.
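The way the state arrays grow can be sketched with a toy key/value cache in Python. The shapes and names below are illustrative stand-ins for the decoder's ports, not the actual network:

```python
import numpy as np

N_BLOCKS = 12   # attention blocks in the Small decoder
DIM = 768       # illustrative feature width

def init_states():
    """One empty key array and one empty value array per block: 24 states."""
    return {f"State{i + 1}": np.empty((0, DIM)) for i in range(2 * N_BLOCKS)}

def append_token_states(states, new_kv):
    """After each decoder evaluation, every state grows by one row."""
    return {name: np.vstack([arr, new_kv[name]]) for name, arr in states.items()}
```

Caching the keys and values this way means each evaluation only attends over past tokens without recomputing them, which is what makes the token-by-token loop affordable.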
The initial prompt for the decoder is a sequence of context tokens that guides Whisper's decoding process by specifying the task to perform and the audio's language. These tokens can be hard-coded to explicitly control the output or left flexible, allowing the model to automatically detect the language and task. Define the initial prompt for transcribing audio in Spanish:
Retrieve the integer codes of the prompt tokens:
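In the multilingual Whisper vocabulary the special tokens have fixed integer codes. The values below are assumptions copied from the openai-whisper tokenizer (they differ in the Large-v3 vocabulary, which inserts an extra language token), shown only to illustrate the lookup:

```python
# Assumed token ids from the multilingual Whisper vocabulary;
# verify against the actual tokenizer before relying on them.
SPECIAL_TOKENS = {
    "|StartOfTranscript|": 50258,
    "|es|": 50262,
    "|Translate|": 50358,
    "|Transcribe|": 50359,
    "|NoTimestamps|": 50363,
}

prompt = ["|StartOfTranscript|", "|es|", "|Transcribe|", "|NoTimestamps|"]
prompt_ids = [SPECIAL_TOKENS[t] for t in prompt]
```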
Before starting the decoding process, initialize the decoder's inputs:
Use the decoder iteratively to transcribe the audio. The loop continues until the EndOfString token is generated or the maximum number of iterations is reached:
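The loop itself can be sketched with a stand-in decoder function; `decoder_step` below is a placeholder that returns the next token id, not the actual network, and the EndOfString id is an assumption taken from the multilingual vocabulary:

```python
EOS = 50257  # assumed EndOfString token id in the multilingual vocabulary

def generate(decoder_step, prompt_ids, max_tokens=224):
    """Greedy autoregressive loop: feed the last token and its position,
    stop on EndOfString or after max_tokens iterations."""
    tokens = list(prompt_ids)
    for index in range(max_tokens):
        next_token = decoder_step(tokens[-1], index)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens
```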
Display the generated tokens:
Obtain a readable representation of the tokens by decoding their bytes as UTF-8:
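Whisper's tokenizer is byte-level, so each token corresponds to a byte sequence; joining the bytes and decoding as UTF-8 recovers readable text. A minimal sketch with hypothetical byte pieces (the real pieces come from the tokenizer's vocabulary):

```python
# Hypothetical byte pieces for three tokens of a Spanish word;
# note the "ñ" arrives as the two-byte UTF-8 sequence C3 B1,
# possibly split across token boundaries.
token_bytes = [b"Espa", b"\xc3\xb1", b"ol"]

text = b"".join(token_bytes).decode("utf-8")
print(text)  # Español
```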
Change the task type to translation by setting the third element of the prompt list to "|Translate|":
Generate again based on the new prompt:
Display the generated tokens:
Obtain a readable representation of the tokens: