Whisper-V1 Multilingual Nets

Translate audio recordings in multiple languages to English

The OpenAI Whisper family of multilingual speech recognition models is built to handle diverse languages with near-human accuracy. Trained on 680,000 hours of data across multiple languages, Whisper is optimized for transcription and translation, effortlessly handling accents, background noise and technical jargon.

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=
NetModel["Whisper-V1 Multilingual Nets"]
Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=
NetModel["Whisper-V1 Multilingual Nets", "ParametersInformation"]
Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=
NetModel[{"Whisper-V1 Multilingual Nets", "Size" -> "Tiny", "Part" -> "TextDecoder"}]
Out[3]=

Pick a non-default uninitialized net:

In[4]:=
NetModel[{"Whisper-V1 Multilingual Nets", "Size" -> "Large", "Part" -> "TextDecoder"}, "UninitializedEvaluationNet"]
Out[4]=

Get the labels:

In[5]:=
NetModel["Whisper-V1 Multilingual Nets", "Labels"]
Out[5]=

Evaluation function

Write an evaluation function to combine the encoder and decoder nets into a full translation pipeline:

In[6]:=
(*set the logits of suppressed tokens to -Infinity*)
suppress[logits_, tokenIds_ : {}] := ReplacePart[logits, Thread[tokenIds -> -Infinity]];
(*convert logits to normalized probabilities at a given temperature*)
rescore[logits_, temp_ : 1] := Block[{expRescaledLog, total},
   expRescaledLog = Quiet[Exp[logits/temp]] /. Indeterminate -> 0.;
   total = Total[expRescaledLog, {-1}] /. 0. -> 1.;
   expRescaledLog/total];
(*sample a token index at the given temperature, excluding suppressed tokens*)
sample[probs_, temp_, tokenIds_ : {}] := Block[{weights, suppressLogits},
   suppressLogits = suppress[Log[probs], tokenIds];
   weights = Quiet@rescore[suppressLogits, temp];
   First@If[Max[weights] > 0,
     RandomSample[weights -> Range@Length@weights, 1],
     FirstPosition[#, Max[#]] &@Exp[suppressLogits] (*near-zero temperature: fall back to argmax*)]];
(*zero temperature: greedy argmax*)
sample[probs_, 0., tokenIds_ : {}] := First@FirstPosition[#, Max[#]] &@suppress[probs, tokenIds];
sample[probs_, 0, tokenIds_ : {}] := sample[probs, 0., tokenIds];
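
To see how these helpers behave, apply sample to a toy probability vector (illustrative values, not model output): zero temperature picks the most likely index, positive temperatures sample stochastically and suppressed indices are never returned:

toyProbs = {0.1, 0.7, 0.2};
sample[toyProbs, 0] (*greedy: returns 2*)
sample[toyProbs, 0.8] (*stochastic: usually 2*)
sample[toyProbs, 0, {2}] (*with index 2 suppressed: returns 3*)
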
In[7]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/6d54b0ee-2c2d-44b6-97bd-414dc039d6b4"]
In[8]:=
(*Define fallbackQ function: decide whether decoding should be retried at a higher temperature*)
fallbackQ[noSpeech_, compressionRatio_, compressionRatioThresh_, avgLogProb_, logProbThresh_] := Which[
   And[noSpeech, avgLogProb < logProbThresh], False, (*silent audio*)
   avgLogProb < logProbThresh, True, (*average log-probability is too low*)
   compressionRatio > compressionRatioThresh, True, (*too repetitive*)
   True, False
   ];
(*Define compressionRatio function: ratio of raw text bytes to compressed size, a proxy for repetitiveness*)
compressionRatio[tokens_, labels_] := With[
   {textBytes = StringToByteArray[
      StringJoin@FromCharacterCode[Flatten@ToCharacterCode[labels[[tokens]], "Unicode"], "UTF8"],
      "UTF-8"]},
   N@Length[textBytes]/StringLength[Compress[textBytes]]
   ];
(*Define decodeWithFallback function*)
Options[decodeWithFallback] = {
   "Language" -> Automatic, "Task" -> "Transcribe", "IncludeTimestamps" -> False,
   "SuppressSpecialTokens" -> False, "LogProbabilityThreshold" -> -1,
   "CompressionRatioThreshold" -> 7.2, "Temperature" -> 0,
   MaxIterations -> 224, TargetDevice -> "CPU"};
decodeWithFallback[features_, textDecoder_, initStates_, outPorts_, labels_, prev_, opts : OptionsPattern[]] := Module[{tokens, noSpeech, avgLogProb, compressRatio, outPortst, needsFallback = True, temperatures, i = 1},
   temperatures = Range[OptionValue["Temperature"], 1, 0.2];
   (*if needsFallback is True iterate over different temperatures*)
   While[i <= Length[temperatures],
{tokens, noSpeech, avgLogProb} = generate[features, prev, textDecoder, initStates, outPorts, labels, "Language" -> OptionValue["Language"], "Task" -> OptionValue["Task"], "IncludeTimestamps" -> OptionValue["IncludeTimestamps"], "SuppressSpecialTokens" -> OptionValue["SuppressSpecialTokens"],
       "Temperature" -> temperatures[[i]], MaxIterations -> OptionValue[MaxIterations], TargetDevice -> OptionValue[TargetDevice]];
    (*update iterator*)
    i++;
    (*update needsFallback*)
    compressRatio = compressionRatio[tokens, labels];
    needsFallback = fallbackQ[noSpeech, compressRatio, OptionValue["CompressionRatioThreshold"], avgLogProb, OptionValue["LogProbabilityThreshold"]];
    If[! needsFallback, Break[]];
    ];
   tokens (*return the generated tokens for this chunk*)
   ];
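
To build intuition for the fallback checks, run compressionRatio on toy token sequences (plain strings standing in for the label list, for illustration only): repetitive output compresses well and scores a larger ratio, while the retry schedule raises the temperature from the starting value to 1 in steps of 0.2:

toyLabels = {"yes ", "no ", "maybe "};
compressionRatio[ConstantArray[1, 60], toyLabels] (*highly repetitive: larger ratio*)
compressionRatio[RandomInteger[{1, 3}, 60], toyLabels] (*more varied: smaller ratio*)
Range[0, 1, 0.2] (*temperatures tried in order: {0., 0.2, 0.4, 0.6, 0.8, 1.}*)
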
In[9]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/5561357e-a130-42f7-ae32-cfafd6ec1c16"]

Basic usage

Transcribe speech in different languages:

In[10]:=
audioEnglish = ResourceData["Sample Audio: Apollo 11 One Small Step"]
Out[10]=
In[11]:=
netevaluate[audioEnglish]
Out[11]=
In[12]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/7d204764-c73a-4f81-81da-32f531844ba7"]
In[13]:=
netevaluate[audioSpanish]
Out[13]=

Translate a piece of audio from Spanish to English:

In[14]:=
netevaluate[audioSpanish, "Task" -> "Translate"]
Out[14]=

Whisper can detect the audio language automatically, but the "Language" option can be used to specify the language of the audio in advance:

In[15]:=
netevaluate[audioSpanish, "Language" -> "Spanish"]
Out[15]=

Feature extraction

Get a set of English and German audio samples:

In[16]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/f2178a92-b898-48ec-852e-0876a90eaa4e"]

Define a feature extraction using the Whisper encoder:

In[17]:=
extractor = NetChain[{NetModel["Whisper-V1 Multilingual Nets"], AggregationLayer[Max, 1]}][AudioPartition[#, 30][[1]]] &; (*encode the first 30-second chunk and max-pool the features over time*)
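
Applied to a single sample, the extractor returns a fixed-length feature vector; with the default "Size" -> "Small" encoder this is a vector of 768 features, max-pooled over the 1500 encoder time steps (a quick check, assuming audios is the collection retrieved above):

Dimensions@extractor[First[audios]] (*{768}*)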

Visualize the feature space embedding performed by the audio encoder. Notice that the audio samples from the same class are clustered together:

In[18]:=
FeatureSpacePlot[audios, FeatureExtractor -> extractor, LabelingFunction -> Callout]
Out[18]=

Language identification

Whisper can transcribe and translate audio from 99 languages, with Whisper Large adding support for Cantonese. Retrieve the list of available languages from the label set:

In[19]:=
languages = NetModel["Whisper-V1 Multilingual Nets", "Labels"][[50260 ;; 50359]]
Out[19]=
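
The slice covers 100 language tokens: the 99 base languages plus the Cantonese token supported only by Whisper Large:

Length[languages] (*100*)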

Obtain a collection of audio samples featuring speakers of different languages:

In[20]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/90050448-1422-4037-a295-de3c3fda0b1a"]

Define a function to detect the language of the audio sample. Whisper determines the language by selecting the most likely language token after the initial pass of the decoder (the following code needs definitions from the "Evaluation function" section):

In[21]:=
getLanguage[audio_] := Module[{encoded, labels, languages, sosCode, textDecoder, initStates, init, probs, language, aud},
   (*pad the audio up to the 30-second window expected by the encoder*)
   aud = AudioPad[audio, 30 - Min[30, QuantityMagnitude[Duration[audio]]]];
   encoded = NetModel["Whisper-V1 Multilingual Nets"][aud];
   labels = DeleteCases[NetModel["Whisper-V1 Multilingual Nets", "Labels"], "|Cantonese|"];
   languages = <|50260 -> "|English|", 50261 -> "|Chinese|", 50262 -> "|German|", 50263 -> "|Spanish|", 50264 -> "|Russian|", 50265 -> "|Korean|", 50266 -> "|French|", 50267 -> "|Japanese|", 50268 -> "|Portuguese|", 50269 -> "|Turkish|", 50270 -> "|Polish|", 50271 -> "|Catalan|", 50272 -> "|Dutch|", 50273 -> "|Arabic|", 50274 -> "|Swedish|", 50275 -> "|Italian|", 50276 -> "|Indonesian|", 50277 -> "|Hindi|", 50278 -> "|Finnish|", 50279 -> "|Vietnamese|", 50280 -> "|Hebrew|", 50281 -> "|Ukrainian|", 50282 -> "|Greek|", 50283 -> "|Malay|", 50284 -> "|Czech|", 50285 -> "|Romanian|", 50286 -> "|Danish|", 50287 -> "|Hungarian|", 50288 -> "|Tamil|", 50289 -> "|Norwegian|", 50290 -> "|Thai|", 50291 -> "|Urdu|", 50292 -> "|Croatian|", 50293 -> "|Bulgarian|", 50294 -> "|Lithuanian|", 50295 -> "|Latin|", 50296 -> "|Maori|", 50297 -> "|Malayalam|", 50298 -> "|Welsh|", 50299 -> "|Slovak|", 50300 -> "|Telugu|", 50301 -> "|Persian|", 50302 -> "|Latvian|", 50303 -> "|Bengali|", 50304 -> "|Serbian|", 50305 -> "|Azerbaijani|", 50306 -> "|Slovenian|", 50307 -> "|Kannada|", 50308 -> "|Estonian|", 50309 -> "|Macedonian|", 50310 -> "|Breton|", 50311 -> "|Basque|", 50312 -> "|Icelandic|", 50313 -> "|Armenian|", 50314 -> "|Nepali|", 50315 -> "|Mongolian|", 50316 -> "|Bosnian|", 50317 -> "|Kazakh|", 50318 -> "|Albanian|", 50319 -> "|Swahili|",
      50320 -> "|Galician|", 50321 -> "|Marathi|", 50322 -> "|Punjabi|", 50323 -> "|Sinhala|", 50324 -> "|Khmer|", 50325 -> "|Shona|", 50326 -> "|Yoruba|", 50327 -> "|Somali|", 50328 -> "|Afrikaans|", 50329 -> "|Occitan|", 50330 -> "|Georgian|", 50331 -> "|Belarusian|", 50332 -> "|Tajik|", 50333 -> "|Sindhi|", 50334 -> "|Gujarati|", 50335 -> "|Amharic|", 50336 -> "|Yiddish|", 50337 -> "|Lao|", 50338 -> "|Uzbek|", 50339 -> "|Faroese|", 50340 -> "|Haitian creole|", 50341 -> "|Pashto|", 50342 -> "|Turkmen|", 50343 -> "|Nynorsk|", 50344 -> "|Maltese|",
      50345 -> "|Sanskrit|", 50346 -> "|Luxembourgish|", 50347 -> "|Myanmar|", 50348 -> "|Tibetan|", 50349 -> "|Tagalog|",
      50350 -> "|Malagasy|", 50351 -> "|Assamese|", 50352 -> "|Tatar|", 50353 -> "|Hawaiian|", 50354 -> "|Lingala|", 50355 -> "|Hausa|", 50356 -> "|Bashkir|", 50357 -> "|Javanese|", 50358 -> "|Sundanese|"|>;
   sosCode = 50259; (*StartOfString token code*)
   textDecoder = NetModel[{"Whisper-V1 Multilingual Nets", "Part" -> "TextDecoder"}];
   initStates = AssociationMap[Function[x, {}], Select[Information[textDecoder, "InputPortNames"], StringStartsQ["State"]]];
   init = Join[
     <|
      "Index" -> 1,
      "Input1" -> sosCode,
      "Input2" -> encoded
      |>,
     initStates
     ];
   (*run the decoder once and greedily pick the most likely language token, suppressing every non-language token*)
   probs = textDecoder[init, NetPort[{"softmax", "Output"}]];
   language = sample[probs, 0, Complement[Range[Length[labels]], Keys[languages]]];
   labels[[language]]
   ];

Detect the languages:

In[22]:=
Map[getLanguage, audios]
Out[22]=

Transcribe and translate the audio samples:

In[23]:=
Dataset@KeyValueMap[<|<|"Language" -> #1|> -> <|
      "Transcription" -> netevaluate[#2, "Language" -> #1], "Translation" -> netevaluate[#2, "Task" -> "Translate", "Language" -> #1]|>|> &,
   audios]
Out[23]=

Advanced usage

Set the option "IncludeTimestamps" to True to add timestamps at the beginning and end of the audio:

In[24]:=
audio = ResourceData["Sample Audio: Apollo 11 One Small Step"]
Out[24]=
In[25]:=
netevaluate[audio, "IncludeTimestamps" -> True]
Out[25]=

The option "SuppressSpecialTokens" removes the non-speech tokens. Compare the transcription of the original audio sample with the sample after "SuppressSpecialTokens" is enabled:

In[26]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/efa3b600-9a15-4295-8f08-e427cad6bb14"]
In[27]:=
netevaluate[audio]
Out[27]=
In[28]:=
netevaluate[audio, "SuppressSpecialTokens" -> True]
Out[28]=

Transcription and translation generation

The translation pipeline makes use of two separate transformer nets, encoder and decoder:

In[29]:=
audioEncoder = NetModel["Whisper-V1 Multilingual Nets", "Part" -> "AudioEncoder"]
Out[29]=
In[30]:=
textDecoder = NetModel[{"Whisper-V1 Multilingual Nets", "Part" -> "TextDecoder"}]
Out[30]=

The encoder preprocesses the input audio into a log-Mel spectrogram, capturing the signal's frequency content over time:

In[31]:=
lms = NetTake[
  NetModel["Whisper-V1 Multilingual Nets"], {"logMelSpectrogram"}]
Out[31]=

Get an input audio sample and compute its log-Mel spectrogram:

In[32]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/48661251-eff7-4703-a20c-f2d437fdceb3"]
In[33]:=
logMelSpectrogram = lms[audio];
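
Inspect the dimensions of the spectrogram array; Whisper's front end uses an 80-channel mel filter bank, while the number of time frames depends on the duration of the input audio:

Dimensions[logMelSpectrogram]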

Visualize the log-Mel spectrogram and the audio waveform:

In[34]:=
GraphicsColumn[{
  AudioPlot[audio, PlotRange -> {0, 5}, PlotLabel -> "Audio Waveform",
    FrameTicks -> None, ImageSize -> {300, 100}],
  MatrixPlot[logMelSpectrogram, PlotLabel -> "Log-mel Spectrogram", ColorFunction -> "Rainbow", FrameTicks -> None, ImageSize -> {300, 100}, PlotRange -> {{0, 80}, {0, 500}}]}, ImageSize -> Medium]
Out[34]=

The encoder processes the input once, producing a feature matrix of size 1500×768:

In[35]:=
audioFeatures = audioEncoder[audio];
In[36]:=
Dimensions[audioFeatures]
Out[36]=

The decoding step involves running the decoder multiple times recursively, with each iteration producing a subword token of the translated or transcribed audio. The decoder receives several inputs:

In[37]:=
Information[textDecoder, "InputPorts"]
Out[37]=

• The port "Input1" takes the subword token generated by the previous evaluation of the decoder.

• The port "Index" takes an integer keeping count of how many times the decoder was evaluated (positional encoding).

• The port "Input2" takes the encoded features produced by the encoder. The data fed to this input is the same for every evaluation of the decoder.

• The ports "State1", "State2"... take the self-attention key and value arrays for all the past tokens. Their size grows by one at each evaluation. The default ("Size" -> "Small") decoder has 12 attention blocks, which makes for 24 states: 12 key arrays and 12 value arrays.

The initial prompt for the decoder is a sequence of context tokens that guides Whisper's decoding process by specifying the task to perform and the audio's language. These tokens can be hard-coded to explicitly control the output or left flexible, allowing the model to automatically detect the language and task. Define the initial prompt for transcribing audio in Spanish:

In[38]:=
prompt = {StartOfString, "|Spanish|", "|Transcribe|", "|NoTimestamps|"};

Retrieve the integer codes of the prompt tokens:

In[39]:=
labels = DeleteCases[NetModel["Whisper-V1 Multilingual Nets", "Labels"], "|Cantonese|"];
promptCodes = Flatten@Map[Position[labels, #] &, prompt]
Out[40]=

Before starting the decoding process, initialize the decoder's inputs:

In[41]:=
initStates = AssociationMap[Function[x, {}], Select[Information[textDecoder, "InputPortNames"], StringStartsQ["State"]]];
In[42]:=
index = 1;
sosCode = 50259;
init = Join[
   <|"Index" -> index,
    "Input1" -> sosCode,
    "Input2" -> audioFeatures
    |>,
   initStates
   ];
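
As a quick check before running the full loop, a single evaluation of the decoder returns the predicted token code in its "Output" port together with the updated attention states in the "OutState" ports, which are renamed back to "State" inputs at every step of the loop below:

netOut = textDecoder[init];
Keys[netOut] (*"Output" plus the "OutState" ports*)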

Use the decoder iteratively to transcribe the audio. The recursion keeps going until the EndOfString token is generated or the maximum number of iterations is reached:

In[43]:=
eosCode = 50258;
isGenerating = False;
tokens = {};
NestWhile[
  Function[
   If[SameQ[index, Length[prompt]], isGenerating = True];
   netOut = textDecoder[#];
   If[isGenerating, AppendTo[tokens, netOut["Output"]]];
   Join[
    KeyMap[StringReplace["OutState" -> "State"], netOut] (*include last states*),
    <|"Index" -> ++index, (*update index*)
     "Input1" -> If[isGenerating, netOut["Output"], promptCodes[[index]]], (*input last generated token*)
     "Input2" -> audioFeatures (*audio features for transcription*)
     |>
    ]
   ],
  init,
  #Input1 =!= eosCode &,(*stops when EndOfString token is generated*)
  1,
  100 (*Max iterations*)
  ];

Display the generated tokens:

In[44]:=
tokens
Out[44]=

Obtain a readable representation of the tokens by decoding their character codes as UTF-8 text (Most drops the final EndOfString token):

In[45]:=
FromCharacterCode[
 Flatten@ToCharacterCode[labels[[Most@tokens]], "Unicode"], "UTF8"]
Out[45]=

Change the task to translation by setting the third element of the prompt list to "|Translate|":

In[46]:=
prompt = {StartOfString, "|Spanish|", "|Translate|", "|NoTimestamps|"};
promptCodes = Flatten@Map[Position[labels, #] &, prompt]
Out[47]=

Generate again based on the new prompt:

In[48]:=
index = 1;
isGenerating = False;
tokens = {};
NestWhile[
  Function[
   If[SameQ[index, Length[prompt]], isGenerating = True];
   netOut = textDecoder[#];
   If[isGenerating, AppendTo[tokens, netOut["Output"]]];
   Join[
    KeyMap[StringReplace["OutState" -> "State"], netOut] (*include last states*),
    <|"Index" -> ++index, (*update index*)
     "Input1" -> If[isGenerating, netOut["Output"], promptCodes[[index]]], (*input last generated token*)
     "Input2" -> audioFeatures (*audio features for transcription*)
     |>
    ]
   ],
  init,
  #Input1 =!= eosCode &,(*stops when EndOfString token is generated*)
  1,
  100 (*Max iterations*)
  ];

Display the generated tokens:

In[49]:=
tokens
Out[49]=

Obtain a readable representation of the tokens:

In[50]:=
FromCharacterCode[
 Flatten@ToCharacterCode[labels[[Most@tokens]], "Unicode"], "UTF8"]
Out[50]=

Net information

Inspect the number of parameters of all arrays in the net:

In[51]:=
Information[
 NetModel["Whisper-V1 Multilingual Nets"], "ArraysElementCounts"]
Out[51]=

Obtain the total number of parameters:

In[52]:=
Information[
 NetModel["Whisper-V1 Multilingual Nets"], "ArraysTotalElementCount"]
Out[52]=

Obtain the layer type counts:

In[53]:=
Information[
 NetModel["Whisper-V1 Multilingual Nets"], "LayerTypeCounts"]
Out[53]=

Display the summary graphic:

In[54]:=
Information[NetModel["Whisper-V1 Multilingual Nets"], "SummaryGraphic"]
Out[54]=

Reference

  • A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, "Robust Speech Recognition via Large-Scale Weak Supervision," arXiv:2212.04356v1 (2022)
  • Available from: https://github.com/openai/whisper
  • Rights: MIT License