Function Repository Resource:

GPTTokenizer

Tokenize an input string into a list of integers from a vocabulary that was originally used to train GPT nets

Contributed by: Maria Sargsyan, Giulio Alessandrini

ResourceFunction["GPTTokenizer"][]

returns a GPT NetEncoder.

ResourceFunction["GPTTokenizer"]["string"]

tokenizes an input "string" into a list of integers from the GPT neural net vocabulary.

ResourceFunction["GPTTokenizer"][{"input1", "input2", "input3", …}]

tokenizes each string in a list into integer tokens from the GPT neural net vocabulary.

Details and Options

GPTTokenizer takes a Method option with the following possible values:
"GPT-3.5"    uses the 'CL100K' vocabulary. This vocabulary is also used by other models, including GPT-4, GPT-3.5-turbo, text-embedding-ada-002, text-embedding-3-small and text-embedding-3-large.
"GPT-2"      uses the 'R50K' vocabulary. GPT-3 models, such as Davinci, use this same vocabulary.
"P50K"       uses the 'P50K' vocabulary. Codex models, as well as text-davinci-002 and text-davinci-003, use this same vocabulary.

Examples

Basic Examples (1) 

Encode a string of characters:

In[1]:=
ResourceFunction["GPTTokenizer"]["Hello world"]
Out[1]=
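
The list form from the usage section can be tried the same way. This is a minimal sketch with made-up input strings; the result is assumed to hold one list of integer tokens per input string:

```wolfram
(* Tokenize several strings in one call; the inputs are illustrative only *)
ResourceFunction["GPTTokenizer"][{"Hello world", "Goodbye world"}]
```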

Options (3) 

Encode a string of characters using the "GPT-3.5" method:

In[2]:=
ResourceFunction["GPTTokenizer"]["Hello world", Method -> "GPT-3.5"]
Out[2]=

Encode a string of characters using the "GPT-2" method:

In[3]:=
ResourceFunction["GPTTokenizer"]["Hello world", Method -> "GPT-2"]
Out[3]=

Encode a string of characters using the "P50K" method:

In[4]:=
ResourceFunction["GPTTokenizer"]["Hello world", Method -> "P50K"]
Out[4]=
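
To see how the vocabularies differ, it can help to tokenize one string under every Method value at once. The following is a minimal sketch; the variable name methods is introduced here for illustration and is not part of the resource function:

```wolfram
(* The three Method values documented under Details and Options *)
methods = {"GPT-3.5", "GPT-2", "P50K"};

(* Build an association from each method name to the tokens it produces *)
AssociationMap[
  ResourceFunction["GPTTokenizer"]["Hello world", Method -> #] &,
  methods
]
```

Because each vocabulary assigns its own integer IDs, the token lists are generally not interchangeable between methods.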

Application (2) 

Get the GPT NetEncoder:

In[5]:=
encoder = ResourceFunction["GPTTokenizer"][]
Out[5]=

Check that the encoder and the resource function produce the same tokens:

In[6]:=
encoder["Hello world"] === ResourceFunction["GPTTokenizer"]["Hello world"]
Out[6]=
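
A common use of a tokenizer is counting tokens, for example to estimate whether a prompt fits within a model's context window. The following is a minimal sketch; the names tokenizer, texts and tokenCounts and the sample strings are assumptions made for illustration:

```wolfram
(* Keep a handle to the resource function for repeated use *)
tokenizer = ResourceFunction["GPTTokenizer"];

(* Illustrative inputs only *)
texts = {"Hello world", "Tokenizers split text into subword units."};

(* Tokenize one string at a time, so each result is a flat list of
   integers, and count its tokens *)
tokenCounts = Length[tokenizer[#]] & /@ texts
```

The counts depend on the vocabulary, so pass the Method value that matches the target model when counting for a specific one.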

Requirements

Wolfram Language 13.0 (December 2021) or above

Version History

  • 1.1.0 – 14 February 2024
  • 1.0.0 – 22 March 2023
