Wolfram Function Repository
Instant-use add-on functions for the Wolfram Language
Function Repository Resource: GPTTokenizer
Tokenize an input string into a list of integers from a vocabulary that was originally used to train GPT neural nets
ResourceFunction["GPTTokenizer"][] | returns a GPT NetEncoder
ResourceFunction["GPTTokenizer"]["string"] | tokenizes "string" into a list of integers from the GPT neural net vocabulary
ResourceFunction["GPTTokenizer"][{"input1", "input2", "input3", …}] | tokenizes a list of strings into lists of integer tokens from the GPT neural net vocabulary
"GPT-3.5" | uses the "CL100K" vocabulary, which is also used by GPT-4, GPT-3.5-turbo, text-embedding-ada-002, text-embedding-3-small and text-embedding-3-large
"GPT-4o" | uses the "O200K_base" vocabulary
"GPT-2" | uses the "R50K" vocabulary, which is also used by GPT-3 models such as Davinci
"P50K" | uses the "P50K" vocabulary, which is also used by Codex models and by text-davinci-002 and text-davinci-003
Encode a string of characters:
In[1]:=
Out[1]=
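The evaluation cell above was not captured in this extraction. A call with the default vocabulary would look like the following; the input string is an illustrative assumption, and the output is a list of integer token IDs whose exact values depend on the vocabulary:

```
(* tokenize a string with the default GPT vocabulary; "hello world" is an illustrative input *)
ResourceFunction["GPTTokenizer"]["hello world"]
```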
Encode a string of characters using "GPT-3.5" Method:
In[2]:=
Out[2]=
In[3]:=
Out[3]=
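The two stripped cells above would contain calls like the following sketch, assuming the Method option named in the table earlier; the input strings are illustrative:

```
(* select the "CL100K" vocabulary via Method -> "GPT-3.5" *)
ResourceFunction["GPTTokenizer"]["hello world", Method -> "GPT-3.5"]

(* a list of strings yields one token list per string *)
ResourceFunction["GPTTokenizer"][{"hello world", "good morning"}, Method -> "GPT-3.5"]
```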
Encode a string of characters using "GPT-2" Method:
In[4]:=
Out[4]=
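A sketch of the stripped cell, with an illustrative input string; Method -> "GPT-2" selects the "R50K" vocabulary described above:

```
(* tokenize with the GPT-2 ("R50K") vocabulary *)
ResourceFunction["GPTTokenizer"]["hello world", Method -> "GPT-2"]
```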
Encode a string of characters using "P50K" Method:
In[5]:=
Out[5]=
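Likewise for the "P50K" vocabulary; the input string is an illustrative assumption:

```
(* tokenize with the "P50K" vocabulary used by Codex-era models *)
ResourceFunction["GPTTokenizer"]["hello world", Method -> "P50K"]
```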
Get the GPT NetEncoder:
In[6]:=
Out[6]=
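Calling the resource function with no arguments returns the underlying NetEncoder, which can then be applied directly like a function. The variable name and input below are illustrative:

```
(* obtain the GPT NetEncoder object *)
enc = ResourceFunction["GPTTokenizer"][];

(* apply the encoder directly to a string *)
enc["hello world"]
```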
Check that tokenization is the same:
In[7]:=
Out[7]=
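The consistency check above can be sketched as follows, assuming the NetEncoder from the previous example; the test string is illustrative:

```
(* verify that applying the NetEncoder matches calling the resource function directly *)
enc = ResourceFunction["GPTTokenizer"][];
SameQ[enc["hello world"], ResourceFunction["GPTTokenizer"]["hello world"]]
```

This should evaluate to True when both paths use the same vocabulary.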
Requirements: Wolfram Language 13.0 (December 2021) or above
This work is licensed under a Creative Commons Attribution 4.0 International License