Function Repository Resource:

GPTTokenizer (1.0.0; current version: 1.2.0)

Tokenize an input string into a list of integer token indices from the vocabulary originally used to train GPT nets

Contributed by: Maria Sargsyan

ResourceFunction["GPTTokenizer"][]

returns a GPT NetEncoder.

ResourceFunction["GPTTokenizer"]["string"]

tokenizes an input "string" into a list of integers from the GPT neural net vocabulary.

Examples

Basic Examples

Encode a string of characters:

In[1]:=
ResourceFunction["GPTTokenizer"]["Hello world"]
Out[1]=

Get the GPT NetEncoder:

In[2]:=
encoder = ResourceFunction["GPTTokenizer"][]
Out[2]=

Check that the encoder and the resource function produce the same tokenization:

In[3]:=
encoder["Hello world"] === ResourceFunction["GPTTokenizer"]["Hello world"]
Out[3]=
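
As a minimal sketch beyond the original examples, the number of tokens in several strings can be compared by mapping the tokenizer over a list. The names tokenize and texts are illustrative and the sample strings are arbitrary; the actual counts depend on the GPT vocabulary:

In[4]:=
tokenize = ResourceFunction["GPTTokenizer"]; (* illustrative shorthand for the resource function *)
texts = {"Hello world", "Tokenization splits text into subword units."}; (* arbitrary sample inputs *)
AssociationMap[Length[tokenize[#]] &, texts] (* association from each string to its token count *)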

Version History

  • 1.2.0 – 26 June 2024
  • 1.1.0 – 14 February 2024
  • 1.0.0 – 22 March 2023
