Visualizing tokenization

OpenAI publishes tiktoken, an open-source Python library that tokenizes input strings the same way it is done internally for their models. This is useful for estimating costs and for ensuring that inputs stay within limits such as the context window. gt4openai also relies on this library internally, for instance to estimate the cost of a fine-tuning job.
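
For instance, counting the tokens of a string with plain tiktoken is enough for a rough cost estimate. A minimal sketch, assuming tiktoken is installed (pip install tiktoken):

import tiktoken

# Look up the encoding used by a given model
encoding = tiktoken.encoding_for_model("gpt-4")

# The number of token ids is what gets billed and what
# counts against the model's context window
tokens = encoding.encode("You are a helpful assistant")
print(len(tokens))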

To evaluate the result of tokenizing with various models, you can work directly with the tiktoken library. You can also use a small wrapper we created that helps visualize the results. To this end, first install the gtoolkit_tiktokenize module:

"Start the Python runtime if it is not running, then install the wrapper module"
PBApplication isRunning ifFalse: [ PBApplication start ].
PBApplication uniqueInstance installModule: 'gtoolkit_tiktokenize'

Then execute the following Python script:

import gtoolkit_tiktokenize

# Tokenize a chat transcript written with the special-token
# chat markup used by OpenAI chat models
model = "gpt-4"
string = "<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|><|im_start|>assistant<|im_sep|>How can I help you?<|im_end|><|im_start|>user<|im_sep|>Can you add two and two for me?<|im_end|>"
gtoolkit_tiktokenize.tokenize(string, model)
  
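As a point of comparison, here is a minimal sketch of inspecting the same kind of result with plain tiktoken, decoding each token id back into the exact bytes it covers (a plain string is used here, since the <|im_start|>-style markers above belong to the chat format rather than ordinary text):

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

# Show each token id next to the bytes it covers,
# which makes the token boundaries visible
tokens = encoding.encode("Can you add two and two for me?")
for token in tokens:
    print(token, encoding.decode_single_token_bytes(token))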

The equivalent in Pharo code is as follows:

GtLlmTokenizer new
	tokenizeMessages: {GtLlmSystemMessage new content: 'You are a helpful assistant'.
			GtLlmAssistantMessage new content: 'How can I help you?'.
			GtLlmUserMessage new content: 'Can you add two and two for me?'}
	usingModel: 'gpt-4'