Visualizing tokenization
OpenAI publishes tiktoken, an open-source Python library that tokenizes input strings the same way OpenAI does internally for its models. This is useful for estimating costs and for ensuring that inputs stay within limits such as context windows. gt4openai also relies on this library internally, for instance to estimate the cost of a fine-tuning job.
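For a quick sense of what tiktoken offers, here is a minimal example that counts the tokens of a string using tiktoken directly (this assumes tiktoken is installed, e.g. via pip, and that your tiktoken version recognizes the model name):

import tiktoken

# Look up the encoding tiktoken associates with the model.
enc = tiktoken.encoding_for_model("gpt-4")

# Encode a string into token ids; the token count is what drives
# cost and context-window estimates.
ids = enc.encode("Can you add two and two for me?")
print(len(ids), ids)

# Decoding the ids round-trips back to the original string.
print(enc.decode(ids))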
To evaluate the result of tokenizing with various models, you can work directly with the tiktoken library. But you can also use a small wrapper we created that helps visualize the results. To this end, first install the gtoolkit_tiktokenize module:
PBApplication isRunning ifFalse: [ PBApplication start ].
PBApplication uniqueInstance installModule: 'gtoolkit_tiktokenize'
Then execute the following Python script:
import gtoolkit_tiktokenize

model = "gpt-4"
string = "<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|><|im_start|>assistant<|im_sep|>How can I help you?<|im_end|><|im_start|>user<|im_sep|>Can you add two and two for me?<|im_end|>"

gtoolkit_tiktokenize.tokenize(string, model)
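If you are curious what such a wrapper might do under the hood, the sketch below pairs each token id with its decoded text, which is the kind of data a token visualization needs. This is an illustrative assumption built on plain tiktoken, not the actual gtoolkit_tiktokenize implementation:

import tiktoken

def tokenize_preview(string, model):
    # Hypothetical helper: map each token id to the text it covers.
    # Not the actual gtoolkit_tiktokenize code.
    enc = tiktoken.encoding_for_model(model)
    return [(tid, enc.decode([tid])) for tid in enc.encode(string)]

for tid, text in tokenize_preview("Can you add two and two for me?", "gpt-4"):
    print(tid, repr(text))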
The equivalent in Pharo code is as follows:
GtLlmTokenizer new
	tokenizeMessages: {
		GtLlmSystemMessage new content: 'You are a helpful assistant'.
		GtLlmAssistantMessage new content: 'How can I help you?'.
		GtLlmUserMessage new content: 'Can you add two and two for me?' }
	usingModel: 'gpt-4'
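Note how the message objects correspond to the special-token string in the Python script above. A sketch of that mapping, with the format taken from the example string (the helper below is illustrative, not GT's actual code):

def format_messages(messages):
    # Illustrative only: rebuild the chat-format string from the
    # earlier example out of (role, content) pairs.
    return "".join(
        "<|im_start|>%s<|im_sep|>%s<|im_end|>" % (role, content)
        for role, content in messages
    )

print(format_messages([
    ("system", "You are a helpful assistant"),
    ("assistant", "How can I help you?"),
    ("user", "Can you add two and two for me?"),
]))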