Explaining Andrej Karpathy's tokenization explanation

A while back, Andrej Karpathy put together a tutorial on tokenization. The tutorial is excellent, and it includes a detailed, stepwise implementation of the byte pair encoding (BPE) algorithm, a topic of great interest at the time.

But here we want to focus on the introductory part, which starts with an explanation of what tokenization is. He carried out that explanation using tiktokenizer.vercel.app. The app uses the tiktoken library from OpenAI and displays the result graphically. You can see a snapshot from the video below.

Explaining tokenization with Tiktokenizer
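
Under the hood, the app's display is driven by tiktoken. A minimal sketch of the underlying call (our own illustration, not the app's actual code):

import tiktoken

enc = tiktoken.get_encoding("gpt2")            # GPT-2's byte pair encoding
tokens = enc.encode("How can I help you?")     # a list of integer token ids
pieces = [enc.decode([t]) for t in tokens]     # the text piece behind each id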

It is interesting that a tool can explain the meaning of a technical algorithm to newcomers. However, this explanation comes from an external tool, outside the environment he uses for development (in his case, a notebook). But what if we bring that explanation into the development environment?

Explaining the algorithm's output through custom views

Let's see. In his tutorial, Andrej builds a tokenization engine. So, to show what the explanation can look like, we took his code and the tiktoken library and wrapped them in a small project called gtoolkit_tiktokenize. We first set up the Python runtime:

"Start the Python runtime if needed, then install the wrapper package"
PBApplication isRunning ifFalse: [ PBApplication start ].
PBApplication uniqueInstance installModule: 'gtoolkit_tiktokenize'

Then we can simply invoke the tokenization, and the resulting object offers the explanation in the inspector.

import gtoolkit_tiktokenize

model = "gpt-2"
string = "<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|><|im_start|>assistant<|im_sep|>How can I help you?<|im_end|><|im_start|>user<|im_sep|>Can you add two and two for me?<|im_end|>"
gtoolkit_tiktokenize.tokenize(string, model)   # inspect the result for the explanation

The example above shows that we can express the explanation as a rather simple inspector extension of the domain model.
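
For intuition, here is a minimal sketch of what the Python side of such a wrapper might look like. Only the tokenize entry point comes from the snippet above; the Tokenization result class and the model-name mapping are our assumptions:

import tiktoken

class Tokenization:
    # hypothetical result object; in Glamorous Toolkit, an inspector
    # extension on an object like this renders the token view
    def __init__(self, string, tokens, encoding):
        self.string = string
        self.tokens = tokens
        self.encoding = encoding

def tokenize(string, model):
    # assumption: map model names like "gpt-2" onto tiktoken encoding names
    encoding = tiktoken.get_encoding(model.replace("-", ""))
    tokens = encoding.encode(string, allowed_special="all")
    return Tokenization(string, tokens, encoding)

But if we can do that, can we do more? Indeed we can.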

Explaining the algorithm's steps through custom views

The tokenization logic is iterative: the input string gets transformed into numbers in steps, each step performing a compression that merges the most frequent pair of adjacent tokens.
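
For reference, one such compression step, in the spirit of the tutorial, looks roughly like this (a simplified sketch, not Andrej's exact code):

def most_frequent_pair(ids):
    # count adjacent pairs of token ids and return the most frequent one
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return max(counts, key=counts.get)

def merge(ids, pair, new_id):
    # replace every occurrence of pair with the new token id
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))   # start from raw bytes
pair = most_frequent_pair(ids)              # the pair to compress next
ids = merge(ids, pair, 256)                 # 256 is the first id beyond raw bytes

Let's bring the explanation to each of these steps.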

"Locate the tutorial's Python sources shipped with gtoolkit-demos"
filePath := (GtResourcesUtility * 'feenkcom/gtoolkit-demos/python') fullName
  
import sys
sys.path.append(filePath)            # filePath is bound by the Pharo snippet above
import tokenization
encoder = tokenization.BPEEncoder()
encoder.train(string, 286)           # 286 = target vocabulary size: 256 raw bytes plus 30 merges
encoder.encode(string)

In this case we see not only the visualization of the overall output, but also the visualization of each merge. And when we combine the list of all merges with the visualization of each one, we essentially get a postmortem debugger.
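
To make the postmortem idea concrete: if each merge is recorded as it happens, the whole run can be replayed and inspected afterwards. A sketch using the helper functions from the earlier snippet (our illustration, not the project's actual code):

history = []
ids = list(string.encode("utf-8"))
for new_id in range(256, 286):              # 286 matches the vocabulary size used above
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, new_id)
    history.append((pair, new_id, len(ids)))    # one record per compression step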

Why are such explanations important?

Tokenization is a highly relevant topic these days; the original video alone has already been viewed some half a million times. So, explaining it matters. Now, Andrej is a fantastic teacher, but to provide his explanation he reached for an app outside of his development environment. And when it came to explaining his algorithm, the explanations were delivered as plain numbers and strings. All this shows that the level of expectation about how explainable systems can be is low in our industry. Interesting explanations are rare, and when one does appear, it lives somewhere else, removed from where development happens.

In contrast, our little demo shows that, with little added effort, we can make interesting explanations an integral part of the development flow. These explanations should not be unusual events. They should be pervasive. Everywhere. All the time.

Of course, for this to work, we need a new kind of development environment. A moldable development environment. Like Glamorous Toolkit.