Creating a dataset for fine-tuning

The most important part of a fine-tuning run is the dataset. A natural source for such a dataset in Glamorous Toolkit is examples: they offer code and live objects as well as views.

On this page, we generate a dataset for a tutor that produces Phlow views. For that, we need objects and their Phlow views.

To create a dataset from examples, we first need to gather the ones we require.

aCollection := Smalltalk gtExamplesContained
		collect: [ :eachExample | eachExample asCachedExampleWithResult ].
aGroup := GtExampleGroup withAll: aCollection
  

This selects all examples currently defined. We can then run them all. Caution: running every example takes a long time, on the order of multiple hours, so it is best either to let them run asynchronously or to select only a subset of examples.

If desired, the number of examples can be limited by selecting only the first N examples.

numberOfExamples := 1000.
aGroup := aGroup first: numberOfExamples
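
Taking the first N examples may over-represent whichever packages happen to come first in the collection. As an alternative, here is a minimal sketch of drawing a random sample using only standard Pharo collection messages. The names allItems and aSample are hypothetical stand-ins: on this page, allItems would be aCollection from the first snippet, and the sample would be wrapped with GtExampleGroup withAll: as before.

```smalltalk
"Stand-in data; on this page this would be aCollection."
allItems := (1 to: 5000) asArray.

"Never ask for more elements than exist."
sampleSize := 1000 min: allItems size.

"shuffled answers a shuffled copy and leaves the receiver unchanged;
first: then takes the leading sampleSize elements of that copy."
aSample := allItems shuffled first: sampleSize.
aSample size
```

The sample differs on every evaluation, so keep aSample around if the dataset needs to be reproducible.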
  

Finally, all the examples need to be run.

aGroup runNotYetExecuted
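
Because the run can take hours, one option is to move it to a background process so that the image stays usable. The following is a minimal sketch using plain Pharo processes, not a Glamorous Toolkit-specific mechanism. The names work, finished, and result are hypothetical, and the workload is a stand-in: on this page the block would be [ aGroup runNotYetExecuted ].

```smalltalk
"Stand-in workload; on this page it would be [ aGroup runNotYetExecuted ]."
work := [ (1 to: 1000) inject: 0 into: [ :sum :each | sum + each ] ].

finished := Semaphore new.
result := nil.

"Run the work at a low priority so the UI process is served first."
process := [ 
	result := work value.
	finished signal ]
		forkAt: Processor userBackgroundPriority
		named: 'Run examples'.

"wait blocks the calling process until the background process signals.
Skip it if you only want to start the run and check back later."
finished wait.
result
```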
  

After all the example results have been collected, we can convert them into conversations for our dataset.

tutor := GtLlmTutor new.

conversations := (aGroup select: [ :anExample | anExample result isNotNil ])
		flatCollect: [ :anExample | 
			(GtLlmExampleViewConversationCollector new
				example: anExample;
				instructions: tutor instruction) conversations ]
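
For orientation: OpenAI's chat fine-tuning format expects a JSONL file in which every line is one JSON object holding a messages array of role/content pairs. The sketch below builds one such line by hand with Pharo's STONJSON. The message contents are made up for illustration, and this page does not show how GtLlmExampleViewConversationCollector and GtLlmFineTuningFile represent and serialize conversations internally.

```smalltalk
"Hypothetical messages; real ones come from the collected conversations."
messages := {
	(Dictionary new at: 'role' put: 'system'; at: 'content' put: 'You are a tutor for Phlow views.'; yourself).
	(Dictionary new at: 'role' put: 'user'; at: 'content' put: 'Create a view for this object.'; yourself).
	(Dictionary new at: 'role' put: 'assistant'; at: 'content' put: 'gtItemsFor: aView ...'; yourself) }.

conversation := Dictionary new.
conversation at: 'messages' put: messages.

"STONJSON toString: answers compact JSON on a single line,
which is the shape a JSONL file needs for each entry."
line := STONJSON toString: conversation.
line
```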
  

Once this is done, we can create a fine-tuning file as specified in Creating a fine-tuned model.

file := GtLlmFineTuningFile new
		name: 'fine-tuning.jsonl';
		model: 'gpt-4o-mini-2024-07-18';
		conversations: conversations.
file costsPerEpoch
  

If you know how many epochs of training will be performed, you can also specify them:

file costsForEpochs: 3
  

Note that, because of special discount cases for some newer models, this is not equivalent to costsPerEpoch * numberOfEpochs.

If the cost is not prohibitive, we can then start a fine-tuning job.

client := GtOpenAIClient withApiKeyFromFile.

openAiFile := client uploadFile: file withPurpose: 'fine-tune'.

fineTuningJob := client createFineTuningJobOnModel: file model withFile: openAiFile id