Molding CSV data — preparing the data

This is Part 1 of the tutorial Molding CSV data into an explainable system. In this part we introduce the CSV dataset of a research project, explore the data to develop some initial questions, and transform the CSV data into JSON to prepare for modeling the domain entities hidden in the data.

Starting point: CSV data from a software visualization publication review study

Moldable development can be applied to any kind of software. In this case our starting point is a research project that reviewed scientific publications on software visualization in order to gain insight into how different software visualizations map to specific developer needs. The results are described in the paper Towards actionable visualization for software developers (a PDF preprint is available). The dataset summarizing the contents of the studied publications is available from a webpage briefly summarizing the results, in CSV format.

To keep things as simple as possible, we have copied the two relevant dataset files to a local folder within GT:

What we see here is an object inspector on the folder containing the dataset. You might not recognize it as such because an inspector usually just shows a generic, “raw” view of an object within a debugger. In this case the inspector has been molded to show a more useful Tree view of the contents of the folder.

⇒ Click on the Raw view and compare it with the Tree view.
⇒ Return to the Tree view and click on the .csv files to inspect them.

Exploring the CSV — posing some initial questions

Let's suppose we are interested in exploring the datasets from this study. The first one, called dataset-review-vissoft-all.csv, summarizes the 346 papers published in the software visualization conferences SOFTVIS and VISSOFT from 2003 through 2015, and the second dataset, dataset-review-vissoft-selected.csv, provides more detailed information about the 65 selected papers that describe Design Studies.

When we start to browse the raw data, many questions might come to mind, for example:

- Which are the selected design study papers?
- How many authors are there?
- Which authors published most frequently?
- What different types of papers were identified?
- Which authors wrote about design studies?
- What other topics did those authors write about?
- Are the two datasets consistent? (Are there any errors in matching them up?)

And many more ...

We can directly view the source data, or we can open these files using spreadsheet software such as Excel, but it is hard to answer our questions with the raw data.

⇒ Try it: click below on the Open in operating system button at the top right.

When we explore the raw data, however, even as a spreadsheet, we just get a very generic, table-oriented view of all the data. With a bit of spreadsheet savvy, we can probably answer many of these questions by manipulating and transforming the raw data, but each answer would be hand-crafted, and it would be hard to navigate the data to answer similar questions about subsets of data.

Transforming the raw data — converting the CSV to JSON

A different way to proceed is to extract a model of domain objects from the raw data, decorated with tiny contextual tools that answer our questions. The model should consist of live objects, so we can interact with it to explore it.

Before we can do that, it would help to transform the data to a more convenient form. Since each row of the first dataset represents a publication and each row of the second one represents a design study, it would help to extract the data row by row. JSON is a convenient representation for this data. Luckily we already have a utility to convert CSV files to JSON, so we can just use that. Our copies of the dataset files reside in the following folder:

datasetFolder := FileLocator gtResource / 'feenkcom' / 'gtoolkit-demos' / 'data'
		/ 'visualisation-review'

⇒ Click on the Evaluate and inspect button (first button at the bottom of the snippet above).

We then obtain the view we had earlier of the dataset folder. We can grab the first file and convert it to JSON, like this:

allPapersFile := datasetFolder / 'dataset-review-vissoft-all.csv'.
allJson := (CSV2JSON for: allPapersFile contents) json

⇒ Inspect the resulting JSON object.

Actually, we are not so much interested in the JSON format itself, but in the jsonObject view, in which each row is just a Dictionary of keys and values. We can convert the CSV dataset to an Array of such dictionaries by sending jsonObject to the CSV2JSON parse tree visitor.

allJson := (CSV2JSON for: allPapersFile contents) jsonObject

⇒ Inspect this result and note how it differs from the previous result.

This looks like a good way to break up the original CSV files into smaller chunks that represent the data for individual objects, though they are still just data, not domain objects.
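
To get a first feel for these dictionaries, we can ask how many rows there are and which keys the first row has. This is only a minimal sketch: it assumes that allJson is still bound to the jsonObject result from the snippet above, and that each element is a Dictionary whose keys are presumably taken from the CSV header row.

"Number of rows: one dictionary per CSV row"
allJson size.
"Keys of the first row (assumed to be the CSV column headers)"
allJson first keys

Inspecting the result of the last expression shows the names that can later be used to look up values in each row.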

Now let's extract the code that locates the dataset folder as an example method, that is, a method that returns an object of interest.

FileLocator gtResource / 'feenkcom' / 'gtoolkit-demos' / 'data'
	/ 'visualisation-review'

⇒ Evaluate and inspect the code above.
⇒ Select the code, right-click and choose Extract example (not Extract method).
⇒ Set the class to VRDatasetExamples and the method to datasetFolder.
⇒ Enter the package name VisualisationReview and the tag Examples. Then click on the checkmark ✓ at the lower left to accept the changes.

The transformed snippet should look like this:

VRDatasetExamples new datasetFolder

Now the code has been extracted as an example method, but it is missing assertions.

⇒ Edit the datasetFolder method (click on the grey triangle to the right of the name to edit the method) to look like this:
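
(The method body below is taken from the Changes snippet further down this page, so your own version may differ in layout. It adds two assertions: that the folder exists, and that it contains exactly two CSV files.)

datasetFolder
	<gtExample>
	| folder |
	folder := FileLocator gtResource / 'feenkcom' / 'gtoolkit-demos'
			/ 'data' / 'visualisation-review'.
	"The folder must exist and hold exactly the two dataset files"
	self assert: folder exists.
	self assert: (folder allChildrenMatching: '*.csv') size equals: 2.
	^ folder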

Now we can do the same to make examples of the two CSV datasets converted to JSON.

allPapersFile := VRDatasetExamples new datasetFolder / 'dataset-review-vissoft-all.csv'.
(CSV2JSON for: allPapersFile contents) jsonObject

⇒ Extract the code above into an example called allPapersJSON.
⇒ Also extract the snippet below into an example selectedPapersJSON.
⇒ Add assertions for the expected size of the result.

selectedPapersFile := VRDatasetExamples new datasetFolder / 'dataset-review-vissoft-selected.csv'.
(CSV2JSON for: selectedPapersFile contents) jsonObject

We can extract these methods from the code snippets, as we did above, or simply load the code from the Changes snippet below.

⇒ Compare your results with those of the Changes snippet below. Hint: Inspect the snippet ([i] button bottom left) to see the code. Click the ✓ button to accept those changes.

VisualisationReview

"Package: VisualisationReview"
Examples

Object << #VRDatasetExamples
	slots: {};
	tag: 'Examples';
	package: 'VisualisationReview'.

Object class << VRDatasetExamples class
	slots: {}

examples

"protocol: #examples"

VRDatasetExamples >> allPapersJSON
	<gtExample>
	| json |
	json := (CSV2JSON
			for: (self datasetFolder / 'dataset-review-vissoft-all.csv') contents)
			jsonObject.
	self assert: json size equals: 346.
	^ json

"protocol: #examples"

VRDatasetExamples >> datasetFolder
	<gtExample>
	| folder |
	folder := FileLocator gtResource / 'feenkcom' / 'gtoolkit-demos'
			/ 'data' / 'visualisation-review'.
	self assert: folder exists.
	self assert: (folder allChildrenMatching: '*.csv') size equals: 2.
	^ folder

"protocol: #examples"

VRDatasetExamples >> selectedPapersJSON
	<gtExample>
	| json |
	json := (CSV2JSON
			for: (self datasetFolder / 'dataset-review-vissoft-selected.csv') contents)
			jsonObject.
	self assert: json size equals: 65.
	^ json

To sync to this point in the tutorial (throwing away any other changes) evaluate:

VRTutorialExamples new fileIn1DatasetExamples

Discussion

The initial process of preparing the data to be wrapped is a pattern known as Tooling Buildup. In this case we already have the tools we need to translate CSV to JSON, so there is nothing to build.

The example methods we created are factory methods that also serve as test cases. Example methods are unary methods that are annotated with <gtExample> and return some object. In the case of datasetFolder, we return a file reference to the dataset folder, while verifying that the folder exists and contains precisely two CSV files. This pattern is known as Example Object. The process of turning snippets into examples illustrates the principle that we first prototype code, starting from a notebook page or playground (such as the code snippets above), and then extract interesting classes and methods.
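
Because an example is just a method that returns an object, one example can build on another, as allPapersJSON does by sending self datasetFolder. As a further sketch of the same idea, a hypothetical example allPapersFile (a name that is not defined anywhere in this tutorial) could return the file reference itself. It assumes that a file reference understands exists in the same way the folder does in datasetFolder.

VRDatasetExamples >> allPapersFile
	"Hypothetical example, not part of the tutorial code"
	<gtExample>
	| file |
	file := self datasetFolder / 'dataset-review-vissoft-all.csv'.
	self assert: file exists.
	^ file

Evaluating VRDatasetExamples new allPapersFile in a snippet would then both run the assertion and return the file for inspection.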

NB: This page links to Pages containing missing references - allowed failures to allow references to missing classes and methods in this page.

Next: Part 2. Molding CSV data — modeling domain objects