Molding CSV data — preparing the data
This is Part 1 of Tutorial: molding CSV data into an explainable system. In this part we introduce the CSV dataset of a research project, explore the data to develop some initial questions, and transform the CSV data into JSON to prepare for modeling the domain entities hidden in the data.
Moldable development can be applied to any kind of software. In this case our starting point is a research project that reviewed scientific publications on software visualization in order to gain insight into how different software visualizations map to specific developer needs. The results are described in the paper Towards actionable visualization for software developers (a PDF preprint is available). The dataset summarizing the contents of the studied publications is available from a webpage briefly summarizing the results, in CSV format.
To keep things as simple as possible, we have copied the two relevant dataset files to a local folder within GT:
⇒ Click on the .csv files above to inspect them.
Let's suppose we are interested in exploring the datasets from this study. The first one, called dataset-review-vissoft-all.csv, summarizes the 346 papers published in the software visualization conferences SOFTVIS and VISSOFT from 2003 through 2015, and the second dataset, dataset-review-vissoft-selected.csv, provides more detailed information about the 65 selected papers that describe Design Studies.
We can view the source data directly, or we can open these files in a spreadsheet application, such as Excel. ⇒ Try it: click below on the Open in operating system button at the top right.
When we start to browse the raw data, many questions might come to mind, for example:
- Which are the selected design study papers?
- How many authors are there?
- Which authors published most frequently?
- What different types of papers were identified?
- Which authors wrote about design studies?
- What other topics did those authors write about?
- Are the two datasets consistent? (Are there any errors in matching them up?)
And many more ...
When we explore the raw data, however, even as a spreadsheet, we just get a very generic, table-oriented view of all the data. With a bit of spreadsheet savvy, we can probably answer many of these questions by manipulating and transforming the raw data, but each answer would be hand-crafted, and it would be hard to navigate the data to answer similar questions about subsets of data.
A different way to proceed is to extract a model of domain objects from the raw data, decorated with tiny contextual tools that answer our questions. The model should consist of live objects, so we can interact with it to explore it. Before we can do that, it would help to transform the data to a more convenient form. Since each row of the first dataset represents a publication and each row of the second one represents a design study, it would help to extract the data row by row. JSON is a convenient representation for this data.
We would like to have a domain object for each publication, but a row of a spreadsheet without the header is not a convenient representation. Our job will be easier if we convert the CSV files to JSON. Luckily we already have a utility to convert CSV files to JSON, so we can just use that. (There is a moldable development pattern, Tooling Buildup, which we can skip in this case, because we already have the basic tools we need.)
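To make the idea of "one dictionary per row" concrete, here is a minimal sketch of the transformation on a tiny, made-up CSV string. It is deliberately naive and is not how the CSV2JSON utility works internally; the column names and values are invented for illustration and are not taken from the dataset.

```smalltalk
"Illustration only: pair each header name with the value in the same column.
This naive split ignores quoting, embedded commas and empty cells."
| csv lines header rows |
csv := 'Title,Year
First paper,2003
Second paper,2015'.
lines := csv lines.
header := lines first substrings: ','.
rows := lines allButFirst collect: [ :line |
	| row |
	row := Dictionary new.
	header
		with: (line substrings: ',')
		do: [ :key :value | row at: key put: value ].
	row ].
rows
```

Inspecting the result shows a collection of two dictionaries, each mapping the header names to the values of one row. Real CSV data needs a proper parser, which is why we rely on the existing utility.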
Our copies of the dataset files reside in the following folder:
datasetFolder := FileLocator gtResource / 'feenkcom' / 'gtoolkit-demos' / 'data' / 'visualisation-review'
Note that we are writing code here directly in a notebook page. In this way we can directly try out snippets of code, while also documenting what we are doing. This is an example of the pattern Project Diary — instead of directly coding in the IDE, we program in notebook pages so we can track our progress and document our process.
⇒ Click on the Evaluate and inspect button (the first button at the bottom of the snippet above). We then obtain the view we had earlier of the dataset folder. We can grab the first file and convert it to JSON, like this:
allFile := datasetFolder / 'dataset-review-vissoft-all.csv'.
allJson := (CSV2JSON for: allFile contents) json
⇒ Inspect the resulting JSON object.
Actually, we are not so much interested in the JSON format itself as in the jsonObject view, in which each row is just a Dictionary of keys and values. We can convert the CSV dataset to an Array of such dictionaries by sending jsonObject to the CSV2JSON parse tree visitor.
allJson := (CSV2JSON for: allFile contents) jsonObject
⇒ Inspect this result and note how it differs from the previous result.
This looks like a good way to break up the original CSV files into smaller chunks that represent the data for individual objects, though they are still just data, not domain objects.
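Before modeling anything, it can help to check what these chunks look like. The following sketch assumes, as described above, that jsonObject answers a collection of dictionaries, and that the dataset files and CSV2JSON are available in the image. It collects the rows whose keys differ from those of the first row.

```smalltalk
"Assumes each row is a Dictionary, as described in the text above."
| folder rows firstKeys |
folder := FileLocator gtResource / 'feenkcom' / 'gtoolkit-demos'
	/ 'data' / 'visualisation-review'.
rows := (CSV2JSON for: (folder / 'dataset-review-vissoft-all.csv') contents)
	jsonObject.
firstKeys := rows first keys asSet.
"Rows whose set of keys differs from the first row's keys."
rows reject: [ :each | each keys asSet = firstKeys ]
```

An empty result would suggest that every row has the same structure. A non-empty one would point at rows worth inspecting more closely.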
At this point it would be useful to turn each of these entities into an Example Object that also exists outside of this notebook page. This illustrates the principle that we first prototype code, starting from a notebook page or playground (such as the code snippet above), and then extract interesting classes and methods.
Consider the following snippet:
VRDatasetExamples new datasetFolder
This code is invalid, because the class VRDatasetExamples does not yet exist, nor does the instance-side method datasetFolder. If you click on the fixit (wrench) icon, you can create the missing class.
⇒ Click on the fixit button. Enter the package name VisualisationReview and the tag Examples. Then click on the checkmark ✓ lower left to accept the changes.
Now we would like to turn the following code snippet into an example.
⇒ Evaluate and inspect the snippet to ensure the result is what you expect.
⇒ Select the code, secondary (right) click, and select the Extract example code refactoring.
⇒ Fill in the class name VRDatasetExamples and method datasetFolder exactly as they appear above, and accept the code change.
FileLocator gtResource / 'feenkcom' / 'gtoolkit-demos' / 'data' / 'visualisation-review'
VRDatasetExamples new datasetFolder
Now the code has been extracted as an example method, but it is missing assertions.
⇒ Edit the datasetFolder method (click on the grey triangle to the right of the name to edit the method) to look like this:
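The version below matches the datasetFolder method defined in the Changes snippet later on this page, with comments added to explain each step.

```smalltalk
datasetFolder
	<gtExample>
	| folder |
	"Locate the folder holding our local copies of the dataset files."
	folder := FileLocator gtResource / 'feenkcom' / 'gtoolkit-demos'
		/ 'data' / 'visualisation-review'.
	"The folder must exist and contain exactly the two CSV files."
	self assert: folder exists.
	self assert: (folder allChildrenMatching: '*.csv') size equals: 2.
	^ folder
```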
This code defines an Example, a method that serves both as a factory for an object of interest, and as a test case containing assertions. Example methods are unary methods that are annotated with <gtExample>, and return some object. In this case we return a file reference to the dataset folder, while verifying that the folder exists and contains precisely two CSV files.
Now we can do the same to make examples of the two CSV datasets converted to JSON.
VRDatasetExamples new allPapersJSON
VRDatasetExamples new selectedPapersJSON
We could extract these methods from the code snippets, as we did above, or we can simply load in the code from the Changes snippet below.
⇒ Inspect this code to see how the additional examples are defined, and accept the changes (✓ button).
⇒ Can you think of some more critical assertions to include in the examples?
VisualisationReview
"Package: VisualisationReview"
Examples
Object << #VRDatasetExamples
    slots: {};
    tag: 'Examples';
    package: 'VisualisationReview'.
Object class << VRDatasetExamples class
    slots: {}
examples
"protocol: #examples"
VRDatasetExamples >> allPapersJSON
    <gtExample>
    | json |
    json := (CSV2JSON
            for: (self datasetFolder / 'dataset-review-vissoft-all.csv') contents)
            jsonObject.
    self assert: json size equals: 346.
    ^ json
"protocol: #examples"
VRDatasetExamples >> datasetFolder
    <gtExample>
    | folder |
    folder := FileLocator gtResource / 'feenkcom' / 'gtoolkit-demos'
            / 'data' / 'visualisation-review'.
    self assert: folder exists.
    self assert: (folder allChildrenMatching: '*.csv') size equals: 2.
    ^ folder
"protocol: #examples"
VRDatasetExamples >> selectedPapersJSON
    <gtExample>
    | json |
    json := (CSV2JSON
            for: (self datasetFolder / 'dataset-review-vissoft-selected.csv') contents)
            jsonObject.
    self assert: json size equals: 65.
    ^ json
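As one possible answer to the question about further assertions, here is a sketch of an additional example. The method name allPapersWithUniformKeys is hypothetical and not part of the tutorial code. It assumes that each row is a Dictionary and that the converter produces the same keys for every row, which may not hold if empty cells are omitted. A failing assertion would itself tell us something useful about the data.

```smalltalk
VRDatasetExamples >> allPapersWithUniformKeys
	<gtExample>
	"Hypothetical additional example, not part of the tutorial code."
	| json firstKeys |
	json := self allPapersJSON.
	firstKeys := json first keys asSet.
	"Every row should expose the same columns as the first one."
	self assert: (json allSatisfy: [ :each | each keys asSet = firstKeys ]).
	"The selected papers should be fewer than the full set of papers."
	self assert: self selectedPapersJSON size < json size.
	^ json
```

Because the method reuses allPapersJSON and selectedPapersJSON, it also shows how examples can build on one another.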
To sync to this point in the tutorial (throwing away any other changes) evaluate:
VRTutorialExamples new fileIn1DatasetExamples
NB: This page links to Pages containing missing references - allowed failures to allow references to missing classes and methods in this page.
Next: Part 2. Molding CSV data — modeling domain objects