Molding CSV data — modeling other entities
This is Part 3 of Tutorial: molding CSV data into an explainable system. In this part, we continue by modeling and molding further domain entities, in particular a dataset as an entity that provides a simple API for finding papers and other domain objects, and authors who contribute to writing papers.
If you are jumping into the tutorial at this point, and have some previously saved work not already in this image, you should file it in now. Otherwise, you can file in a sample snapshot reflecting the work completed at the end of the previous part:
VRTutorialExamples new fileIn7PaperGroup
There are several more domain entities we would like to make explicit: authors, venues, and design studies. We also want to make sure that there is only one instance of each in an instantiated model of a dataset, so the same author doesn't appear multiple times. We can therefore introduce a Dataset entity that wraps the source CSV files, and keeps track of the instances we create. Instances of domain entities should know their dataset, so they can query it. Let's start by introducing the Dataset entity, and have it keep track of papers by ID, using a dictionary.
⇒
Create the VRDataset class with instance creation method forAllCSV:andSelectedCSV:.
⇒
Have this method convert the CSV files to JSON and store them using an instance initialization method allJSON:selectedJSON:.
⇒
Introduce slots allJSON and selectedJSON and their accessors, and use them in the initialization method.
⇒
Extract the code below as an example vrDataset.
VRDataset
    forAllCSV: VRDatasetExamples new datasetFolder / 'dataset-review-vissoft-all.csv'
    andSelectedCSV: VRDatasetExamples new datasetFolder / 'dataset-review-vissoft-selected.csv'
VRDatasetExamples new vrDataset
So far so good. Now we have to create and track the paper entities.
⇒
In the contextual playground of the dataset, evaluate this code:
This is an array of all the papers. We want (1) each paper to know which dataset it comes from, so it can use it to pose queries about other entities, and (2) to build a dictionary of these papers keyed by ID.
⇒
Adapt the instance creation method to add a new parameter for the dataset:
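The snippet for this step is not reproduced on this page. Judging from the paperDict method in the change listing further down, the adapted playground code presumably looks like the following sketch, where self is the dataset being inspected:

```smalltalk
"Evaluated in the contextual playground of the dataset, so self is the VRDataset."
"Each JSON dictionary becomes a VRPaper that also knows its dataset."
self allJSON collect: [ :d | VRPaper for: d in: self ]
```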
⇒
Fix the new method to look like this:
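The method body is missing at this point. It should correspond to the class-side for:in: method that appears in the change listing below:

```smalltalk
"Class-side instance creation method, as listed later on this page."
VRPaper class >> for: aDictionary in: aDataset
    ^ self new
        data: aDictionary;
        dataset: aDataset;
        yourself
```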
⇒
Continue to fix this method, adding the dataset slot and accessors.
⇒
Adapt the playground snippet to create a dictionary instead of an array of papers, for example, like this:
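The snippet is not shown here. One way to write it, consistent with the paperDict method in the change listing below, is to map each paper to an id -> paper association and convert the result to a dictionary:

```smalltalk
"Evaluated in the dataset's playground; self is the VRDataset."
((self allJSON collect: [ :d | VRPaper for: d in: self ])
    collect: [ :p | p id -> p ]) asDictionary
```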
⇒
Extract this as a method paperDict.
⇒
Edit and fix the new paperDict method to cache the result in a paperDict slot:
Note that the paperDict slot is now lazily initialized upon its first access.
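The cached version is not reproduced at this point. It should match the paperDict method in the change listing below, which uses ifNil: to build the dictionary only once:

```smalltalk
VRDataset >> paperDict
    "Build the dictionary on first access and cache it in the paperDict slot."
    ^ paperDict
        ifNil: [ paperDict := ((self allJSON collect: [ :d | VRPaper for: d in: self ])
                collect: [ :p | p id -> p ]) asDictionary ]
```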
⇒
Prototype and extract an allPapers method.
Hint:
Just ask the dictionary for its values. Wrap the resulting array as a VRPaperGroup.
⇒
Extract an All Papers view for the dataset.
Hint:
In the contextual playground, evaluate self allPapers, and primary click to select Create <gtView> forward to Papers.
Adapt the generated method name and the title to reflect the content (all papers).
⇒
Do the same for selected papers: prototype and extract a selectedPapers method, and extract a view for it.
Hint:
Apply a select: query to the allPapers group.
⇒
Add priority: specifications to the All Papers and Selected Papers views, e.g., set them to 10 and 20, respectively.
Hint:
Secondary-click on the tabs to directly access their source code, without having to switch to the Meta view.
Now we have a dataset entity that holds a dictionary of papers, and provides two contextual views, one showing all the papers, and the second showing just the selected design study papers.
VRDatasetExamples new vrDataset
We still need a way to use the paper dictionary.
⇒
Evaluate the first snippet to assign the dataset and id variables.
⇒
Evaluate and inspect the second snippet.
⇒
Extract the second snippet as a method paperWithId:
dataset := VRDatasetExamples new vrDataset.
id := 128.
dataset paperDict at: id.
⇒
Rewrite the paper128 example to retrieve the paper from the vrDataset example using the new paperWithId: API.
VRDatasetExamples new paper128
Here are all the changes so far:
Object << #VRDataset
    slots: { #allJSON . #selectedJSON . #paperDict . #authorDict };
    tag: 'Model';
    package: 'VisualisationReview'.
Object class << VRDataset class
    slots: {}
"protocol: #'instance creation'"
VRDataset class >> forAllCSV: aCSVfile andSelectedCSV: anotherCSVFile
    ^ self new
        allJSON: (CSV2JSON for: aCSVfile contents) jsonObject
        selectedJSON: (CSV2JSON for: anotherCSVFile contents) jsonObject
"protocol: #initialization"
VRDataset >> allJSON: aDictionary selectedJSON: anotherDictionary
    self allJSON: aDictionary.
    self selectedJSON: anotherDictionary
"protocol: #accessing"
VRDataset >> selectedJSON
    ^ selectedJSON
"protocol: #accessing"
VRDataset >> allJSON
    ^ allJSON
"protocol: #accessing"
VRDataset >> selectedJSON: aDictionary
    selectedJSON := aDictionary
"protocol: #accessing"
VRDataset >> allJSON: aDictionary
    allJSON := aDictionary
"protocol: #accessing"
VRDataset >> allPapers
    ^ VRPaperGroup withAll: self paperDict values
"protocol: #accessing"
VRDataset >> selectedPapers
    ^ self allPapers select: #isDesignStudy
"protocol: #views"
VRDataset >> gtAllPapersFor: aView
    <gtView>
    ^ aView forward
        title: 'All papers';
        priority: 10;
        object: [ self allPapers ];
        view: #gtItemsFor:
"protocol: #views"
VRDataset >> gtSelectedPapersFor: aView
    <gtView>
    ^ aView forward
        title: 'Selected papers';
        priority: 20;
        object: [ self selectedPapers ];
        view: #gtItemsFor:
"protocol: #accessing"
VRDataset >> paperDict
    ^ paperDict
        ifNil: [ paperDict := ((self allJSON collect: [ :d | VRPaper for: d in: self ])
                collect: [ :p | p id -> p ]) asDictionary ]
"protocol: #querying"
VRDataset >> paperWithId: id
    ^ self paperDict at: id
"protocol: #examples"
VRDatasetExamples >> vrDataset
    <gtExample>
    ^ VRDataset
        forAllCSV: self datasetFolder / 'dataset-review-vissoft-all.csv'
        andSelectedCSV: self datasetFolder / 'dataset-review-vissoft-selected.csv'
"protocol: #accessing"
VRDatasetExamples >> paper128
    <gtExample>
    | paper |
    paper := self vrDataset paperWithId: 128.
    self assert: paper id equals: 128.
    self assert: paper isDesignStudy.
    ^ paper
Object << #VRPaper
    slots: { #data . #dataset };
    tag: 'Model';
    package: 'VisualisationReview'.
Object class << VRPaper class
    slots: {}
"protocol: #accessing"
VRPaper >> dataset
    ^ dataset
"protocol: #accessing"
VRPaper >> dataset: aDataset
    dataset := aDataset
"protocol: #'instance creation'"
VRPaper class >> for: aDictionary in: aDataset
    ^ self new
        data: aDictionary;
        dataset: aDataset;
        yourself
To sync to this point in the tutorial (throwing away any other changes) evaluate:
VRTutorialExamples new fileIn8Dataset
To extract the authors we need to do a bit of work, because several columns in the CSV (and hence several attributes in the JSON) represent authors. We need to extract all these names and create domain entities for them.
If we inspect the following expression, we find 12 author fields (including the typo “Nineth” :-).
VRDatasetExamples new paper128 data keys select: [:s | s endsWith: 'Author']
Here are the author fields in the correct order:
#('First Author' 'Second Author' 'Third Author' 'Fourth Author' 'Fifth Author' 'Sixth Author' 'Seven Author' 'Eighth Author' 'Nineth Author' 'Tenth Author' 'Eleventh Author' 'Twelfth Author')
⇒
Define a constant method authorFields for a VRPaper, returning the array of author fields in the correct order.
⇒
Prototype and extract a method authorNames that returns the array of non-empty author names for a paper.
Hint:
Inspect the paper128 example and prototype the method's code in the playground.
Use collect:thenSelect: to collect the fields, and then select the non-empty ones.
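As a small, self-contained illustration of collect:thenSelect:, the following sketch uses a made-up dictionary (not the tutorial's data) standing in for a paper's data. It can be evaluated in any playground:

```smalltalk
| data fields |
"Hypothetical stand-in for the data dictionary of a paper."
data := Dictionary new.
data at: 'First Author' put: 'Ada'.
data at: 'Second Author' put: ''.
data at: 'Third Author' put: 'Grace'.
fields := #('First Author' 'Second Author' 'Third Author').
"Collect the value of each field, then keep only the non-empty ones."
fields collect: [ :f | data at: f ] thenSelect: #isNotEmpty
```

The empty 'Second Author' entry is dropped, leaving only 'Ada' and 'Grace'.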
VRDatasetExamples new paper128
Note that the author names are surrounded by double quotes.
⇒
Use the trimBoth: method to remove the quotes from author names before returning them in authorNames.
Hint:
String>>#trimBoth: takes a block as an argument that should return true for characters to be trimmed. We want to trim the character $".
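For instance, applied to a sample quoted string (the literal here is only an illustration), the trimming block removes the surrounding double quotes:

```smalltalk
"The block answers true for every character that should be trimmed from both ends."
'"Lanza, M."' trimBoth: [ :c | c = $" ]
```

Only the leading and trailing quote characters are removed, so the comma and the period inside the name are untouched.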
⇒
Extra task: go back and fix the paper titles to also trim away the double quotes where they appear.
Now we can ask the question: Are the recorded numbers of authors consistent with the number of authors found for each paper?
VRDatasetExamples new vrDataset allPapers reject: [ :p | p authorNames size = (p data at: '# Authors') asNumber ]
Curiously, we find 4 errors in the dataset where the number of authors found does not match the number recorded in the dataset!
We can generate a list of all the authors in the dataset:
(VRDatasetExamples new vrDataset allPapers flatCollect: #authorNames) copyWithoutDuplicates sorted
That's fine, but we really want Author domain entities rather than just strings, so we can start to ask questions about the authors. We'll take the same approach as we did for the papers, having the dataset object create a dictionary of authors keyed by their names, and having each author hold a reference to the dataset so we can query it for an author's papers.
But wait — papers also have a dataset slot, and so will other domain entities, so before we create the new author class, let's do a little refactoring, and introduce VRDomainEntity as an abstract superclass of VRPaper.
⇒
Create the class VRDomainEntity as a Model class in the VisualisationReview package using the fixit tool.
⇒
Inspect paper 128 and change its superclass to VRDomainEntity in the Meta view.
⇒
Push up the dataset slot and its accessors to the new superclass.
Hint:
Right-click on the dataset slot to open a menu to push it up. Also right-click on the methods to push them up.
⇒
Also push up the class-side for:in: instance creation method.
VRDomainEntity
VRDatasetExamples new paper128
Now we can introduce authors as domain entities. An author knows their name, their dataset, and the papers they have authored. We'd like to be able to instantiate a single author object like this:
aDataset := VRDatasetExamples new vrDataset.
aPaper := aDataset paperWithId: 128.
aName := aPaper authorNames first.
anAuthor := VRAuthor named: aName in: aDataset.
anAuthor addPaper: aPaper.
anAuthor
⇒
Create the VRAuthor class by fixing the above snippet. Create it as a subclass of VRDomainEntity so it will inherit the dataset slot and accessors. Add a slot name with accessors, as well as a slot papers.
⇒
Add an initialize method that sets the papers slot to a new OrderedCollection.
⇒
Fix the addPaper: method to simply add: the argument to the papers collection.
⇒
Evaluate the snippet to verify that the instance is initialized as expected.
Now we would like the dataset to maintain a dictionary of authors just as it does for papers, but the procedure will be a little different, because we have to iterate over the papers to create the dictionary.
⇒
Add an initializeAuthorDict method to the dataset.
Hint:
Inspect the dataset example, and prototype the code to initialize the dictionary in an inspector playground. The code should look something like this:
VRDatasetExamples new vrDataset.
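The prototype code is not reproduced at this point. A sketch consistent with the initializeAuthorDict method in the change listing below, written for the inspector playground where self is the dataset, might be:

```smalltalk
| dict |
dict := Dictionary new.
"For every paper, look up (or create) each of its authors and register the paper."
self allPapers do: [ :p |
    p authorNames do: [ :n |
        (dict at: n ifAbsentPut: [ VRAuthor named: n in: self ]) addPaper: p ] ].
dict
```

Using at:ifAbsentPut: ensures there is exactly one VRAuthor per name, no matter how many papers mention that author.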
Now we just need to use it to actually initialize the slot.
⇒
Introduce a slot authorDict and add a getter (we don't need a setter).
⇒
Change the getter to lazily initialize the authorDict slot, like this:
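The getter is missing here. It should match the authorDict method in the change listing below:

```smalltalk
VRDataset >> authorDict
    "Initialize the dictionary on first access, by which time allJSON has been set."
    ^ authorDict ifNil: [ authorDict := self initializeAuthorDict ]
```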
NB: we don't have to initialize the authorDict lazily, but we must do it after the allJSON slot has been set. Another place we could initialize the paper and author dictionaries is in the allJSON:selectedJSON: initialization method.
⇒
Inspect self authorDict values in the inspector playground to verify that it returns the array of authors.
⇒
Extract this code as a method authors.
Now we are at a point we saw before with papers: we have an author domain entity, but we haven't molded anything yet, so it just shows us plain author objects with default Raw and Print views. We'll fix that next.
VRDatasetExamples new vrDataset authors
Changes this time:
Object << #VRDomainEntity
    slots: { #dataset };
    tag: 'Model';
    package: 'VisualisationReview'.
Object class << VRDomainEntity class
    slots: {}
"protocol: #accessing"
VRDomainEntity >> dataset
    ^ dataset
"protocol: #accessing"
VRDomainEntity >> dataset: aDataset
    dataset := aDataset
VRDomainEntity << #VRPaper
    slots: { #data };
    tag: 'Model';
    package: 'VisualisationReview'.
VRDomainEntity class << VRPaper class
    slots: {}
"protocol: #accessing"
VRPaper >> authorNames
    ^ (self authorFields collect: [ :f | self data at: f ] thenSelect: #isNotEmpty)
        collect: [ :n | n trimBoth: [ :c | c == $" ] ]
"protocol: #constant"
VRPaper >> authorFields
    ^ #('First Author' 'Second Author' 'Third Author' 'Fourth Author' 'Fifth Author' 'Sixth Author' 'Seven Author' 'Eighth Author' 'Nineth Author' 'Tenth Author' 'Eleventh Author' 'Twelfth Author')
VRDomainEntity << #VRAuthor
    slots: { #name . #papers };
    tag: 'Model';
    package: 'VisualisationReview'.
VRDomainEntity class << VRAuthor class
    slots: {}
"protocol: #accessing"
VRAuthor >> name
    ^ name
"protocol: #accessing"
VRAuthor >> name: aString
    name := aString
"protocol: #accessing"
VRAuthor >> papers
    ^ papers
"protocol: #accessing"
VRAuthor >> addPaper: aPaper
    papers add: aPaper
"protocol: #initialization"
VRAuthor >> initialize
    papers := OrderedCollection new
"protocol: #accessing"
VRAuthor class >> named: aName in: aDataset
    ^ self new
        name: aName;
        dataset: aDataset;
        yourself
"protocol: #initialization"
VRDataset >> initializeAuthorDict
    | dict |
    dict := Dictionary new.
    self allPapers
        do: [ :p |
            p authorNames
                do: [ :n | (dict at: n ifAbsentPut: [ VRAuthor named: n in: self ]) addPaper: p ] ].
    ^ dict
"protocol: #accessing"
VRDataset >> authorDict
    ^ authorDict ifNil: [ authorDict := self initializeAuthorDict ]
"protocol: #accessing"
VRDataset >> authors
    ^ self authorDict values
To sync to this point in the tutorial (throwing away any other changes) evaluate:
VRTutorialExamples new fileIn9Authors
We now have authors as domain entities, but nothing has been molded, so we only have the default Raw and Print views, and a list of authors shows us nothing of interest.
VRDatasetExamples new vrDataset authors
Luckily we have been through the molding process already with papers, so we can just follow the same steps here.
⇒
Inspect an author, and mold the Print view by implementing printOn: to display its name. Verify that a list of authors now displays authors by name.
VRDatasetExamples new vrDataset authors first
We have a dictionary of authors, but we aren't using it yet. We can retrieve an author by name as follows:
examples := VRDatasetExamples new.
dataset := examples vrDataset.
aName := 'Lanza, M.'.
dataset authorDict at: aName.
⇒
Evaluate the above snippet, and then highlight the last line to extract a new dataset method authorNamed:.
⇒
Extract the whole snippet as a new example in VRDatasetExamples named authorLanza. Clean up the extracted method, and add some suitable assertions.
VRDatasetExamples new authorLanza
⇒
Change the papers method of the author to return a VRPaperGroup instead of a plain collection.
⇒
Extract a Papers forwarding view to self papers in an Inspector playground of the author example.
⇒
Add a Summary view to an author, showing the author's name and the number of papers published in the dataset.
Now we have the papers of an author, but what about the authors of a paper?
⇒
In the Inspector playground of paper128, prototype an expression to retrieve from the paper's dataset the authors corresponding to the authorNames of the paper. Extract this as a method authors.
VRDatasetExamples new paper128
You probably came up with something like this:
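The expected expression is not shown here. Based on the final authors method in the listing below, it is presumably along these lines:

```smalltalk
"Evaluated in the playground of the paper; self is the VRPaper."
"Map each author name of the paper to the corresponding VRAuthor in the dataset."
self authorNames collect: [ :n | self dataset authorNamed: n ]
```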
This is fine, but the result is an Array, which is not moldable per se. Instead we would like an author group, just like the paper group we introduced earlier. This sounds like a good case for refactoring.
⇒
Introduce a new class VRDomainEntityGroup that uses the trait TGtGroupWithItems. Make VRPaperGroup a subclass of this new class, and remove its trait usage.
⇒
Introduce a new subclass of VRDomainEntityGroup called VRAuthorGroup.
⇒
Change the authors method of VRPaper to return a VRAuthorGroup of authors.
⇒
Add a gtAuthorsFor: view to the author group, displaying just the authors' names for now.
⇒
Add a gtAuthorsFor: view to a paper, so we can see the full list of authors of a paper.
VRDatasetExamples new paper128
⇒
Change the authors method of a dataset to return a sorted group of authors.
⇒
Add an Authors view to the dataset.
⇒
Add a # Papers column to the Authors view.
VRDatasetExamples new vrDataset
We might well ask: who are the co-authors a given author has worked with?
⇒
In an Inspector playground of an example author, prototype an expression to return the co-authors of an author.
Take care that the list of co-authors does not include the author themself!
⇒
Extract a method coauthors and an associated view.
⇒
Add the # of co-authors to the author summary view and to the author group overview.
VRDatasetExamples new authorLanza
⇒
Who is the author with the largest number of co-authors? The smallest?
Note how we can now navigate comfortably between papers, authors and co-authors, just by clicking between views.
"protocol: #printing"
VRAuthor >> printOn: aStream
    aStream nextPutAll: self name
"protocol: #querying"
VRDataset >> authorNamed: aName
    ^ self authorDict at: aName
"protocol: #examples"
VRDatasetExamples >> authorLanza
    <gtExample>
    | dataset author |
    dataset := self vrDataset.
    author := dataset authorNamed: 'Lanza, M.'.
    self assert: author name equals: 'Lanza, M.'.
    self assert: author papers size equals: 18.
    ^ author
"protocol: #accessing"
VRAuthor >> papers
    ^ VRPaperGroup withAll: papers
"protocol: #views"
VRAuthor >> gtPapersFor: aView
    <gtView>
    ^ aView forward
        title: 'Papers';
        priority: 10;
        object: [ self papers ];
        view: #gtItemsFor:
Object << #VRDomainEntityGroup
    traits: {TGtGroupWithItems};
    slots: {};
    tag: 'Model';
    package: 'VisualisationReview'.
Object class << VRDomainEntityGroup class
    traits: {TGtGroupWithItems classTrait}
VRDomainEntityGroup << #VRPaperGroup
    slots: {};
    tag: 'Model';
    package: 'VisualisationReview'.
VRDomainEntityGroup class << VRPaperGroup class
VRDomainEntityGroup << #VRAuthorGroup
    slots: {};
    tag: 'Model';
    package: 'VisualisationReview'.
VRDomainEntityGroup class << VRAuthorGroup class
"protocol: #accessing"
VRPaper >> authors
    ^ VRAuthorGroup
        withAll: (self authorNames collect: [ :n | self dataset authorNamed: n ])
"protocol: #views"
VRAuthorGroup >> gtAuthorsFor: aView
    <gtView>
    ^ aView columnedList
        title: 'Authors';
        priority: 10;
        items: [ self items ];
        column: 'Index'
            text: [ :eachItem :eachIndex | eachIndex asRopedText foreground: Color gray ]
            width: 45;
        column: 'Value' text: [ :each | each gtDisplayString ];
        actionUpdateButton
"protocol: #views"
VRPaper >> gtAuthorsFor: aView
    <gtView>
    ^ aView forward
        title: 'Authors';
        priority: 20;
        object: [ self authors ];
        view: #gtAuthorsFor:
"protocol: #views"
VRAuthor >> gtSummaryFor: aView
    <gtView>
    ^ aView columnedList
        title: 'Summary';
        priority: 10;
        items: [ {{'Name'. self name}.
                {'# Papers'. self papers size}} ];
        column: 'Key'
            text: #first
            width: 100;
        column: 'Value' text: #second;
        actionUpdateButton
"protocol: #accessing"
VRDataset >> authors
    ^ VRAuthorGroup withAll: (self authorDict values sortedAs: #name)
"protocol: #views"
VRDataset >> gtAuthorsFor: aView
    <gtView>
    ^ aView forward
        title: 'Authors';
        priority: 30;
        object: [ self authors ];
        view: #gtAuthorsFor:
"protocol: #views"
VRAuthorGroup >> gtAuthorsFor: aView
    <gtView>
    ^ aView columnedList
        title: 'Authors';
        priority: 10;
        items: [ self items ];
        column: 'Index'
            text: [ :eachItem :eachIndex | eachIndex asRopedText foreground: Color gray ]
            width: 45;
        column: 'Author'
            text: [ :each | each gtDisplayString ]
            width: 150;
        column: '# Papers'
            text: [ :each | each papers size ]
            width: 80;
        actionUpdateButton
"protocol: #accessing"
VRAuthor >> coauthors
    ^ VRAuthorGroup
        withAll: (((self papers flatCollect: #authors) copyWithout: self) copyWithoutDuplicates
                sortedAs: #name)
"protocol: #views"
VRAuthor >> gtCoauthorsFor: aView
    <gtView>
    ^ aView forward
        title: 'Co-Authors';
        priority: 30;
        object: [ self coauthors ];
        view: #gtAuthorsFor:
"protocol: #views"
VRAuthor >> gtSummaryFor: aView
    <gtView>
    ^ aView columnedList
        title: 'Summary';
        priority: 10;
        items: [ {{'Name'. self name}.
                {'# Papers'. self papers size}.
                {'# Co-authors'. self coauthors size}} ];
        column: 'Key'
            text: #first
            width: 100;
        column: 'Value' text: #second;
        actionUpdateButton
"protocol: #views"
VRAuthorGroup >> gtAuthorsFor: aView
    <gtView>
    ^ aView columnedList
        title: 'Authors';
        priority: 10;
        items: [ self items ];
        column: 'Index'
            text: [ :eachItem :eachIndex | eachIndex asRopedText foreground: Color gray ]
            width: 45;
        column: 'Author'
            text: [ :each | each gtDisplayString ]
            width: 150;
        column: '# Papers'
            text: [ :each | each papers size ]
            width: 80;
        column: '# Co-authors'
            text: [ :each | each coauthors size ]
            width: 80;
        actionUpdateButton
To sync to this point in the tutorial (throwing away any other changes) evaluate:
VRTutorialExamples new fileIn10MoldingAuthors
Next: Part 4. Molding CSV data — actions, queries and visualizations