Molding CSV data — modeling other entities

This is Part 3 of Tutorial: molding CSV data into an explainable system. In this part, we continue by modeling and molding further domain entities, in particular a dataset as an entity that provides a simple API for finding papers and other domain objects, and authors who contribute to writing papers.

If you are jumping into the tutorial at this point, and have some previously saved work not already in this image, you should file it in now. Otherwise, you can file in a sample snapshot reflecting the work completed at the end of the previous part:

VRTutorialExamples new fileIn7PaperGroup

Modeling the dataset

There are several more domain entities we would like to make explicit: authors, venues, and design studies. We also want to make sure that there is only one instance of each in an instantiated model of a dataset, so the same author doesn't appear multiple times. We can therefore introduce a Dataset entity that wraps the source CSV files, and keeps track of the instances we create. Instances of domain entities should know their dataset, so they can query it. Let's start by introducing the Dataset entity, and have it keep track of papers by ID, using a dictionary.

⇒ Create the VRDataset class with instance creation method forAllCSV:andSelectedCSV:. ⇒ Have this method convert the CSV files to JSON and store them using an instance initialization method allJSON:selectedJSON:. ⇒ Introduce slots allJSON and selectedJSON and their accessors, and use them in the initialization method. ⇒ Extract the code below as an example vrDataset.

VRDataset
	forAllCSV: VRDatasetExamples new datasetFolder / 'dataset-review-vissoft-all.csv'
	andSelectedCSV: VRDatasetExamples new datasetFolder / 'dataset-review-vissoft-selected.csv'

VRDatasetExamples new vrDataset

So far so good. Now we have to create and track the paper entities. ⇒ In the contextual playground of the dataset, evaluate this code:
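A minimal sketch of such a snippet, assuming the single-argument creation method VRPaper class>>for: from the previous part (that selector is an assumption, as it is not shown in this part):

```smalltalk
"Evaluated in the contextual playground of the dataset, so self is the VRDataset."
self allJSON collect: [ :d | VRPaper for: d ]
```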

This is an array of all the papers. We want (1) each paper to know which dataset it comes from, so it can use it to pose queries about other entities, and (2) we want to build a dictionary of these papers keyed by ID. ⇒ Adapt the instance creation method to add a new parameter for the dataset:
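One way the adapted call might look, again assuming the existing creation method is VRPaper class>>for:. The new selector for:in: matches the method listed in the changes at the end of this section; it does not exist yet at this point, so the environment will offer to create it:

```smalltalk
"Pass the dataset (self) along with each JSON dictionary."
self allJSON collect: [ :d | VRPaper for: d in: self ]
```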

⇒ Fix the new method to look like this:
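For reference, this is the version of the method that appears in the listing of changes at the end of this section:

```smalltalk
VRPaper class >> for: aDictionary in: aDataset
	"Create a paper wrapping its JSON data and remembering its dataset."
	^ self new
		data: aDictionary;
		dataset: aDataset;
		yourself
```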

⇒ Continue to fix this method, adding the dataset slot and accessors. ⇒ Adapt the playground snippet to create a dictionary instead of an array of papers, for example, like this:
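A possible form of the adapted snippet, taken from the body of the paperDict method listed at the end of this section:

```smalltalk
"Map each paper to an id -> paper association, then build a dictionary."
((self allJSON collect: [ :d | VRPaper for: d in: self ])
	collect: [ :p | p id -> p ]) asDictionary
```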

⇒ Extract this as a method paperDict. ⇒ Edit and fix the new paperDict method to cache the result in a paperDict slot:
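The cached version, as it appears in the listing of changes at the end of this section:

```smalltalk
VRDataset >> paperDict
	"Build the dictionary on first access only, then reuse the cached value."
	^ paperDict
		ifNil: [ paperDict := ((self allJSON collect: [ :d | VRPaper for: d in: self ])
					collect: [ :p | p id -> p ]) asDictionary ]
```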

Note that the paperDict slot is now lazily initialized upon its first access.

⇒ Prototype and extract an allPapers method. Hint: Just ask the dictionary for its values. Wrap the resulting array as a VRPaperGroup. ⇒ Extract an All Papers view for the dataset. Hint: In the contextual playground, evaluate self allPapers, and primary click to select Create <gtView> forward to Papers. Adapt the generated method name and the title to reflect the content (all papers). ⇒ Do the same for selected papers: prototype and extract a selectedPapers method, and extract a view for it. Hint: apply a select: query to the allPapers group. ⇒ Add priority: specifications to the All Papers and Selected Papers views, e.g., set them to 10 and 20, respectively. Hint: Secondary-click on the tabs to directly access their source code, without having to switch to the Meta view.

Now we have a dataset entity that holds a dictionary of papers, and provides two contextual views, one showing all the papers, and the second showing just the selected design study papers.

VRDatasetExamples new vrDataset

We still need a way to use the paper dictionary. ⇒ Evaluate the first snippet to assign the dataset and id variables. ⇒ Evaluate and inspect the second snippet. ⇒ Extract the second snippet as a method paperWithId:

dataset := VRDatasetExamples new vrDataset.
id := 128.

dataset paperDict at: id.

⇒ Rewrite the paper128 example to retrieve the paper from the vrDataset example using the new paperWithId: API.

VRDatasetExamples new paper128

Here are all the changes so far:

Object << #VRDataset
	slots: { #allJSON . #selectedJSON . #paperDict . #authorDict };
	tag: 'Model';
	package: 'VisualisationReview'.

Object class << VRDataset class
	slots: {}

"protocol: #'instance creation'"

VRDataset class >> forAllCSV: aCSVfile andSelectedCSV: anotherCSVFile
	^ self new
		allJSON: (CSV2JSON for: aCSVfile contents) jsonObject
		selectedJSON: (CSV2JSON for: anotherCSVFile contents) jsonObject

"protocol: #initialization"

VRDataset >> allJSON: aDictionary selectedJSON: anotherDictionary
	self allJSON: aDictionary.
	self selectedJSON: anotherDictionary

"protocol: #accessing"

VRDataset >> selectedJSON
	^ selectedJSON

"protocol: #accessing"

VRDataset >> allJSON
	^ allJSON

"protocol: #accessing"

VRDataset >> selectedJSON: aDictionary
	selectedJSON := aDictionary

"protocol: #accessing"

VRDataset >> allJSON: aDictionary
	allJSON := aDictionary

"protocol: #accessing"

VRDataset >> allPapers
	^ VRPaperGroup withAll: self paperDict values

"protocol: #accessing"

VRDataset >> selectedPapers
	^ self allPapers select: #isDesignStudy

"protocol: #views"

VRDataset >> gtAllPapersFor: aView
	<gtView>
	^ aView forward
		title: 'All papers';
		priority: 10;
		object: [ self allPapers ];
		view: #gtItemsFor:

"protocol: #views"

VRDataset >> gtSelectedPapersFor: aView
	<gtView>
	^ aView forward
		title: 'Selected papers';
		priority: 20;
		object: [ self selectedPapers ];
		view: #gtItemsFor:

"protocol: #accessing"

VRDataset >> paperDict
	^ paperDict
		ifNil: [ paperDict := ((self allJSON collect: [ :d | VRPaper for: d in: self ])
					collect: [ :p | p id -> p ]) asDictionary ]

"protocol: #querying"

VRDataset >> paperWithId: id
	^ self paperDict at: id

"protocol: #examples"

VRDatasetExamples >> vrDataset
	<gtExample>
	^ VRDataset
		forAllCSV: self datasetFolder / 'dataset-review-vissoft-all.csv'
		andSelectedCSV: self datasetFolder / 'dataset-review-vissoft-selected.csv'

"protocol: #accessing"

VRDatasetExamples >> paper128
	<gtExample>
	| paper |
	paper := self vrDataset paperWithId: 128.
	self assert: paper id equals: 128.
	self assert: paper isDesignStudy.
	^ paper

Object << #VRPaper
	slots: { #data . #dataset };
	tag: 'Model';
	package: 'VisualisationReview'.

Object class << VRPaper class
	slots: {}

"protocol: #accessing"

VRPaper >> dataset
	^ dataset

"protocol: #accessing"

VRPaper >> dataset: aDataset
	dataset := aDataset

"protocol: #'instance creation'"

VRPaper class >> for: aDictionary in: aDataset
	^ self new
		data: aDictionary;
		dataset: aDataset;
		yourself

To sync to this point in the tutorial (throwing away any other changes) evaluate:

VRTutorialExamples new fileIn8Dataset

Modeling authors

To extract the authors we need to do a bit of work, because the authors of a paper are spread over several columns in the CSV, and hence over several attributes in the JSON. We need to extract all these names and create domain entities for them. If we inspect the following expression, we find 12 author fields (including the typo “Nineth” :-).

VRDatasetExamples new paper128 data keys select: [:s | s endsWith: 'Author']

Here are the author fields in the correct order:

#('First Author' 'Second Author' 'Third Author' 'Fourth Author' 'Fifth Author' 'Sixth Author' 'Seven Author' 'Eighth Author' 'Nineth Author' 'Tenth Author' 'Eleventh Author' 'Twelfth Author')

⇒ Define a constant method authorFields for a VRPaper, returning the array of author fields in the correct order. ⇒ Prototype and extract a method authorNames that returns the array of non-empty author names for a paper. Hint: Inspect the paper128 example and prototype the method's code in the playground. Use collect:thenSelect: to collect the fields, and then select the non-empty ones.

VRDatasetExamples new paper128

Note that the author names are surrounded by double quotes. ⇒ Use the trimBoth: method to remove the quotes from author names before returning them in authorNames. Hint: String>>#trimBoth: trims the characters satisfying the condition given in its block argument from both sides of the receiving string, so the block should return true for the characters to be trimmed. We want to trim the character $". ⇒ Extra task: go back and fix the paper titles to also trim away the double quotes where they appear.
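As a small illustration of trimBoth:, here is a playground sketch using a made-up quoted string (the literal is only an example value, not read from the dataset):

```smalltalk
"Trim the surrounding double quotes from a sample name."
'"Lanza, M."' trimBoth: [ :c | c = $" ]
"→ 'Lanza, M.'"
```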

Now we can ask the question: Are the recorded numbers of authors consistent with the number of authors found for each paper?

VRDatasetExamples new vrDataset allPapers
	reject: [ :p | p authorNames size = (p data at: '# Authors') asNumber ]

Curiously, we find 4 errors in the dataset where the number of authors found does not match the number recorded in the dataset!

We can generate a list of all the authors in the dataset:

(VRDatasetExamples new vrDataset allPapers flatCollect: #authorNames)
	copyWithoutDuplicates sorted

That's fine, but we really want Author domain entities rather than just strings, so we can start to ask questions about the authors. We'll take the same approach as we did for the papers, having the dataset object create a dictionary of authors keyed by their names, and having each author hold a reference to the dataset so we can query it for an author's papers.

But wait — papers also have a dataset slot, and so will other domain entities, so before we create the new author class, let's do a little refactoring, and introduce VRDomainEntity as an abstract superclass of VRPaper. ⇒ Create the class VRDomainEntity as a Model class in the VisualisationReview package using the fixit tool. ⇒ Inspect paper 128 and change its superclass to VRDomainEntity in the Meta view. ⇒ Push up the dataset slot and its accessors to the new superclass. Hint: Right-click on the dataset slot to open a menu to push it up. Also right-click on the methods to push them up. ⇒ Also push up the class-side for:in: instance creation method.

VRDomainEntity

VRDatasetExamples new paper128

Now we can introduce authors as domain entities. An author knows their name, their dataset, and the papers they have authored. We'd like to be able to instantiate a single author object like this:

aDataset := VRDatasetExamples new vrDataset.
aPaper := aDataset paperWithId: 128.
aName := aPaper authorNames first.
anAuthor := VRAuthor named: aName in: aDataset.
anAuthor addPaper: aPaper.
anAuthor

⇒ Create the VRAuthor class by fixing the above snippet. Create it as a subclass of VRDomainEntity so it will inherit the dataset slot and accessors. Add a slot name with accessors, as well as a slot papers. ⇒ Add an initialize method that sets the papers slot to a new OrderedCollection. ⇒ Fix the addPaper: method to simply add: the argument to the papers collection. ⇒ Evaluate the snippet to verify that the instance is initialized as expected.

Now we would like the dataset to maintain a dictionary of authors just as it has for papers, but the procedure will be a little different, because we have to iterate over the papers to create the dictionary. ⇒ Add an initializeAuthorDict method to the dataset. Hint: Inspect the dataset example, and prototype the code to initialize the dictionary in an inspector playground. The code should look something like this:

VRDatasetExamples new vrDataset.
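Evaluated in the playground of the dataset inspected above (where self is the dataset), the prototype might look as follows. This is a sketch derived from the initializeAuthorDict method listed under "Changes this time":

```smalltalk
| dict |
dict := Dictionary new.
"For every author name of every paper, create the author on first sight
 and record the paper with that author."
self allPapers do: [ :p |
	p authorNames do: [ :n |
		(dict at: n ifAbsentPut: [ VRAuthor named: n in: self ]) addPaper: p ] ].
dict
```

Using at:ifAbsentPut: ensures that each name maps to exactly one VRAuthor instance, however many papers mention it.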

Now we just need to use it to actually initialize the slot. ⇒ Introduce a slot authorDict and add a getter (we don't need a setter). ⇒ Change the getter to lazily initialize the authorDict slot, like this:
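The lazy getter, as it appears in the listing under "Changes this time":

```smalltalk
VRDataset >> authorDict
	"Initialize the dictionary on first access, after allJSON has been set."
	^ authorDict ifNil: [ authorDict := self initializeAuthorDict ]
```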

NB: we don't have to initialize the authorDict lazily, but we must do it after the allJSON slot has been set. Another place we could initialize the paper and author dictionaries is in the allJSON:selectedJSON: initialization method. ⇒ Inspect self authorDict values in the inspector playground to verify that it returns the array of authors. ⇒ Extract this code as a method authors.

Now we are at a point we saw before with papers: we have an author domain entity, but we haven't molded anything yet, so it just shows us plain author objects with default Raw and Print views. We'll fix that next.

VRDatasetExamples new vrDataset authors

Changes this time:

Object << #VRDomainEntity
	slots: { #dataset };
	tag: 'Model';
	package: 'VisualisationReview'.

Object class << VRDomainEntity class
	slots: {}

"protocol: #accessing"

VRDomainEntity >> dataset
	^ dataset

"protocol: #accessing"

VRDomainEntity >> dataset: aDataset
	dataset := aDataset

VRDomainEntity << #VRPaper
	slots: { #data };
	tag: 'Model';
	package: 'VisualisationReview'.

VRDomainEntity class << VRPaper class
	slots: {}

"protocol: #accessing"

VRPaper >> authorNames
	^ (self authorFields collect: [ :f | self data at: f ] thenSelect: #isNotEmpty)
		collect: [ :n | n trimBoth: [ :c | c == $" ] ]

"protocol: #constant"

VRPaper >> authorFields
	^ #('First Author' 'Second Author' 'Third Author' 'Fourth Author' 'Fifth Author' 'Sixth Author' 'Seven Author' 'Eighth Author' 'Nineth Author' 'Tenth Author' 'Eleventh Author' 'Twelfth Author')

VRDomainEntity << #VRAuthor
	slots: { #name . #papers };
	tag: 'Model';
	package: 'VisualisationReview'.

VRDomainEntity class << VRAuthor class
	slots: {}

"protocol: #accessing"

VRAuthor >> name
	^ name

"protocol: #accessing"

VRAuthor >> name: aString
	name := aString

"protocol: #accessing"

VRAuthor >> papers
	^ papers

"protocol: #accessing"

VRAuthor >> addPaper: aPaper
	papers add: aPaper

"protocol: #initialization"

VRAuthor >> initialize
	papers := OrderedCollection new

"protocol: #accessing"

VRAuthor class >> named: aName in: aDataset
	^ self new
		name: aName;
		dataset: aDataset;
		yourself

"protocol: #initialization"

VRDataset >> initializeAuthorDict
	| dict |
	dict := Dictionary new.
	self allPapers
		do: [ :p | 
			p authorNames
				do: [ :n | (dict at: n ifAbsentPut: [ VRAuthor named: n in: self ]) addPaper: p ] ].
	^ dict

"protocol: #accessing"

VRDataset >> authorDict
	^ authorDict ifNil: [ authorDict := self initializeAuthorDict ]

"protocol: #accessing"

VRDataset >> authors
	^ self authorDict values

To sync to this point in the tutorial (throwing away any other changes) evaluate:

VRTutorialExamples new fileIn9Authors

Molding authors

We now have authors as domain entities, but nothing has been molded, so we only have the default Raw and Print views, and a list of authors shows us nothing of interest.

VRDatasetExamples new vrDataset authors

Luckily we have been through the molding process already with papers, so we can just follow the same steps here. ⇒ Inspect an author, and mold the Print view by implementing printOn: to display its name. Verify that a list of authors now displays authors by name.

VRDatasetExamples new vrDataset authors first

We have a dictionary of authors, but we aren't using it yet. We can retrieve an author by name as follows:

examples := VRDatasetExamples new.
dataset := examples vrDataset.
aName := 'Lanza, M.'.
dataset authorDict at: aName.

⇒ Evaluate the above snippet, and then highlight the last line to extract a new dataset method authorNamed:. ⇒ Extract the whole snippet as a new example in VRDatasetExamples named authorLanza. Clean up the extracted method, and add some suitable assertions.

VRDatasetExamples new authorLanza

⇒ Change the papers method of the author to return a VRPaperGroup instead of a plain collection. ⇒ Extract a Papers forwarding view to self papers in an Inspector playground of the author example. ⇒ Add a Summary view to an author, showing the author's name and the number of papers published in the dataset.

Now we have the papers of an author, but what about the authors of a paper? ⇒ In the Inspector playground of paper128, prototype an expression to retrieve from the paper's dataset the authors corresponding to the authorNames of the paper. Extract this as a method authors.

VRDatasetExamples new paper128

You probably came up with something like this:
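A likely candidate, matching the inner expression of the VRPaper>>authors method in the listing at the end of this section:

```smalltalk
"Evaluated in the Inspector playground of paper128, so self is the paper."
self authorNames collect: [ :n | self dataset authorNamed: n ]
```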

This is fine, but the result is an Array, which is not moldable per se. Instead we would like an author group, just like the paper group we introduced earlier. This sounds like a good case for refactoring. ⇒ Introduce a new class VRDomainEntityGroup that uses the trait TGtGroupWithItems. Make VRPaperGroup a subclass of this new class, and remove its trait usage. ⇒ Introduce a new subclass of VRDomainEntityGroup called VRAuthorGroup. ⇒ Change the authors method of VRPaper to return a VRAuthorGroup of authors. ⇒ Add a gtAuthorsFor: view to the author group, displaying just the authors' names for now. ⇒ Add a gtAuthorsFor: view to a paper, so we can see the full list of authors of a paper.

VRDatasetExamples new paper128

⇒ Change the authors method of a dataset to return a sorted group of authors. ⇒ Add an Authors view to the dataset. ⇒ Add a # Papers column to the Authors view.

VRDatasetExamples new vrDataset

We might well ask, who are the co-authors a given author has worked with? ⇒ In an Inspector playground of an example author, prototype an expression to return the co-authors of an author. Take care that the list of co-authors does not include the author themself! ⇒ Extract a method coauthors and an associated view. ⇒ Add the # of co-authors to the author summary view and to the author group overview.

VRDatasetExamples new authorLanza

⇒ Who is the author with the largest number of co-authors? The fewest?
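One possible way to explore this question in a playground is sketched below. It assumes the author group answers its underlying collection via items (as used in the gtAuthorsFor: views) and relies on the standard collection methods detectMax: and detectMin:. If several authors tie, only one of them is returned.

```smalltalk
| authors |
authors := VRDatasetExamples new vrDataset authors items.
"Answer a pair: an author with the most co-authors, and one with the fewest."
{ authors detectMax: [ :a | a coauthors size ].
  authors detectMin: [ :a | a coauthors size ] }
```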

Note how we can now navigate comfortably between papers, authors and co-authors, just by clicking between views.

Changes this time:

"protocol: #printing"

VRAuthor >> printOn: aStream
	aStream nextPutAll: self name

"protocol: #querying"

VRDataset >> authorNamed: aName
	^ self authorDict at: aName

"protocol: #examples"

VRDatasetExamples >> authorLanza
	<gtExample>
	| dataset author |
	dataset := self vrDataset.
	author := dataset authorNamed: 'Lanza, M.'.
	self assert: author name equals: 'Lanza, M.'.
	self assert: author papers size equals: 18.
	^ author

"protocol: #accessing"

VRAuthor >> papers
	^ VRPaperGroup withAll: papers

"protocol: #views"

VRAuthor >> gtPapersFor: aView
	<gtView>
	^ aView forward
		title: 'Papers';
		priority: 10;
		object: [ self papers ];
		view: #gtItemsFor:

Object << #VRDomainEntityGroup
	traits: {TGtGroupWithItems};
	slots: {};
	tag: 'Model';
	package: 'VisualisationReview'.

Object class << VRDomainEntityGroup class
	traits: {TGtGroupWithItems classTrait}

VRDomainEntityGroup << #VRPaperGroup
	slots: {};
	tag: 'Model';
	package: 'VisualisationReview'.

VRDomainEntityGroup class << VRPaperGroup class

VRDomainEntityGroup << #VRAuthorGroup
	slots: {};
	tag: 'Model';
	package: 'VisualisationReview'.

VRDomainEntityGroup class << VRAuthorGroup class

"protocol: #accessing"

VRPaper >> authors
	^ VRAuthorGroup
		withAll: (self authorNames collect: [ :n | self dataset authorNamed: n ])

"protocol: #views"

VRAuthorGroup >> gtAuthorsFor: aView
	<gtView>
	^ aView columnedList
		title: 'Authors';
		priority: 10;
		items: [ self items ];
		column: 'Index'
			text: [ :eachItem :eachIndex | eachIndex asRopedText foreground: Color gray ]
			width: 45;
		column: 'Value' text: [ :each | each gtDisplayString ];
		actionUpdateButton

"protocol: #views"

VRPaper >> gtAuthorsFor: aView
	<gtView>
	^ aView forward
		title: 'Authors';
		priority: 20;
		object: [ self authors ];
		view: #gtAuthorsFor:

"protocol: #views"

VRAuthor >> gtSummaryFor: aView
	<gtView>
	^ aView columnedList
		title: 'Summary';
		priority: 10;
		items: [ {{'Name'.
					self name}.
				{'# Papers'.
					self papers size}} ];
		column: 'Key'
			text: #first
			width: 100;
		column: 'Value' text: #second;
		actionUpdateButton

"protocol: #accessing"

VRDataset >> authors
	^ VRAuthorGroup withAll: (self authorDict values sortedAs: #name)

"protocol: #views"

VRDataset >> gtAuthorsFor: aView
	<gtView>
	^ aView forward
		title: 'Authors';
		priority: 30;
		object: [ self authors ];
		view: #gtAuthorsFor:

"protocol: #views"

VRAuthorGroup >> gtAuthorsFor: aView
	<gtView>
	^ aView columnedList
		title: 'Authors';
		priority: 10;
		items: [ self items ];
		column: 'Index'
			text: [ :eachItem :eachIndex | eachIndex asRopedText foreground: Color gray ]
			width: 45;
		column: 'Author'
			text: [ :each | each gtDisplayString ]
			width: 150;
		column: '# Papers'
			text: [ :each | each papers size ]
			width: 80;
		actionUpdateButton

"protocol: #accessing"

VRAuthor >> coauthors
	^ VRAuthorGroup
		withAll: (((self papers flatCollect: #authors) copyWithout: self) copyWithoutDuplicates
				sortedAs: #name)

"protocol: #views"

VRAuthor >> gtCoauthorsFor: aView
	<gtView>
	^ aView forward
		title: 'Co-Authors';
		priority: 30;
		object: [ self coauthors ];
		view: #gtAuthorsFor:

"protocol: #views"

VRAuthor >> gtSummaryFor: aView
	<gtView>
	^ aView columnedList
		title: 'Summary';
		priority: 10;
		items: [ {{'Name'.
					self name}.
				{'# Papers'.
					self papers size}.
				{'# Co-authors'.
					self coauthors size}} ];
		column: 'Key'
			text: #first
			width: 100;
		column: 'Value' text: #second;
		actionUpdateButton

"protocol: #views"

VRAuthorGroup >> gtAuthorsFor: aView
	<gtView>
	^ aView columnedList
		title: 'Authors';
		priority: 10;
		items: [ self items ];
		column: 'Index'
			text: [ :eachItem :eachIndex | eachIndex asRopedText foreground: Color gray ]
			width: 45;
		column: 'Author'
			text: [ :each | each gtDisplayString ]
			width: 150;
		column: '# Papers'
			text: [ :each | each papers size ]
			width: 80;
		column: '# Co-authors'
			text: [ :each | each coauthors size ]
			width: 80;
		actionUpdateButton

To sync to this point in the tutorial (throwing away any other changes) evaluate:

VRTutorialExamples new fileIn10MoldingAuthors

Next: Part 4. Molding CSV data — actions, queries and visualizations