Creating SmaCC parsers by example
On this page, we have some simple example parsers that show parts of the SmaCC grammar. As our first example, we create a simple csv parser that just returns the collection of tokens. It doesn't handle quoted fields, but that could be added for a more complex regular expression for the <value> token.
simpleCsv
<gtExample>
<after: #cleanUp>
| grammar value |
grammar := '
<value> : [^\,\r\n];
<eol> : \r?\n | \r;
File
: (Line? <eol>)* Line?
;
Line
: Value ("," Value)*
;
Value : <value>;
'.
self compileParser: grammar.
self assert: (self parserClass parse: '') equals: OrderedCollection new.
self assert: (self parserClass parse: '\') size equals: 1.
value := self parserClass parse: '1,2,3,4\\5,6,7,8' withCRs.
self assert: value size equals: 16.
^ value
The previous example isn't that useful since it is just returning the tokens. It can be used to verify that some string is in the valid format, but not much else. In the next example, we modify the grammar to return a collection of lines where each line is a collection of strings for each value. Since we are using Smalltalk expressions {} in our grammar, we need to modify the grammar to remove the * and + expressions. When generating the grammar, these are transformed into production rules so any variable declarations inside them will not be available to the outside.
simpleCsvAsValues
<gtExample>
<after: #cleanUp>
| grammar value |
grammar := '
<value> : [^\,\r\n];
<eol> : \r?\n | \r;
File
: Lines {''1''}
;
Lines
: Line {OrderedCollection with: ''1''}
| Lines ''lines'' <eol> Line ''line'' {lines add: line; yourself}
;
Line
: Values {''1'' contents}
| {#()}
;
Values
: Values ''stream'' "," Value ''value'' {stream nextPut: value; yourself}
| Value { {} writeStream nextPut: ''1''; yourself}
;
Value : <value> {''1'' value};
'.
self compileParser: grammar.
self assert: (self parserClass parse: '') equals: (OrderedCollection with: #()).
self
assert: (self parserClass parse: '\' withCRs)
equals: (OrderedCollection with: #() with: #()).
value := self parserClass parse: '\1,2,3,4\\5,6,7,8' withCRs.
self assert: value size equals: 4.
self assert: value first equals: #().
self assert: value second equals: #('1' '2' '3' '4').
self assert: value third equals: #().
self assert: value fourth equals: #('5' '6' '7' '8').
^ value
As an alternative, we can also modify the grammar to produce an AST. Since we are generating an AST, we can simply take the original grammar, add variable names for the items we want in the AST, add {{}} node definitions, and add the %root directive that specifies the root node for the AST. In the example, we also specify a %prefix and %suffix for the AST node class names. In the example, we use {{}} for the class names as the productions are already named as we want the classes named. Also, in the example, we use the singular variable names line and value. SmaCC sees that these can occur multiple times in the node definition and converts them into plural names lines and values.
simpleCsvAST
<gtExample>
<after: #cleanUp>
| grammar value |
grammar := '
%root Root;
%prefix GtCSV;
%suffix Node;
<value> : [^\,\r\n];
<eol> : \r?\n | \r;
File
: (Line ''line''? <eol>)* Line ''line''? {{}}
;
Line
: Value ''value'' ("," ''comma'' Value ''value'')* {{}}
;
Value : <value> ''string'' {{}};
'.
self compileParser: grammar.
value := self parserClass parse: '1,2,3,4\\5,6,7,8' withCRs.
self assert: value class name equals: #GtCSVFileNode.
self assert: value lines size equals: 2.
self assert: value lines first class name equals: #GtCSVLineNode.
self assert: value lines first values size equals: 4.
self assert: value lines first values first class name equals: #GtCSVValueNode.
self assert: value lines first values first string value equals: '1'.
^ value
For the next example, we create a simple calculator. We use the code expressions {} to evaluate the expressions. In the parser we allow whitespace to appear between the numbers and operators. Since the <whitespace> token has the same name as the SmaCCScanner>>#whitespace
method, the method is executed when the token is matched. The whitespace method just ignores the token. The calculator follows standard precedence rules where * and / are evaluated before + and -. We do that by using the %left directive. Since the %left "+" "-"; directive appears before the %left "*" "/"; directive, the + and - tokens have lower precedence than the * and / tokens. All of these are parsed using left associativity. The %left, %right and %nonassoc directives only work if the productions that cause ambiguity contains the token in the directive. For example, if we used Expression Op Expression with Op defined as "+" | "-" | "*" | "/", the directives would not work.
arithmeticEvaluation
<gtExample>
<after: #cleanUp>
| grammar value |
grammar := '
<number> : \d+ (\. \d+)?;
<whitespace> : \s+;
%left "+" "-";
%left "*" "/";
Expression
: Expression "+" Expression {''1'' + ''3''}
| Expression ''left'' "-" Expression ''right'' {left - right}
| Expression ''left'' "*" Expression ''right'' {left * right}
| Expression ''left'' "/" Expression ''right'' {left / right}
| "(" Expression ")" {''2''}
| <number> {''1'' value asNumber}
;
'.
self compileParser: grammar.
value := self parserClass parse: '1 + 2 - 3 * 4 / 5'.
self assert: value equals: 3 / 5.
^ value
We can modify the previous example grammar to create an AST instead of evaluating the expression. Like our previous AST generation example, we annotate the grammar with variable and class names and provide the %root, %prefix and %suffix directives. This example also uses the %hierarchy directory to define an Expression node class that is a superclass of Number and BinaryExpression. This %hierarchy directive helps when generating the code for the "(" ''leftParen'' Expression ")" ''rightParen'' {{}} reduction. In this line, the Expression does not contain a variable name so normally it would be dropped from the AST. However, there is a special case where if an unnamed production returns an AST node that is the same or sub type of type the production is returning, then that unnamed production item is returned instead of creating a new node. In this particular example, the leftParen and rightParen are added to the node that Expression returns. SmaCC also figures out that there can be multiple leftParen and rightParen objects so adds leftParens and rightParens variables to the Expression class.
astGeneration
<gtExample>
<after: #cleanUp>
| grammar value |
grammar := '
%root Root;
%prefix GtEx;
%suffix Node;
%hierarchy Expression (Number BinaryExpression);
<number> : \d+ (\. \d+)?;
<whitespace> : \s+;
%left "+" "-";
%left "*" "/";
Script
: Statement
| Expression
;
Statement
: Expression ''expression'' ";" ''semicolon'' {{}}
;
Expression
: Expression ''left'' "+" ''operator'' Expression ''right'' {{BinaryExpression}}
| Expression ''left'' "-" ''operator'' Expression ''right'' {{BinaryExpression}}
| Expression ''left'' "*" ''operator'' Expression ''right'' {{BinaryExpression}}
| Expression ''left'' "/" ''operator'' Expression ''right'' {{BinaryExpression}}
| "(" ''leftParen'' Expression ")" ''rightParen'' {{}}
| <number> ''value'' {{Number}}
;
'.
self compileParser: grammar.
self
assert: #GtExBinaryExpressionNode asClass superclass name
equals: #GtExExpressionNode.
self assert: #GtExExpressionNode asClass superclass name equals: #GtExRootNode.
self assert: #GtExStatementNode asClass superclass name equals: #GtExRootNode.
self
assert: #GtExExpressionNode asClass instVarNames asSortedCollection asArray
equals: #('leftParens' 'rightParens').
value := self parserClass parse: '((1 + 2)) - 3 * 4 / 5 ;'.
self assert: value class name equals: #GtExStatementNode.
self assert: value expression class name equals: #GtExBinaryExpressionNode.
self assert: value expression operator value equals: '-'.
self assert: value expression left leftParens size equals: 2.
self assert: value expression right operator value equals: '/'.
^ value
When using a GLR parser, it may be the case that multiple potential parses can occur at the same time. Should we immediately evaluate the Smalltalk expression when we reduce a production even if that production may not be used in a valid parse? {} expressions are only evaluated when we know that this is the only reduction possible, and [] expressions are always evaluated even if it may never be used in a valid parse. For example, consider the following two examples. Both implement a simple calculator that does +, -, and /. Depending on the input, the expressions are ever evaluated left-to-right or right-to-left. If the expression ends with |< then the expression is evaluated left-to-right, but if it ends with |> then it is right-to-left. Evaluating 1 / 2 - 2 left-to-right produces a value, but evaluting right-to-left throw a divide by zero error. In the first example, we use {} code expressions, so we can evaluate 1 / 2 - 2 |<, but evaluating 1 / 2 - 2 |> throws an error.
glrEvaluation
"Code in the {} blocks is evaluated only when we know this production will be used."
<gtExample>
<after: #cleanUp>
| grammar value |
grammar := '
%glr;
<number> : \d+ (\. \d+)?;
<whitespace> : \s+;
Expression
: Left "|" "<" {''1''}
| Right "|" ">" {''1''}
;
Left
: Left "+" Number {''1'' + ''3''}
| Left "-" Number {''1'' - ''3''}
| Left "/" Number {''1'' / ''3''}
| Number {''1''}
;
Right
: Number "+" Right {''1'' + ''3''}
| Number "-" Right {''1'' - ''3''}
| Number "/" Right {''1'' / ''3''}
| Number {''1''}
;
Number
: <number> {''1'' value asNumber}
;
'.
self compileParser: grammar.
value := self parserClass parse: '1 / 2 - 2 | <'.
self assert: value equals: -3 / 2.
[ self parserClass parse: '1 / 2 - 2 | >'.
self assert: false ] on: ZeroDivide do: [ :ex | ].
^ value
In the next example, we use [] expressions, so the code is evaluated as soon as possible even if that parse ends in failure. In this case, both 1 / 2 - 2 |< and 1 / 2 - 2 |> throw an error even though the first should return a value.
glrImmediateEvaluation
"Code in the [] blocks is evaluated immediately and may not be used in the final parse."
<gtExample>
<after: #cleanUp>
| grammar value |
grammar := '
%glr;
<number> : \d+ (\. \d+)?;
<whitespace> : \s+;
Expression
: Left "|" "<" [''1'']
| Right "|" ">" [''1'']
;
Left
: Left "+" Number [''1'' + ''3'']
| Left "-" Number [''1'' - ''3'']
| Left "/" Number [''1'' / ''3'']
| Number [''1'']
;
Right
: Number "+" Right [''1'' + ''3'']
| Number "-" Right [''1'' - ''3'']
| Number "/" Right [''1'' / ''3'']
| Number [''1'']
;
Number
: <number> [''1'' value asNumber]
;
'.
self compileParser: grammar.
value := self parserClass parse: '1 / 2 - 3 | <'.
self assert: value equals: -5 / 2.
value := self parserClass parse: '1 / 2 - 3 | >'.
self assert: value equals: -1.
[ self parserClass parse: '1 / 2 - 2 | <'. "Unlike the {} blocks, the [] block will throw an error here since the division is immediately evaluated before we know if we are parsing the expression as left or right associative"
self assert: false ] on: ZeroDivide do: [ :ex | ].
[ self parserClass parse: '1 / 2 - 2 | >'.
self assert: false ] on: ZeroDivide do: [ :ex | ].
^ grammar