SmaCC grammar
SmaCC is a GLR parser generator. It generates both scanner and parser classes from a single definition. The scanner class transforms a string or stream of characters into a stream of tokens. The parser uses the tokens to create abstract syntax tree (AST) or whatever object that is defined by the grammar.
The scanner grammar is defined by token rules:
file:///TokenName : RegularExpression ;
Each token has a TokenName that is a valid Smalltalk variable name and must be surrounded by the <> characters. After the <TokenName> is a colon (:) character. Every token rule must end with a semicolon (;). RegularExpression is a regular expressions that matches one or more characters. There are many different syntaxes for regular expressions. SmaCC uses the following syntax for its regular expressions:
\character : Matches a special character. The character immediately following the backslash is matched exactly, unless it is a letter. Backslash-letter combinations have other meanings and are specified below.
\cLetter : Matches a control character. Control characters are the first 26 characters (e.g., \cA equals Character value: 0). The letter that follows the \c must be an uppercase letter.
\d : Matches a digit, 0-9.
\D : Matches anything that is not a digit.
\f : Matches a form-feed character, Character value: 12.
\n : Matches a newline character, Character value: 10.
\r : Matches a carriage return character, Character value: 13.
\s : Matches any whitespace character, [ \f\n\r\t\v].
\S : Matches any non-whitespace character.
\t : Matches a tab, Character value: 9.
\v : Matches a vertical tab, Character value: 11.
\w : Matches any letter, number or underscore, [A-Za-z0-9_].
\W : Matches anything that is not a letter, number or underscore.
\xHexNumber : Matches a character specified by the hex number following the \x. For example, \x20 matches the space character (Character value: 16r20), and \x1FFF matches Character value: 16r1FFF.
<token> : Copies the definition of <token> into the current regular expression. For example, if we have <hexdigit> : \d | [A-F] ;, we can use <hexdigit> in a later rule: <hexnumber> : <hexdigit> + ;.
<isMethod> : Copies the characters where Character>>isMethod returns true into the current regular expression. For example, instead of using \d, we could use <isDigit> since Character>>#isDigit
returns true for digits.
[characters] : Matches one of the characters inside the []. This is a shortcut for the | operator. In addition to single characters, you can also specify character ranges with the - character. For example, [a-z] matches any lower case letter.
[^characters] : Matches any character not listed in the characters block. [^a] matches anything except for a.
[[boolean expression]] : Evaluates the code inside the [[]]. If the code returns true then scanning of this regular expression continues. This can be used to test if a token occurs in a particular column or position in the input.
. : Matches any character.
# comment : Creates a comment that is ignored by SmaCC. Everything from the # to the end of the line is ignored.
exp1 | exp2 : Matches either exp1 or exp2.
exp1 exp2 : Matches exp1 followed by exp2. \d \d matches two digits.
exp* : Matches exp zero or more times. 0* matches "" and "000".
exp? : Matches exp zero or one time. 0? matches only "" or "0".
exp{min,max} : Matches exp at least min times but no more than max times. 0{1,2} matches only "0" or "00". It does not match "" or "000".
(exp) : Groups exp for precedence. For example, (a b)* matches "ababab". Without the parentheses, a b * would match "abbbb" but not "ababab".
Since there are multiple ways to combine expressions, we need precedence rules for their combination. The or operator, |, has the lowest precedence and the *, ?, +, and {,} operators have the highest precedence. For example, a | b c * matches "a" or "bcccc", but not "accc" or "bcbcbc". If you wish to match "a" or "b" followed by any number of c's, you need to use (a | b) c *.
If your scanner has a method name that matches the name of the token, (e.g. whitespace), that method will get called upon a match of that type. The SmaCCScanner
superclass already has a default implementation of SmaCCScanner>>#whitespace
and SmaCCScanner>>#comment
. These methods ignore those tokens by default.
If a token is not referenced from a grammar specification, it will not be included in the generated scanner, unless the token's name is also a name of a method (see previous section). This, coupled with the ability to do substitutions, allows you to have the equivalent of macros within your scanner specification. However, be aware that if you are simply trying to generate a scanner, you will have to make sure that you create a dummy parser specification that references all of the tokens that you want in the final scanner.
Parsing converts the stream of tokens provided by the scanner into some object. Normally, this object will be a abstract syntax tree (AST) or parse tree, but it does not have to be an AST. For example, the SmaCC tutorial shows a calculator. This calculator does not produce a parse tree; it produces the result, a number.
The production rules contains the definition of the parser. The first production rule is considered to be the starting rule for the parser. Each production rule consists of a non-terminal symbol name followed by a : separator which is followed by a list of possible productions separated by vertical bar, |, and finally terminated by a semicolon, ;.
Production : file:///someToken AnotherProduction "end" ;
Each production consists of a sequence of production rule names, token names, or keywords followed by some optional Smalltalk code enclosed in curly {} or square [] brackets or a AST node definition enclosed in two curly brackets, {{}}. Production rule names are valid Smalltalk variable names and must be defined somewhere in the parser definition. Forward references are valid. Tokens are enclosed in angle brackets as they are defined in the scanner grammar above and keywords are enclosed in double-quotes (e.g., "then"). Keywords that contain double-quotes need to have two double-quotes per each double-quote in the keyword. For example, if you need a keyword for one double-quote character, you would need to enter """" (four double-quote characters). Productions can use () to group items and also ?, *, and + modifiers to specify the number of occurrences for an item similar to what is allowed in the scanner definition above.
The Smalltalk code is evaluated whenever that production is matched. If the code is a zero or a one argument symbol, then that method is performed. For a one argument symbol, the argument is an OrderedCollection that contains one element for each item in the production. If the code isn't a zero or one argument symbol, then the code is executed and whatever is returned by the code is the result of the production. If no Smalltalk code is specified, then the default action is to execute the SmaCCParser>>#reduceFor:
method (unless you are producing an AST parser). This method converts all items into an OrderedCollection
. If one of the items is another OrderedCollection, then all of its elements are added to the new collection.
Inside the Smalltalk code you can refer to the values of each production item by using literal strings. The literal string, '1', refers the to value of the first production item. The values for tokens and keywords will be SmaCCToken
objects. The value for all non-terminal symbols will be whatever the Smalltalk code evaluates to for that non-terminal symbol. While you can reference every value using an integer for the name in the Smalltalk code, using an index makes it difficult to modify the grammar. For example, if you add something to the beginning of a rule, then you must modify all of the indices in the Smalltalk code. Instead you can name each symbol in the production and then refer to the name in the Smalltalk code. To name a symbol (non-terminal, token, or keyword), you need to add a single-quoted variable name after the symbol in the grammar. For example, MySymbol : Expression 'expr' "+" <number> 'num' {expr + num value asInteger} ; creates two named variables. One for the non-terminal Expression and one for the <number> token. These variables are then used in the Smalltalk code.
To generate an AST from the parser, you need to specify a root class for the AST hierarchy using the %root RootClass; directive. The RootClass will be combined with the %prefix Prefix; and %suffix Suffix; directives to create a new AST class (PrefixRootClassSuffix) as a subclass of SmaCCParseNode
. The %root directive is required to create an AST, but the %prefix and %suffix directives are optional.
Except for the %root class, classes are named based on their names inside the {{}} part of a production rule. If the name is explicitly defined, then a class with the %prefix and %suffix added will be created. For example, if we have {{Something}} and with the %prefix Prefix; and %suffix Suffix; directives, we would create a PrefixSomethingSuffix class name. This class would be a direct subclass of the %root class unless it was defined in a %hierarchy directive. If a production rule contains only a {{}} without a name, then the production's name will be used as the class name. If the name is in snake case, it will be converted to camel case to match Smalltalk's class naming strategy. For example, a production like this my_production : {{}}; would create a PrefixMyProductionSuffix class given the %prefix and %suffix directives above.
Named symbols in a production will be converted into instance variables of the generated AST node. If the name can occur multiple times, then the name will be pluralized and the variable will hold a collection. For example, a production like this prod : <integer> 'value'+ {{}}; will create a PrefixProdSuffix class with a values instance variable.
Not every production needs to specify an AST node name {{}}. If a production doesn't specify a name, then its variables will be added to all productions that use it.
The %hierarchy Superclass (Subclass1 Subclass2); directive can be used to create a hierarchy of the AST classes. For example, this directive would create a PrefixSuperclassSuffix class that has PrefixSubclass1Suffix and PrefixSubclass2Suffix as its subclasses. Variables that are common among all of the subclasses will be pushed up to the superclass. Also, the names listed in the %hierarchy directive do not need to exist inside any production rule. They will be created even if they are not listed.
In addition to the %root, %prefix, %suffix, and %hierarchy listed above, SmaCC has other directives to control code generation.
%start : The %start directive lists other potential starting states for the parser. Without listing any starting production, SmaCC generates a parser that starts with the first production in the file. To use one of the %start productions, you need to use the SmaCCParser>>#parse:startingAt:
. The value of the second argument comes from one of the startingStateFor methods that is generated from the %start directive. For example, if you have %start Expression;, you can parse with the startingAt: argument of startingStateForExpression.
%ignorecase; : This directive generates a case-insensitive scanner.
%glr; : Generates a GLR(1) parser instead of the standard LALR(1) parser. GLR parsers can handle ambiguous grammars by parsing all possible interpretations.
%id : The %id directive generates Id methods in the scanner that returns the scanner's id for the token.
%left, %right, %nonassoc: These directives tell SmaCC how to handle some ambiguous productions. The %left directive specifies that we should perform a reduce action instead of a shift so that an expression such as "1 + 2 + 3" is parsed as "(1 + 2) + 3" instead of "1 + (2 + 3)". The %right directive performs the shift action instead of the reduction so the expression is parsed as "1 + (2 + 3)". The %nonassoc directive specifies that it should be an error if this case occurs. For example, you might want to throw an error if you try to do multiple comparisons: "1 <= 2 <= 3". These directives can occur multiple times in a file. Items that occur later in the file will have priority over items that occur earlier.
%nfa; : SmaCC generates the regular expressions into methods on the scanner class. For large grammars, this can create an extremely large scanner class. The %nfa directive makes the scanner that is driven by the nfa created from the regular expressions. This scanner is quite a bit slower than the method based scanners, but can be faster to generate. When an nfa scanner is run for the first time, it builds the nfa from the grammar at stores it in the scanner class.