1. Syntax Definition

We start with the definition of the concrete syntax of Calc using the syntax definition formalism SDF3. A syntax definition formalism is a language for defining all aspects of the syntax of a programming language. That goes well beyond a language for defining just the grammar; SDF3 also covers the definition of lexical syntax, AST constructors, formatting directives, and more.

1.1. Modules

To start with, syntax definitions consist of modules that can import other syntax modules. This is useful for dividing a large grammar into parts, but also for reusing a standard language component (e.g. expressions) between language definitions, or for composing the syntax of different languages. We first examine the main syntax module for Calc, which is named after the language and which imports module CalcLexical which defines the lexical syntax of the language:

module Calc
imports CalcLexical

The module defines the sorts Program and Exp as start symbols, which means that parsing starts with these sorts:

context-free start-symbols Program Exp

1.2. Context-Free Syntax

Syntactically, a language is a set of well-formed sentences. (In programming languages, sentences are typically known as programs.) Sentences are typically formed by composing different kinds of phrases, such as identifiers, constants, expressions, statements, functions, and modules. In grammar terminology there are two broad categories of phrases, terminals and non-terminals. In SDF3 we use sorts to identify both categories. For Calc we start with defining the Program and Stat sorts:

sorts Program Stat

A grammar consists of rules (known as productions) for composing phrases from sub-phrases. A Calc program consists of a list of statements that are either bindings that bind the value of an expression to a variable (identifier) or expression statements. This is defined by the following productions:

context-free syntax

  Program.Program = <<{Stat "\n"}+>>

  Stat.Bind = <<ID> = <Exp>;>
  Stat.Exp  = <<Exp>;>

If we take a closer look at the Stat.Bind production we see the following ingredients:

  • The production defines one of two alternatives for the Stat sort. The alternatives of a sort are defined by separate productions. This makes it possible to introduce productions in an order that makes sense for presenting a language definition. Instead of defining all productions for a sort in one block, it is rather possible to define the productions for different sorts that together define a language concept together. Furthermore, it enables modular definition of syntax.
  • The body a production defines the composition of sub-phrases that it corresponds to. Thus, the body <<ID> = <Exp>;> defines a bind statement as the composition of an identifier, followed by an equal sign, followed by an expression, terminated by a semicolon.
  • The body is known as a template and uses inverse quotation. The template makes everything inside literal elements of the text to be parsed, except for the quasi-quoted sorts (<ID> and <Exp>).
  • The sub-phrases are implicitly separated by layout (whitespace and comments). The definition of layout is not built-in. We will see the definition of the layout for Calc when we discuss lexical syntax below.
  • The constructor is used to construct abstract syntax tree nodes. Thus, the Bind constructor creates trees with two arguments trees for the identifier (ID) and expression (Exp) subtrees; in abstract syntax we leave out the literals and layout.

Note that a program is defined as a list of one or more statements, which could be expressed with the regular expression operator + as Stat+. The SDF3 notation {Sort sep}+ denotes a list of Sort phrases separated by sep. For example, {Exp ","}+ is a list of one or more expressions separated by commas. In the definition of statement lists we use a newline as separator. However, this does not imply that statements should be separated by newlines, but rather that newlines are inserted when formatting a program, as we will discuss below.

1.3. Expressions

Sorts and productions give us the basic concepts for defining syntax. Calc programs essentially consist of a sequence of expressions. So, the bulk of the its syntax definition consists of productions for various expression forms denoted by the Exp sort:

sorts Exp
context-free syntax
  Exp = <(<Exp>)> {bracket}

The bracket production defines that we can enclose an expression in parentheses. The bracket annotation states that we can ignore this production when constructing abstract syntax trees. That is, the abstract syntax tree for (x + 1) is the same as the abstract syntax tree for x + 1.

1.4. Operator Syntax

Operators are the workhorse of a language such as Calc. They capture the domain-specific operations that the language is built around. We start with the syntax of arithmetic operators:

context-free syntax // numbers
  Exp.Num = NUM
  Exp.Min = <-<Exp>>
  Exp.Pow = <<Exp> ^ <Exp>> {right}
  Exp.Mul = <<Exp> * <Exp>> {left}
  Exp.Div = <<Exp> / <Exp>> {left}
  Exp.Sub = <<Exp> - <Exp>> {left, prefer}
  Exp.Add = <<Exp> + <Exp>> {left}

Note that the concrete syntax is directly aligned with the abstract syntax. An addition is represented as the composition of two expression and the + symbol. This is best illustrated using term notation for abstract syntax trees. The term C(t1, ..., tn) denotes the abstract syntax tree for a production with constructor C and n sub-trees. For example, the term Add(Num("1"), Var("x")) represents the abstract syntax tree for the expression 1 + x.

The consequence of this direct alignment is that the grammar is ambiguous. According to the Exp.Add production there are two ways to parse the expression 1 + x + y, i.e. as Add(Add(Num("1"), Var("x")), Var("y")) or as Add(Num("1"), Add(Var("x"), Var("y"))).

A common approach to disambiguate the grammar for an expression language is by encoding the associativity and precedence of operators in the productions using additional sorts to represent precedence levels. However, that leads to grammars that are hard to understand and maintain and that do not have a one-to-one correspondence to the desired abstract syntax.

In SDF3, ambiguous expression syntax can be declaratively disambiguated using separate associativity and priority declarations. For example, the Exp.Add production above defines that addition is left associative. That is, the expression 1 + x + y should be interpreted as Add(Add(Num("1"), Var("x")), Var("y")), i.e. (1 + x) + y. The other operators are disambiguated similarly according to standard mathematical conventions. Note that power (exponentiation) is right associative, i.e. x ^ y ^ z is equivalent to x ^ (y ^ z).

comparison operators:

context-free syntax // numbers
  Exp.Eq  = <<Exp> == <Exp>> {non-assoc}
  Exp.Neq = <<Exp> != <Exp>> {non-assoc}
  Exp.Gt  = [[Exp] > [Exp]]  {non-assoc}
  Exp.Lt  = [[Exp] < [Exp]]  {non-assoc}

Non-assoc means that a phrase such as a < b == true is not syntactically well-formed. One should use parentheses, for example (a < b) == true, to explicitly indicate the disambiguation.

booleans:

context-free syntax // booleans

  Exp.True  = <true>
  Exp.False = <false>
  Exp.Not   = <!<Exp>>
  Exp.And   = <<Exp> & <Exp>> {left}
  Exp.Or    = <<Exp> | <Exp>> {left}

  Exp.If = <
    if(<Exp>)
      <Exp>
    else
      <Exp>
  >

variables:

context-free syntax // variables and functions

  Exp.Var = ID
  Exp.Let = <
    let <ID> = <Exp> in
    <Exp>
  >
  Exp.Fun = <\\ <ID+> . <Exp>>
  Exp.App = <<Exp> <Exp>> {left}

1.5. Disambiguation

priorities:

context-free priorities
  Exp.Min
  > Exp.App
  > Exp.Pow
  > {left: Exp.Mul Exp.Div}
  > {left: Exp.Add Exp.Sub}
  > {non-assoc: Exp.Eq Exp.Neq Exp.Gt Exp.Lt}
  > Exp.Not
  > Exp.And
  > Exp.Or
  > Exp.If
  > Exp.Let
  > Exp.Fun

sorts Type
context-free syntax
  Type.NumT  = <Num>
  Type.BoolT = <Bool>
  Type.FunT  = [[Exp] -> [Exp]] {right}
  Type       = <(<Type>)> {bracket}

template options
  ID = keyword {reject}

1.6. Lexical Syntax

lexical syntax:

module CalcLexical

identifiers:

lexical syntax
  ID = [a-zA-Z] [a-zA-Z0-9]*
lexical restrictions
  ID -/- [a-zA-Z0-9\_]

numbers:

lexical syntax // numbers
  INT      = "-"? [0-9]+
  IntGroup = [0-9][0-9][0-9]
  IntPref  = ([0-9] | ([0-9][0-9])) ","
  INT      = IntPref? {IntGroup ","}+
  FLOAT    = INT "." [0-9]+
  NUM      = INT | FLOAT
lexical restrictions
  INT   -/- [0-9]
  FLOAT -/- [0-9]
  NUM   -/- [0-9]

strings:

lexical syntax
  STRING         = "\"" StringChar* "\""
  StringChar     = ~[\"\n]
  StringChar     = "\\\""
  StringChar     = BackSlashChar
  BackSlashChar  = "\\"
lexical restrictions
  // Backslash chars in strings may not be followed by "
  BackSlashChar -/- [\"]

layout:

lexical syntax // layout: whitespace and comments
  LAYOUT         = [\ \t\n\r]
  CommentChar    = [\*]
  LAYOUT         = "/*" InsideComment* "*/"
  InsideComment  = ~[\*]
  InsideComment  = CommentChar
  LAYOUT         = "//" ~[\n\r]* NewLineEOF
  NewLineEOF     = [\n\r]
  NewLineEOF     = EOF
  EOF            =
lexical restrictions
  CommentChar -/- [\/]
  // EOF may not be followed by any char
  EOF         -/- ~[]
context-free restrictions
  // Ensure greedy matching for comments
  LAYOUT? -/- [\ \t\n\r]
  LAYOUT? -/- [\/].[\/]
  LAYOUT? -/- [\/].[\*]

1.7. Grammar Interpretations

A grammar can be interpreted for (at least) the following operations:

Parsing
Recognizing a well-formed sentence and constructing an abstract syntax tree
Signature
Derive schema that defines well-formed abstract syntax trees
Formatting
Map an abstract syntax tree to a well-formed sentence
Parse error recovery
When editing programs, the program text is often in a syntactically incorrect state. Since all editor services depend on an AST representation of the program, getting stuck on syntax errors would reduce the utility of an editor. To get a better editing experience, a parser with error recovery does a best effort job to parse as much as possible and still produce an AST.
Syntactic completion
Using a new language