Skip to content
Marcelo Cantos edited this page Jan 6, 2020 · 4 revisions

ωBNF: The ultimate grammar behind Arr.ai's dynamic parsing system

Arr.ai supports dynamic runtime parsing, using a grammar notation called ωBNF. This allows grammars to be entered into Arr.ai and used to create data structures via sugared syntax. If that doesn't make much sense, here's an example. The following snippet of code looks like XML:

{%xml:
<city id="123">
  <name>Melbourne</name>
  <country>Australia</name>
</city>
%}

But it's really syntactic sugar for something resembling the following Arr.ai syntax:

["city", (id: "123"),
    ["name", "Melbourne"],
    ["country", "Australia"],
]

The purpose of this document is not to explain how the above works. It will instead focus on introducing ωBNF syntax. One it makes a little sense, go to the bottom of the page for a sample XML grammar that can parse the above XML snippet.

Grammar

ωBNF is self-hosting:

// Non-terminals
grammar -> stmt+;
stmt    -> COMMENT | prod;
prod    -> IDENT "->" term+ ";";
term    -> term:op="^"
         ^ term:op="|"
         ^ term+
         ^ named quant*;
quant   -> op=/{[?*+]}
         | "{" min=INT? "," max=INT? "}"
         | op=/{<:|:>?} lbang="!"? named rbang="!"?;
named   -> (IDENT op="=")? atom;
atom    -> IDENT | STR | RE | "(" term ")" | "(" ")";

// Terminals
IDENT   -> /{[A-Za-z_\.]\w*};
STR     -> /{"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|‵(?:‵‵|[^‵]*)‵};
INT     -> /{\d+};
RE      -> /{/{((?:\\.|[^\\\}])*)\}};
COMMENT -> /{//.*$|(?s:/\*(?:[^*]|\*+[^*/])\*/)};

// Special
.wrapRE -> /{\s*()\s*};

source

Of course, it might help to understand the constructs in order to be able to read this grammar to begin with.

Syntax rules

grammar, stmt, comment

grammar -> stmt+;
stmt    -> COMMENT | prod;
COMMENT -> /{//.*$|(?s:/\*(?:[^*]|\*+[^*/])\*/)};

A production defines a term that can be referred to from other parts of a grammar. There is no root production in a grammar. Any production can be used to parse some content.

A comment can be either a single line comment:

// This is a comment.

or a delimited comment:

/* This comment flows onto
   multiple lines. */

Delimited comments cannot be nested.

prod: a -> b c

prod    -> ident "->" term+ ";";

A production defines a term that can be referred to by. Any production can be used to parse some content.

term

term    -> term:op="^"
         ^ term:op="|"
         ^ term+
         ^ named quant*;
named   -> (IDENT op="=")? atom;

A term defines a set of patterns that describe the structure of some language.

Precedence stack: a ^ b

The above grammar depictes a term as a set of lines with a ^ separating some elements. The ^ operator denotes a precedence stack. Earlier elements have a lower precedence than later ones.

In each part, any reference to the production's name is actually a reference to the next level down in the stack.

Precedence towers are in fact a shorthand for multiple mutually referential productions. E.g.: the following forms are roughly equivalent:

// Conventional                | // Precedence stack
expr -> add;                   |
add  -> mul:"+";               | expr -> expr:"+"
mul  -> neg:"*";               |       ^ expr:"*"
neg  -> "-"? atom;             |       ^ "-"? expr
atom -> /{\d+} | "(" expr ")"; |       ^ /{\d+} | "(" expr ")";

The only material difference is that, in the conventional form, every term belongs to a production and can therefore be parsed independently. This isn't an issue in practice, since you rarely want to parse from, say, mul down in the epxression grammar example. You can always start at expr and still recognise multiplicative expressions. Also, in the event that you do need to refer into the stack, it's relatively straightforward to break it somewhere in the middle:

expr   -> expr:"+"
        ^ expr:"*"
unary: -> "-"? unary
        ^ \{\d+} | "(" expr ")";

And ultimately the conventional form is still available in ωBNF.

Choice: a | b

Terms separated by | are alternatives. A matching against either side will result in a successful parse. E.g.: "A" | "B" | "C" will parse any of the letters A, B or C.

Sequence: a b

Adjacent terms will match sequentially. For instance "A" "B" "C" will match ABC while "A" ("B" | "C") will match AB and AC.

Named term: <foo>a

A named term has no semantic significance. It used to label certain terms for easier reference in code that consumes parser output. The details of how named terms work are outside the scope of this document.

Atoms and quantifiers

         ^ named quant*;
quant   -> op=/{[?*+]}
         | "{" min=INT? "," max=INT? "}"
         | op=/{<:|:>?} lbang="!"? named rbang="!"?;
named   -> (IDENT op="=")? atom;
atom    -> IDENT | STR | RE | "(" term ")" | "(" ")";


Identifier: foo

An identifier such as expr matches the production it names.

String: "struct"

A string matches an exact string, e.g. "int" or "+". String syntax is fairly close to Go string syntax:

double qoutes: "abc"  // " must be escaped as \". Other escapes work as usual.
single quotes: 'abc'  // ' must be escaped as \'. Other escapes work as usual.
backquotes:    `abc`  // ` must be escaped as ``. Other escapes not supported.

Regular expression: /{[0-9]+}

A regular expression, or regexp, matches strings conforming to a pattern, e.g.: /{[a-z]+}. RE2 Syntax describes the syntax for regular expressions.

Parentheses: (a b c)

A parenthesised expression is useful for grouping terms in a way that precedence doesn't naturally follow, e.g.: , e.g. ("A" | "B") "C". With parentheses, this matches AC and BC. Without them, it would match A and BC.

Optional: annotation?

Optional quantifiers match zero or once instance of the associated term. E.g.: "A"? matches the empty string or A.

Repetition: a* a+ a{2, 4}

Repetition quantifiers match repeated sequences:

  1. a* matches zero or more repetitions of a. E.g., "A"* matches the empty string or A, AA, AAA, ...
  2. a+ matches one or more repetitions of a. E.g., "A"+ matches A, AA, AAA, ...
  3. "A"{m,n} matches a sequence of between m and n A's, inclusive. E.g., "AB"{2, 3} matches AB, ABAB and ABABAB.

All repeat quantifiers are greedy, consuming as much input as they can.

Delimitation: expr:","

Delimitation quantifiers match repeated terms with a delimiter separating each one.

  1. a:b matches a sequence of a delimited by b. E.g., "A":"+" matches A, A+A, A+A+A, ...
    • There is no left-to-right or right-to-left precedence defined for a:b. The output of parsing a delimited sequence is simply a flat list of as and bs., E.g.: /{[a-z0-9]}:("="|"!="|"<="|"<"|">"|">=") will match 0<=x<10.
    • This may be changed as follows:
      • a:>b triggers left-to-right precedence. /{\d}:>"+" will match 1+2+3+4 and output them as binary groups: (((1+2)+3)+4).
      • a<:b triggers left-to-right precedence. /{\d}:>"**" will match x**y**z and output (x**(y**z)).
  2. The following variants of a:b are also defined:
    • a:!b allows the sequence to optionally start with b.
    • a:b! allows the sequence to optionally end with b.
    • a:!b! allows the sequence to optionally start and/or end with b.
    • These variants exist primarily for completeness, so that every possible operation of two terms can be expressed without repeating either term. For instance, a:b can be written as a(ba)*, but this requires a to be repeated.
    • One practical use is optional trailing commas, E.g.: params -> expr:","!.
    • Precedence adjustments and the above optionality modifiers are orthogonal and can be applied interchangeably, e.g.: "A":>!"*" will match A*A*A and will output (A*A)*A. It will also match *A*A*A and will output (((∅*A)*A)*A) ( denotes a missing value).

All delimited quantifiers are greedy, consuming as much input as they can.

Terminals

IDENT   -> /{[A-Za-z_\.]\w*};
STR     -> /{"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|‵(?:‵‵|[^‵]*)‵};
INT     -> /{\d+};
RE      -> /{/{((?:\\.|[^\\\}])*)\}};
COMMENT -> /{//.*$|(?s:/\*(?:[^*]|\*+[^*/])\*/)};

There is no formal distinction between terminals and non-terminals. They're just a bunch of recurring regexps with a convention of being written in all uppercase.

Special production

The following isn't a real production. It's a way to globally alter the parser's behaviour.

.wrapRE

.wrapRE -> /{\s*()\s*};

.wrapRE is a special production. If it exists, it must hold a single regexp term with a () somewhere inside it. This will be used as a substiution for every other regexp that appears in the grammar. The sterotypical usage is as in the ωBNF grammar itself, where it is used to tweak all other regexps so that they skip all whitespace (outside strings and regexps, that is).

Samples

XML

The following is a very simplified version of the XML grammar expressed in ωBNF:

xml  -> "<" NAME attr* ">" xml* "</" NAME ">" | CDATA=/{[^<]*};
attr -> NAME "=" value=/{"[^"]*"};
NAME -> /{[A-Za-z_:][-A-Za-z0-9._:]}
Clone this wiki locally