-
Notifications
You must be signed in to change notification settings - Fork 15
ωBNF
Arr.ai supports dynamic runtime parsing, using a grammar notation called ωBNF. This allows grammars to be entered into Arr.ai and used to create data structures via sugared syntax. If that doesn't make much sense, here's an example. The following snippet of code looks like XML:
{%xml:
<city id="123">
<name>Melbourne</name>
<country>Australia</name>
</city>
%}
But it's really syntactic sugar for something resembling the following Arr.ai syntax:
["city", (id: "123"),
["name", "Melbourne"],
["country", "Australia"],
]
The purpose of this document is not to explain how the above works. It will instead focus on introducing ωBNF syntax. One it makes a little sense, go to the bottom of the page for a sample XML grammar that can parse the above XML snippet.
ωBNF is self-hosting:
// Non-terminals
grammar -> stmt+;
stmt -> COMMENT | prod;
prod -> IDENT "->" term+ ";";
term -> term:op="^"
^ term:op="|"
^ term+
^ named quant*;
quant -> op=/{[?*+]}
| "{" min=INT? "," max=INT? "}"
| op=/{<:|:>?} lbang="!"? named rbang="!"?;
named -> (IDENT op="=")? atom;
atom -> IDENT | STR | RE | "(" term ")" | "(" ")";
// Terminals
IDENT -> /{[A-Za-z_\.]\w*};
STR -> /{"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|‵(?:‵‵|[^‵]*)‵};
INT -> /{\d+};
RE -> /{/{((?:\\.|[^\\\}])*)\}};
COMMENT -> /{//.*$|(?s:/\*(?:[^*]|\*+[^*/])\*/)};
// Special
.wrapRE -> /{\s*()\s*};
Of course, it might help to understand the constructs in order to be able to read this grammar to begin with.
grammar -> stmt+;
stmt -> COMMENT | prod;
COMMENT -> /{//.*$|(?s:/\*(?:[^*]|\*+[^*/])\*/)};
A production defines a term that can be referred to from other parts of a grammar. There is no root production in a grammar. Any production can be used to parse some content.
A comment can be either a single line comment:
// This is a comment.
or a delimited comment:
/* This comment flows onto
multiple lines. */
Delimited comments cannot be nested.
prod -> ident "->" term+ ";";
A production defines a term that can be referred to by. Any production can be used to parse some content.
term -> term:op="^"
^ term:op="|"
^ term+
^ named quant*;
named -> (IDENT op="=")? atom;
A term defines a set of patterns that describe the structure of some language.
The above grammar depictes a term
as a set of lines with a ^
separating some
elements. The ^
operator denotes a precedence stack. Earlier elements have a
lower precedence than later ones.
In each part, any reference to the production's name is actually a reference to the next level down in the stack.
Precedence towers are in fact a shorthand for multiple mutually referential productions. E.g.: the following forms are roughly equivalent:
// Conventional | // Precedence stack
expr -> add; |
add -> mul:"+"; | expr -> expr:"+"
mul -> neg:"*"; | ^ expr:"*"
neg -> "-"? atom; | ^ "-"? expr
atom -> /{\d+} | "(" expr ")"; | ^ /{\d+} | "(" expr ")";
The only material difference is that, in the conventional form, every term
belongs to a production and can therefore be parsed independently. This isn't an
issue in practice, since you rarely want to parse from, say, mul
down in the
epxression grammar example. You can always start at expr
and still recognise
multiplicative expressions. Also, in the event that you do need to refer into
the stack, it's relatively straightforward to break it somewhere in the middle:
expr -> expr:"+"
^ expr:"*"
unary: -> "-"? unary
^ \{\d+} | "(" expr ")";
And ultimately the conventional form is still available in ωBNF.
Terms separated by |
are alternatives. A matching against either side will
result in a successful parse. E.g.: "A" | "B" | "C"
will parse any of the
letters A, B or C.
Adjacent terms will match sequentially. For instance "A" "B" "C"
will match
ABC while "A" ("B" | "C")
will match AB and AC.
A named term has no semantic significance. It used to label certain terms for easier reference in code that consumes parser output. The details of how named terms work are outside the scope of this document.
^ named quant*;
quant -> op=/{[?*+]}
| "{" min=INT? "," max=INT? "}"
| op=/{<:|:>?} lbang="!"? named rbang="!"?;
named -> (IDENT op="=")? atom;
atom -> IDENT | STR | RE | "(" term ")" | "(" ")";
An identifier such as expr
matches the production it names.
A string matches an exact string, e.g. "int"
or "+"
. String syntax is
fairly close to Go string syntax:
double qoutes: "abc" // " must be escaped as \". Other escapes work as usual.
single quotes: 'abc' // ' must be escaped as \'. Other escapes work as usual.
backquotes: `abc` // ` must be escaped as ``. Other escapes not supported.
A regular expression, or regexp, matches strings conforming to a
pattern, e.g.: /{[a-z]+}
. RE2 Syntax describes the syntax for regular
expressions.
A parenthesised expression is useful for grouping terms in a way that
precedence doesn't naturally follow, e.g.: , e.g. ("A" | "B") "C"
. With
parentheses, this matches AC and BC. Without them, it would match A and BC.
Optional quantifiers match zero or once instance of the associated term. E.g.:
"A"?
matches the empty string or A.
Repetition quantifiers match repeated sequences:
-
a*
matches zero or more repetitions ofa
. E.g.,"A"*
matches the empty string or A, AA, AAA, ... -
a+
matches one or more repetitions ofa
. E.g.,"A"+
matches A, AA, AAA, ... -
"A"{m,n}
matches a sequence of between m and n A's, inclusive. E.g.,"AB"{2, 3}
matches AB, ABAB and ABABAB.
All repeat quantifiers are greedy, consuming as much input as they can.
Delimitation quantifiers match repeated terms with a delimiter separating each one.
-
a:b
matches a sequence ofa
delimited byb
. E.g.,"A":"+"
matches A, A+A, A+A+A, ...- There is no left-to-right or right-to-left precedence defined for
a:b
. The output of parsing a delimited sequence is simply a flat list ofa
s andb
s., E.g.:/{[a-z0-9]}:("="|"!="|"<="|"<"|">"|">=")
will match0<=x<10
. - This may be changed as follows:
-
a:>b
triggers left-to-right precedence./{\d}:>"+"
will match1+2+3+4
and output them as binary groups:(((1+2)+3)+4)
. -
a<:b
triggers left-to-right precedence./{\d}:>"**"
will matchx**y**z
and output(x**(y**z))
.
-
- There is no left-to-right or right-to-left precedence defined for
- The following variants of
a:b
are also defined:-
a:!b
allows the sequence to optionally start withb
. -
a:b!
allows the sequence to optionally end withb
. -
a:!b!
allows the sequence to optionally start and/or end withb
. - These variants exist primarily for completeness, so that every possible
operation of two terms can be expressed without repeating either term. For
instance,
a:b
can be written asa(ba)*
, but this requiresa
to be repeated. - One practical use is optional trailing commas, E.g.:
params -> expr:","!
. - Precedence adjustments and the above optionality modifiers are orthogonal
and can be applied interchangeably, e.g.:
"A":>!"*"
will matchA*A*A
and will output(A*A)*A
. It will also match*A*A*A
and will output(((∅*A)*A)*A)
(∅
denotes a missing value).
-
All delimited quantifiers are greedy, consuming as much input as they can.
IDENT -> /{[A-Za-z_\.]\w*};
STR -> /{"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|‵(?:‵‵|[^‵]*)‵};
INT -> /{\d+};
RE -> /{/{((?:\\.|[^\\\}])*)\}};
COMMENT -> /{//.*$|(?s:/\*(?:[^*]|\*+[^*/])\*/)};
There is no formal distinction between terminals and non-terminals. They're just a bunch of recurring regexps with a convention of being written in all uppercase.
The following isn't a real production. It's a way to globally alter the parser's behaviour.
.wrapRE -> /{\s*()\s*};
.wrapRE
is a special production. If it exists, it must hold a single regexp
term with a ()
somewhere inside it. This will be used as a substiution for
every other regexp that appears in the grammar. The sterotypical usage is as in
the ωBNF grammar itself, where it is used to tweak all other regexps so that
they skip all whitespace (outside strings and regexps, that is).
The following is a very simplified version of the XML grammar expressed in ωBNF:
xml -> "<" NAME attr* ">" xml* "</" NAME ">" | CDATA=/{[^<]*};
attr -> NAME "=" value=/{"[^"]*"};
NAME -> /{[A-Za-z_:][-A-Za-z0-9._:]}