Skip to content

Concise monadic parser combinator library with separate lexer/parser phases, off-side rule and big-size input support.

License

Notifications You must be signed in to change notification settings

DNemtsov/Lexepars

Repository files navigation

NuGet

Lexepars

Lexepars is a concise, lightweight, but powerful monadic parser combinator library. Monadic means that you always get conclusive parsing results. Combininig easier parsers together allows you to build however complex grammars (which are, in turn, parsers). The library can compete with the full-featured parser generation suites such as Antlr while not possessing their wordiness and the overhead of dealing with the custom grammar specification languages, multiple tools in the chain and the code "bestowed upon you" in your project. What makes the Lexepars so powerful is its separate tokenization and parsing phases. A perfect illustration of that is the only dotnet parser combinator lib capable of parsing the off-side rule languages where the indents are part of the syntax, even as tricky as YAML. Check out the YAML Grammar! It's written entirely in C# and is fully CLS-compliant, so you can create your grammars right in your code in any dotnet language. And don't forget to set up the NuGet symbol server to make debugging a breeze:) Lexepars is deemed as the natural evolution of Patrick Lioi's Parsley with an emphasis made on memory and speed optimizations for very big input files, as well as client code conciseness and flexibility of usage. It can parse context-sensitive, infinite look-ahead grammars, but it performs best on LL(1) grammars and languages.

Installation

First, install NuGet. Then, install the Lexepars package from the package manager console:

PM> Install-Package Lexepars

It's just a small single assembly, with no strings attached:)

Tokenization (Lexing)

The lexer phase is performed with a prioritized list of regex patterns, and parser grammars are expressed in terms of the tokens produced by the lexer. Input text being parsed is represented with a InputText or, better yet, a LinedInputText instance, which tracks the current parsing position:

    var text = new Text("some input to parse");

The lexer phase is implemented by anything that produces an IEnumerable<Token>. The default implementation, Lexer, builds the series of tokens when given a prioritized series of TokenKind token recognizers. The most common TokenKind implementation is PatternTokenKind, which recognizes tokens via regular expressions, but there are many more. TokenKinds can be marked as skippable, if you want them to be recognized but discarded:

    var text = new Text("1 2 3 a b c");
    var lexer = new Lexer(new Pattern("letter", @"[a-z]"),
                          new Pattern("number", @"[0-9]+"),
                          new Pattern("whitespace", @"\s+", skippable: true));

    Token[] tokens = lexer.ToArray();

Above, the array tokens will contain 6 Token objects. Each Token contains the lexeme ("1", "a", etc), the TokenKind that matched it, and the Position where the lexeme was found.

The collection of Token produced by the lexer phase is wrapped in a TokenStream, which allows the rest of the system to traverse the collection of tokens in an immutable fashion.

Parsing

Parser consumes a TokenStream and produces a parsed value based on the lexemes of the parsed tokens:

public interface IParser<out TValue> : IGeneralParser
{
    IReply<TValue> Parse(TokenStream tokens);
}

When the value is not needed, e.g. for grammar validation purposes, IGeneralParser comes into play as a valuable means of optimization.

public interface IGeneralParser
{
    IGeneralReply ParseGenerally(TokenStream tokens);
}

IReply<out TValue> and IGeneralReply describe whether or not the parser succeeded, the parsed value (on success), a possibly-empty error message list, and a reference to the TokenStream representing the remaining unparsed tokens.

Grammars

Grammars should inherit from Grammar to take advantage of the numerous parser primitives and extensions that serve as the building blocks for the rules. Grammars should define each grammar rule in terms of these primitives, ultimately exposing the entry rule as some Parser<TValue>. Grammar rule bodies may consist of LINQ queries, which allow you to glue together other grammar rules in sequence:

See the integration tests for a sample JSON grammar.

Finally, we can put all these pieces together to parse some text:

    var input = "{\"zero\" : 0, \"one\" : 1, \"two\" : 2}";
    var lexer = new JsonLexer();
    var tokens = lexer.Tokenize(input);
    var jobject = (JObject)JsonGrammar.Json.Parse(tokens).Value;

Documentation

Because of the fast evolution of the library, it would be hard to maintain standalone documentation. Therefore, much attention is paid to keeping the code either self-explanatory, or furnished with the sufficient documentation comments with the advantage of having them available in the compiled assembly. Here are the summarized points of interest to go through:

Enjoy and happy time coding!