Skip to content

Commit

Permalink
Merge pull request #166 from rzk-lang/blogpost/formatter
Browse files Browse the repository at this point in the history
Add a post about the formatter
  • Loading branch information
aabounegm authored Jan 7, 2024
2 parents 7f1ba9a + 40632b9 commit 81d20ea
Show file tree
Hide file tree
Showing 3 changed files with 287 additions and 0 deletions.
28 changes: 28 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# Rzk blog

## Setup

When setting up MkDocs for the first time, it is recommended to create a Python venv and activate it:
```bash
python -m venv venv
venv\Scripts\Activate.ps1
```

The activation command will depend on the OS. The command shown above is for PowerShell on Windows.

Then, install the requirements:
```bash
pip install -r docs/requirements.txt
```

## Running

Activate the venv (Windows - PowerShell):
```bash
venv\Scripts\Activate.ps1
```

Run the dev server:
```bash
mkdocs serve --config-file docs/config/en/mkdocs.yml
```
5 changes: 5 additions & 0 deletions docs/docs/en/.authors.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,8 @@ authors:
description: Random guy
avatar: https://avatars.githubusercontent.com/u/686582
url: https://github.com/fizruk
aabounegm:
name: Abdelrahman Abounegm
description: Language Server maintainer
avatar: https://avatars.githubusercontent.com/u/11016151
url: https://github.com/aabounegm
254 changes: 254 additions & 0 deletions docs/docs/en/posts/2023-12-24-formatter.rzk.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,254 @@
---
authors:
- aabounegm
date: 2023-12-24
---

# Rzk Formatter

Version 0.7.0 (and later 0.7.3 with bug fixes) of Rzk has recently been released,
and it brings with it an exciting new feature: the auto formatter!
It has been under work for over 3 weeks and came with unique design challenges
that I will attempt to unpack in this post.

<!-- more -->

## Goal

Basically, the target is to write a program that takes in Rzk code as input and
makes sure it abides by the [style guide](https://rzk-lang.github.io/sHoTT/STYLEGUIDE/) (developed mainly by [Fredrik Bakke](https://github.com/fredrik-bakke) and other contributors to the [sHoTT project](http://github.com/rzk-lang/sHoTT)) with minimal modification otherwise.
For example, comments should stay intact and semantics should be preserved.

We decided to not support any customization (at least initially), making the
formatter heavily opinionated.

## Challenges

Designing a formatter has both technical and nontechnical challenges.
From the nontechnical side, and especially since the formatter is not customizable,
there is the challenge of having the formatting requirements formulated in a
precise enough language that can be translated easily into code, assuming there
is even consensus among the community on these requirements.

From the technical perspective, the parser generator currently used (BNFC) loses
a lot of the information necessary for reconstructing the source program, so a
pretty-printer based on the parse result is not feasible (for example, all comments would be lost).
Even after this difficulty is overcome, there are still more challenges regarding the formatting itself.
For example, if a given line is too long and it is necessary to wrap it by
inserting a new line somewhere in the line, there are $O(2^n)$ possible choices
(where $n$ is the number of locations in the code where a new line could be inserted)
to compare between for inserting line breaks.
For example, consider the following Rzk code:
```rzk
#define id (A : U) (x : A) : A := x
```
which has 11 positions where a new line can be inserted, as shown by these markers:
```rzk
#define id (A : U) (x : A) : A := x
-- ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
```
This amounts to 2048 combinations that have to be ranked for such a simple line.
Even the ranking itself is not a straightforward problem to tackle.

Additionally, to properly indent each line, it is necessary to understand the
program semantics, which increases the complexity of the formatter's parser.

## Possible approaches (and their tradeoffs)

Multiple possible solutions exist for the above difficulties, each with their pros and cons.
The alternatives we considered for Rzk's formatter can be summarized as follows:

1. Implement a [Prettier](https://prettier.io/) plugin
1. Reuse the existing parser generated by BNFC
1. Write a dedicated parser for the formatter specifically
1. Reuse the exiting lexer generated by BNFC

### Prettier plugin

Seeing as some formalization projects were already making use of Prettier, a plugin
to support Rzk seemed like a reasonable option. However, that option didn't seem
so attractive for a few reasons.
First of all, it would mean forcing all users to install Prettier, which requires
having the JavaScript toolchain (in particular, `npm`) installed, but not all Rzk
developers use JavaScript. It would be better to have an in-house solution.

Additionally, [Prettier plugins](https://prettier.io/docs/en/plugins) can only be
written in JavaScript, which means having to maintain a separate parser written
from scratch in JS, which is not maintainable or even desirable.

### BNFC's parser

Rzk makes use of [BNFC](https://bnfc.digitalgrammars.com/) which produces a lexer,
parser, AST, and pretty-printer for a given _Labelled BNF_ grammar file.
The generated pretty-printer is not configurable, so it would take a lot of dirty
patching to have it follow the style guide of Rzk code.

As for the _abstract syntax tree_ (AST) it generates, it is also not suitable
particularly for being too _abstract_. It loses all information about spaces,
parentheses, and basically any syntactic construction, which is useful for a
compiler or typechecker, but not so much for a formatter.

### Dedicated parser

An alternative to the generated parser would be to manually write another parser
that keeps all the information relevant to the formatter in a **concrete syntax tree** (CST).
This would be the most customizable option, but it would also be a lot of repeated
work (since we already have the generated parser) that is additionally more error-prone.

Writing a parser manually is time-consuming and redundant since it can easily be
automatically generated from the much more readable grammar file.

### BNFC's lexer

Since BNFC generates a lexer that is used by the parser, a more rudimentary
approach would be to make use of that lexer for our formatter.
As the lexer simply tokenizes the input source code, it does not make any
decisions regarding what to throw away and keeps all syntax constructions (except comments).
Therefore, this approach would have the least amount of syntactic information lost
among the automatically generated tools.

On the hand, working with a stream of tokens is not as easy as working with a parsed tree.
First, tokens do not carry almost any semantics with them, so it is difficult to
understand whether an identifier refers to a parameter, a function application,
or if it's even still being declared.
Also, without having semantics, it is difficult to judge out which line split
would make more sense when inserting line breaks in a long line, or to figure out
how much a given line should be indented.

## Our chosen solution

The approach that seemed the most suitable of the above was the last one: use BNFC's lexer.
After going through the style guide a few times, it was clear that the vast majority
of the rules/recommendations did not involve understanding much semantics.
For example, Unicode replacement for ASCII sequences, moving `:=` to a new line,
and adding a space after `\\` in a lambda term can all be applied blindly.

For the few rules that require having some semantics (like distinguishing between `:`
for definition conclusion and for other types), it was enough to just store some
state that is preserved (and updated) while traversing the tokens; no additional
semantic analysis was necessary.

The actual implementation in Haskell is just pattern matching on tokens with some
conditions and applying edits to the text in response.
For example, the following is an excerpt of the pattern match for opening parenthesis:

```haskell
-- Ensure exactly one space after the first open paren of a line
go :: FormatState -> [Token] -> [FormattingEdit]
go s (Token "(" line col : tks)
| precededBySingleCharOnly && spacesAfter /= 1 && not isLastNonSpaceChar
= FormattingEdit line (col + 1) line (spaceColcol + 1 + spacesAfter) " "
: go (incParensDepth s) tks
-- Remove extra spaces if it's *not* the first open paren on a new line
| not precededBySingleCharOnly && spacesAfter > 0
= FormattingEdit line (col + 1) line (col + 1 + spacesAfter) ""
: go (incParensDepth s) tks
| otherwise = go (incParensDepth s) tks
where
-- Redacted for brevity
```
where `go` is a helper function containing the core logic used inside `format :: String -> String`.
It simply operates on the current state and the matched token(s) and returns a list
of edits to be applied to the text. The `FormattingEdit` structure provides compatibility
with LSP (for the VS Code extension) and is easy to reuse for the CLI.
Additionally, we try to make sure that these edits perform the minimal changes
needed so that nothing else is affected by mistake (such as comments).

While this implementation is more error-prone than an auto-generated pretty-printer,
it is much more flexible and unit tests have been added that verify the various
rules to make sure no functionality gets broken by mistake. In fact, the unit tests
helped catch a bug with the formatter right as we thought we were ready to release it.

### Implemented rules

The formatter currently implements the following features:

- Add a space after the opening parenthesis to help with the code tree structure
- Put the definition conclusion (type, starting with `:`) and construction (body, starting with `:=`) on new lines
- Add a space after the `\\` of a lambda term and around binary operators (like `,` and `=`)
- Move binary operators to the beginning of the next line if they appear at the end of the previous one
- Replace common ASCII sequences with their Unicode equivalent
- Removes extraneous spaces in some places like before `)` or between `#lang` and `rzk-1` (though it's not comprehensive yet)

### Shortcomings

No software is ever perfect, especially from the first release, and our formatter
is definitely no exception 🙂.

For example, if the source code contained tab characters, instead of throwing its
hands up in the air and angrily shouting "I can't deal with this", the formatter
decides to silently take revenge and mess up your code by deleting random characters
around the tab character, but only every once in a while!
(I really wish I was joking 😅)

Thankfully, the accompanying [VS Code extension](https://marketplace.visualstudio.com/items?itemName=NikolaiKudasovfizruk.rzk-1-experimental-highlighting)
configures some default settings to avoid this situation by always inserting spaces.

Aside from bugs like this, there are also features that are simply not yet implemented.
For example, these include [inserting line breaks](https://github.com/rzk-lang/rzk/issues/147)
in very long lines and
[inferring the correct indentation level](https://github.com/rzk-lang/rzk/issues/148) a line should have.
For the latter, we decided that it might be too constraining to force a specific
indentation on the users when the line might be more readable using a different indentation,
and the formatter is not smart enough to decide when it should adjust indentation or leave it as is.

Lastly, some desirable features have been identified as belonging more to the linter
(which we do not have yet) rather than the formatter, such as detecting unnecessary
parenthesis, deprecated syntax, and expressions that could be simplified.
The linter is currently not on the roadmap yet, but we hope to have it sometime in the future.

## How to use the formatter

The formatter is shipped as part of the `rzk` executable under the subcommand `format`.
It is also integrated in the language server and used by the VS Code extension.
Additionally, the GitHub Action now has an option to check the formatting as well.

### CLI

The command looks like so:
```bash
rzk format [--check] [--write] [file_path]...
```

It follows the convention of other formatters (Prettier and [Fourmolu](https://hackage.haskell.org/package/fourmolu), to name a couple)
in that usage with any flags (`rzk format file.rzk`) outputs the result to `stdout`,
using `--write` will modify the files in place, and `--check` is only for verifying
that the files are already well-formatted (and exits with a non-zero code if they aren't).
Naturally, `--write` and `--check` are mutually exclusive and it would be an error to specify both.

If no file paths are given, then they will be read from `rzk.yaml` if it exists (just like with `typecheck`).

### VS Code

The accompanying release of the VS Code extension integrated formatter support
so you can use the built-in [`Format Document`](https://code.visualstudio.com/docs/editor/codebasics#_formatting)
command to format Rzk and Literate Rzk Markdown files.
It also comes with new default settings for Rzk file that ensure consistent indentation (2 spaces),
enable formatting on save, and add a ruler at 80 chars to help users identify
long lines since the formatter cannot wrap them just yet.

The formatter can be turned off manually by setting the `rzk.format.enable` option to `false`.

### GitHub Action

A new option has been added to the [`rzk-action`](https://github.com/rzk-lang/rzk-action)
called `check-formatting` that, well, _checks_ the _formatting_ of the input files.
It is off by default, and cannot be used with `rzk-version` lower than `v0.7.1`.

## Conclusion and Feedback

This was an exciting milestone for the Rzk toolset and for me personally as I have
never written a similar piece of code before.
That was an interesting engineering and research challenge, and I enjoyed it all the way.

Contributions have been made to popular Rzk formalization repos to autoformat
their content, such as [sHoTT](https://github.com/rzk-lang/sHoTT/pull/142) and the [HoTT Book](https://github.com/rzk-lang/hottbook/pull/16).

I encourage everyone to use the formatter and help us by reporting any possible bugs
on our [GitHub issues](https://github.com/rzk-lang/rzk/issues) page (using the label `formatter`).
There also exists a [Zulip thread](https://rzk-lang.zulipchat.com/#narrow/stream/407112-general/topic/rzk.20v0.2E7.2E*.20with.20autoformatter.20.28experimental.29)
about the autoformatter, so feedback and discussions are welcome there as well.
Many thanks to [Fredrik Bakke](https://github.com/fredrik-bakke) for his helpful feedback
on the [sHoTT formatting PR](https://github.com/rzk-lang/sHoTT/pull/142)!

See you in future blog posts :)

0 comments on commit 81d20ea

Please sign in to comment.