-
Notifications
You must be signed in to change notification settings - Fork 10
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #166 from rzk-lang/blogpost/formatter
Add a post about the formatter
- Loading branch information
Showing
3 changed files
with
287 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# Rzk blog | ||
|
||
## Setup | ||
|
||
When setting up MkDocs for the first time, it is recommended to create a Python venv and activate it: | ||
```bash | ||
python -m venv venv | ||
venv\Scripts\Activate.ps1 | ||
``` | ||
|
||
The activation command will depend on the OS. The command shown above is for PowerShell on Windows. | ||
|
||
Then, install the requirements: | ||
```bash | ||
pip install -r docs/requirements.txt | ||
``` | ||
|
||
## Running | ||
|
||
Activate the venv (Windows - PowerShell): | ||
```bash | ||
venv\Scripts\Activate.ps1 | ||
``` | ||
|
||
Run the dev server: | ||
```bash | ||
mkdocs serve --config-file docs/config/en/mkdocs.yml | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,254 @@ | ||
--- | ||
authors: | ||
- aabounegm | ||
date: 2023-12-24 | ||
--- | ||
|
||
# Rzk Formatter | ||
|
||
Version 0.7.0 (and later 0.7.3 with bug fixes) of Rzk has recently been released, | ||
and it brings with it an exciting new feature: the auto formatter! | ||
It has been under work for over 3 weeks and came with unique design challenges | ||
that I will attempt to unpack in this post. | ||
|
||
<!-- more --> | ||
|
||
## Goal | ||
|
||
Basically, the target is to write a program that takes in Rzk code as input and | ||
makes sure it abides by the [style guide](https://rzk-lang.github.io/sHoTT/STYLEGUIDE/) (developed mainly by [Fredrik Bakke](https://github.com/fredrik-bakke) and other contributors to the [sHoTT project](http://github.com/rzk-lang/sHoTT)) with minimal modification otherwise. | ||
For example, comments should stay intact and semantics should be preserved. | ||
|
||
We decided to not support any customization (at least initially), making the | ||
formatter heavily opinionated. | ||
|
||
## Challenges | ||
|
||
Designing a formatter has both technical and nontechnical challenges. | ||
From the nontechnical side, and especially since the formatter is not customizable, | ||
there is the challenge of having the formatting requirements formulated in a | ||
precise enough language that can be translated easily into code, assuming there | ||
is even consensus among the community on these requirements. | ||
|
||
From the technical perspective, the parser generator currently used (BNFC) loses | ||
a lot of the information necessary for reconstructing the source program, so a | ||
pretty-printer based on the parse result is not feasible (for example, all comments would be lost). | ||
Even after this difficulty is overcome, there are still more challenges regarding the formatting itself. | ||
For example, if a given line is too long and it is necessary to wrap it by | ||
inserting a new line somewhere in the line, there are $O(2^n)$ possible choices | ||
(where $n$ is the number of locations in the code where a new line could be inserted) | ||
to compare between for inserting line breaks. | ||
For example, consider the following Rzk code: | ||
```rzk | ||
#define id (A : U) (x : A) : A := x | ||
``` | ||
which has 11 positions where a new line can be inserted, as shown by these markers: | ||
```rzk | ||
#define id (A : U) (x : A) : A := x | ||
-- ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ | ||
``` | ||
This amounts to 2048 combinations that have to be ranked for such a simple line. | ||
Even the ranking itself is not a straightforward problem to tackle. | ||
|
||
Additionally, to properly indent each line, it is necessary to understand the | ||
program semantics, which increases the complexity of the formatter's parser. | ||
|
||
## Possible approaches (and their tradeoffs) | ||
|
||
Multiple possible solutions exist for the above difficulties, each with their pros and cons. | ||
The alternatives we considered for Rzk's formatter can be summarized as follows: | ||
|
||
1. Implement a [Prettier](https://prettier.io/) plugin | ||
1. Reuse the existing parser generated by BNFC | ||
1. Write a dedicated parser for the formatter specifically | ||
1. Reuse the exiting lexer generated by BNFC | ||
|
||
### Prettier plugin | ||
|
||
Seeing as some formalization projects were already making use of Prettier, a plugin | ||
to support Rzk seemed like a reasonable option. However, that option didn't seem | ||
so attractive for a few reasons. | ||
First of all, it would mean forcing all users to install Prettier, which requires | ||
having the JavaScript toolchain (in particular, `npm`) installed, but not all Rzk | ||
developers use JavaScript. It would be better to have an in-house solution. | ||
|
||
Additionally, [Prettier plugins](https://prettier.io/docs/en/plugins) can only be | ||
written in JavaScript, which means having to maintain a separate parser written | ||
from scratch in JS, which is not maintainable or even desirable. | ||
|
||
### BNFC's parser | ||
|
||
Rzk makes use of [BNFC](https://bnfc.digitalgrammars.com/) which produces a lexer, | ||
parser, AST, and pretty-printer for a given _Labelled BNF_ grammar file. | ||
The generated pretty-printer is not configurable, so it would take a lot of dirty | ||
patching to have it follow the style guide of Rzk code. | ||
|
||
As for the _abstract syntax tree_ (AST) it generates, it is also not suitable | ||
particularly for being too _abstract_. It loses all information about spaces, | ||
parentheses, and basically any syntactic construction, which is useful for a | ||
compiler or typechecker, but not so much for a formatter. | ||
|
||
### Dedicated parser | ||
|
||
An alternative to the generated parser would be to manually write another parser | ||
that keeps all the information relevant to the formatter in a **concrete syntax tree** (CST). | ||
This would be the most customizable option, but it would also be a lot of repeated | ||
work (since we already have the generated parser) that is additionally more error-prone. | ||
|
||
Writing a parser manually is time-consuming and redundant since it can easily be | ||
automatically generated from the much more readable grammar file. | ||
|
||
### BNFC's lexer | ||
|
||
Since BNFC generates a lexer that is used by the parser, a more rudimentary | ||
approach would be to make use of that lexer for our formatter. | ||
As the lexer simply tokenizes the input source code, it does not make any | ||
decisions regarding what to throw away and keeps all syntax constructions (except comments). | ||
Therefore, this approach would have the least amount of syntactic information lost | ||
among the automatically generated tools. | ||
|
||
On the hand, working with a stream of tokens is not as easy as working with a parsed tree. | ||
First, tokens do not carry almost any semantics with them, so it is difficult to | ||
understand whether an identifier refers to a parameter, a function application, | ||
or if it's even still being declared. | ||
Also, without having semantics, it is difficult to judge out which line split | ||
would make more sense when inserting line breaks in a long line, or to figure out | ||
how much a given line should be indented. | ||
|
||
## Our chosen solution | ||
|
||
The approach that seemed the most suitable of the above was the last one: use BNFC's lexer. | ||
After going through the style guide a few times, it was clear that the vast majority | ||
of the rules/recommendations did not involve understanding much semantics. | ||
For example, Unicode replacement for ASCII sequences, moving `:=` to a new line, | ||
and adding a space after `\\` in a lambda term can all be applied blindly. | ||
|
||
For the few rules that require having some semantics (like distinguishing between `:` | ||
for definition conclusion and for other types), it was enough to just store some | ||
state that is preserved (and updated) while traversing the tokens; no additional | ||
semantic analysis was necessary. | ||
|
||
The actual implementation in Haskell is just pattern matching on tokens with some | ||
conditions and applying edits to the text in response. | ||
For example, the following is an excerpt of the pattern match for opening parenthesis: | ||
|
||
```haskell | ||
-- Ensure exactly one space after the first open paren of a line | ||
go :: FormatState -> [Token] -> [FormattingEdit] | ||
go s (Token "(" line col : tks) | ||
| precededBySingleCharOnly && spacesAfter /= 1 && not isLastNonSpaceChar | ||
= FormattingEdit line (col + 1) line (spaceColcol + 1 + spacesAfter) " " | ||
: go (incParensDepth s) tks | ||
-- Remove extra spaces if it's *not* the first open paren on a new line | ||
| not precededBySingleCharOnly && spacesAfter > 0 | ||
= FormattingEdit line (col + 1) line (col + 1 + spacesAfter) "" | ||
: go (incParensDepth s) tks | ||
| otherwise = go (incParensDepth s) tks | ||
where | ||
-- Redacted for brevity | ||
``` | ||
where `go` is a helper function containing the core logic used inside `format :: String -> String`. | ||
It simply operates on the current state and the matched token(s) and returns a list | ||
of edits to be applied to the text. The `FormattingEdit` structure provides compatibility | ||
with LSP (for the VS Code extension) and is easy to reuse for the CLI. | ||
Additionally, we try to make sure that these edits perform the minimal changes | ||
needed so that nothing else is affected by mistake (such as comments). | ||
|
||
While this implementation is more error-prone than an auto-generated pretty-printer, | ||
it is much more flexible and unit tests have been added that verify the various | ||
rules to make sure no functionality gets broken by mistake. In fact, the unit tests | ||
helped catch a bug with the formatter right as we thought we were ready to release it. | ||
|
||
### Implemented rules | ||
|
||
The formatter currently implements the following features: | ||
|
||
- Add a space after the opening parenthesis to help with the code tree structure | ||
- Put the definition conclusion (type, starting with `:`) and construction (body, starting with `:=`) on new lines | ||
- Add a space after the `\\` of a lambda term and around binary operators (like `,` and `=`) | ||
- Move binary operators to the beginning of the next line if they appear at the end of the previous one | ||
- Replace common ASCII sequences with their Unicode equivalent | ||
- Removes extraneous spaces in some places like before `)` or between `#lang` and `rzk-1` (though it's not comprehensive yet) | ||
|
||
### Shortcomings | ||
|
||
No software is ever perfect, especially from the first release, and our formatter | ||
is definitely no exception 🙂. | ||
|
||
For example, if the source code contained tab characters, instead of throwing its | ||
hands up in the air and angrily shouting "I can't deal with this", the formatter | ||
decides to silently take revenge and mess up your code by deleting random characters | ||
around the tab character, but only every once in a while! | ||
(I really wish I was joking 😅) | ||
|
||
Thankfully, the accompanying [VS Code extension](https://marketplace.visualstudio.com/items?itemName=NikolaiKudasovfizruk.rzk-1-experimental-highlighting) | ||
configures some default settings to avoid this situation by always inserting spaces. | ||
|
||
Aside from bugs like this, there are also features that are simply not yet implemented. | ||
For example, these include [inserting line breaks](https://github.com/rzk-lang/rzk/issues/147) | ||
in very long lines and | ||
[inferring the correct indentation level](https://github.com/rzk-lang/rzk/issues/148) a line should have. | ||
For the latter, we decided that it might be too constraining to force a specific | ||
indentation on the users when the line might be more readable using a different indentation, | ||
and the formatter is not smart enough to decide when it should adjust indentation or leave it as is. | ||
|
||
Lastly, some desirable features have been identified as belonging more to the linter | ||
(which we do not have yet) rather than the formatter, such as detecting unnecessary | ||
parenthesis, deprecated syntax, and expressions that could be simplified. | ||
The linter is currently not on the roadmap yet, but we hope to have it sometime in the future. | ||
|
||
## How to use the formatter | ||
|
||
The formatter is shipped as part of the `rzk` executable under the subcommand `format`. | ||
It is also integrated in the language server and used by the VS Code extension. | ||
Additionally, the GitHub Action now has an option to check the formatting as well. | ||
|
||
### CLI | ||
|
||
The command looks like so: | ||
```bash | ||
rzk format [--check] [--write] [file_path]... | ||
``` | ||
|
||
It follows the convention of other formatters (Prettier and [Fourmolu](https://hackage.haskell.org/package/fourmolu), to name a couple) | ||
in that usage with any flags (`rzk format file.rzk`) outputs the result to `stdout`, | ||
using `--write` will modify the files in place, and `--check` is only for verifying | ||
that the files are already well-formatted (and exits with a non-zero code if they aren't). | ||
Naturally, `--write` and `--check` are mutually exclusive and it would be an error to specify both. | ||
|
||
If no file paths are given, then they will be read from `rzk.yaml` if it exists (just like with `typecheck`). | ||
|
||
### VS Code | ||
|
||
The accompanying release of the VS Code extension integrated formatter support | ||
so you can use the built-in [`Format Document`](https://code.visualstudio.com/docs/editor/codebasics#_formatting) | ||
command to format Rzk and Literate Rzk Markdown files. | ||
It also comes with new default settings for Rzk file that ensure consistent indentation (2 spaces), | ||
enable formatting on save, and add a ruler at 80 chars to help users identify | ||
long lines since the formatter cannot wrap them just yet. | ||
|
||
The formatter can be turned off manually by setting the `rzk.format.enable` option to `false`. | ||
|
||
### GitHub Action | ||
|
||
A new option has been added to the [`rzk-action`](https://github.com/rzk-lang/rzk-action) | ||
called `check-formatting` that, well, _checks_ the _formatting_ of the input files. | ||
It is off by default, and cannot be used with `rzk-version` lower than `v0.7.1`. | ||
|
||
## Conclusion and Feedback | ||
|
||
This was an exciting milestone for the Rzk toolset and for me personally as I have | ||
never written a similar piece of code before. | ||
That was an interesting engineering and research challenge, and I enjoyed it all the way. | ||
|
||
Contributions have been made to popular Rzk formalization repos to autoformat | ||
their content, such as [sHoTT](https://github.com/rzk-lang/sHoTT/pull/142) and the [HoTT Book](https://github.com/rzk-lang/hottbook/pull/16). | ||
|
||
I encourage everyone to use the formatter and help us by reporting any possible bugs | ||
on our [GitHub issues](https://github.com/rzk-lang/rzk/issues) page (using the label `formatter`). | ||
There also exists a [Zulip thread](https://rzk-lang.zulipchat.com/#narrow/stream/407112-general/topic/rzk.20v0.2E7.2E*.20with.20autoformatter.20.28experimental.29) | ||
about the autoformatter, so feedback and discussions are welcome there as well. | ||
Many thanks to [Fredrik Bakke](https://github.com/fredrik-bakke) for his helpful feedback | ||
on the [sHoTT formatting PR](https://github.com/rzk-lang/sHoTT/pull/142)! | ||
|
||
See you in future blog posts :) |