Merge pull request #166 from rzk-lang/blogpost/formatter

Add a post about the formatter
rzk-lang · Jan 7, 2024 · 81d20ea · 81d20ea
2 parents 7f1ba9a + 40632b9
commit 81d20ea
Show file tree

Hide file tree

Showing 3 changed files with 287 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,28 @@
+# Rzk blog
+
+## Setup
+
+When setting up MkDocs for the first time, it is recommended to create a Python venv and activate it:
+```bash
+python -m venv venv
+venv\Scripts\Activate.ps1
+```
+
+The activation command will depend on the OS. The command shown above is for PowerShell on Windows.
+
+Then, install the requirements:
+```bash
+pip install -r docs/requirements.txt
+```
+
+## Running
+
+Activate the venv (Windows - PowerShell):
+```bash
+venv\Scripts\Activate.ps1
+```
+
+Run the dev server:
+```bash
+mkdocs serve --config-file docs/config/en/mkdocs.yml
+```
diff --git a/docs/docs/en/.authors.yml b/docs/docs/en/.authors.yml
@@ -4,3 +4,8 @@ authors:
     description: Random guy
     avatar: https://avatars.githubusercontent.com/u/686582
     url: https://github.com/fizruk
+  aabounegm:
+    name: Abdelrahman Abounegm
+    description: Language Server maintainer
+    avatar: https://avatars.githubusercontent.com/u/11016151
+    url: https://github.com/aabounegm
diff --git a/docs/docs/en/posts/2023-12-24-formatter.rzk.md b/docs/docs/en/posts/2023-12-24-formatter.rzk.md
@@ -0,0 +1,254 @@
+---
+authors:
+  - aabounegm
+date: 2023-12-24
+---
+
+# Rzk Formatter
+
+Version 0.7.0 (and later 0.7.3 with bug fixes) of Rzk has recently been released,
+and it brings with it an exciting new feature: the auto formatter!
+It has been under work for over 3 weeks and came with unique design challenges
+that I will attempt to unpack in this post.
+
+<!-- more -->
+
+## Goal
+
+Basically, the target is to write a program that takes in Rzk code as input and
+makes sure it abides by the [style guide](https://rzk-lang.github.io/sHoTT/STYLEGUIDE/) (developed mainly by [Fredrik Bakke](https://github.com/fredrik-bakke) and other contributors to the [sHoTT project](http://github.com/rzk-lang/sHoTT)) with minimal modification otherwise.
+For example, comments should stay intact and semantics should be preserved.
+
+We decided to not support any customization (at least initially), making the
+formatter heavily opinionated.
+
+## Challenges
+
+Designing a formatter has both technical and nontechnical challenges.
+From the nontechnical side, and especially since the formatter is not customizable,
+there is the challenge of having the formatting requirements formulated in a
+precise enough language that can be translated easily into code, assuming there
+is even consensus among the community on these requirements.
+
+From the technical perspective, the parser generator currently used (BNFC) loses
+a lot of the information necessary for reconstructing the source program, so a
+pretty-printer based on the parse result is not feasible (for example, all comments would be lost).
+Even after this difficulty is overcome, there are still more challenges regarding the formatting itself.
+For example, if a given line is too long and it is necessary to wrap it by
+inserting a new line somewhere in the line, there are $O(2^n)$ possible choices
+(where $n$ is the number of locations in the code where a new line could be inserted)
+to compare between for inserting line breaks.
+For example, consider the following Rzk code:
+```rzk
+#define id (A : U) (x : A) : A := x
+```
+which has 11 positions where a new line can be inserted, as shown by these markers:
+```rzk
+#define id (A : U) (x : A) : A := x
+--     ^  ^  ^ ^  ^  ^ ^  ^ ^ ^  ^
+```
+This amounts to 2048 combinations that have to be ranked for such a simple line.
+Even the ranking itself is not a straightforward problem to tackle.
+
+Additionally, to properly indent each line, it is necessary to understand the
+program semantics, which increases the complexity of the formatter's parser.
+
+## Possible approaches (and their tradeoffs)
+
+Multiple possible solutions exist for the above difficulties, each with their pros and cons.
+The alternatives we considered for Rzk's formatter can be summarized as follows:
+
+1. Implement a [Prettier](https://prettier.io/) plugin
+1. Reuse the existing parser generated by BNFC
+1. Write a dedicated parser for the formatter specifically
+1. Reuse the exiting lexer generated by BNFC
+
+### Prettier plugin
+
+Seeing as some formalization projects were already making use of Prettier, a plugin
+to support Rzk seemed like a reasonable option. However, that option didn't seem
+so attractive for a few reasons.
+First of all, it would mean forcing all users to install Prettier, which requires
+having the JavaScript toolchain (in particular, `npm`) installed, but not all Rzk
+developers use JavaScript. It would be better to have an in-house solution.
+
+Additionally, [Prettier plugins](https://prettier.io/docs/en/plugins) can only be
+written in JavaScript, which means having to maintain a separate parser written
+from scratch in JS, which is not maintainable or even desirable.
+
+### BNFC's parser
+
+Rzk makes use of [BNFC](https://bnfc.digitalgrammars.com/) which produces a lexer,
+parser, AST, and pretty-printer for a given _Labelled BNF_ grammar file.
+The generated pretty-printer is not configurable, so it would take a lot of dirty
+patching to have it follow the style guide of Rzk code.
+
+As for the _abstract syntax tree_ (AST) it generates, it is also not suitable
+particularly for being too _abstract_. It loses all information about spaces,
+parentheses, and basically any syntactic construction, which is useful for a
+compiler or typechecker, but not so much for a formatter.
+
+### Dedicated parser
+
+An alternative to the generated parser would be to manually write another parser
+that keeps all the information relevant to the formatter in a **concrete syntax tree** (CST).
+This would be the most customizable option, but it would also be a lot of repeated
+work (since we already have the generated parser) that is additionally more error-prone.
+
+Writing a parser manually is time-consuming and redundant since it can easily be
+automatically generated from the much more readable grammar file.
+
+### BNFC's lexer
+
+Since BNFC generates a lexer that is used by the parser, a more rudimentary
+approach would be to make use of that lexer for our formatter.
+As the lexer simply tokenizes the input source code, it does not make any
+decisions regarding what to throw away and keeps all syntax constructions (except comments).
+Therefore, this approach would have the least amount of syntactic information lost
+among the automatically generated tools.
+
+On the hand, working with a stream of tokens is not as easy as working with a parsed tree.
+First, tokens do not carry almost any semantics with them, so it is difficult to
+understand whether an identifier refers to a parameter, a function application,
+or if it's even still being declared.
+Also, without having semantics, it is difficult to judge out which line split
+would make more sense when inserting line breaks in a long line, or to figure out
+how much a given line should be indented.
+
+## Our chosen solution
+
+The approach that seemed the most suitable of the above was the last one: use BNFC's lexer.
+After going through the style guide a few times, it was clear that the vast majority
+of the rules/recommendations did not involve understanding much semantics.
+For example, Unicode replacement for ASCII sequences, moving `:=` to a new line,
+and adding a space after `\\` in a lambda term can all be applied blindly.
+
+For the few rules that require having some semantics (like distinguishing between `:`
+for definition conclusion and for other types), it was enough to just store some
+state that is preserved (and updated) while traversing the tokens; no additional
+semantic analysis was necessary.
+
+The actual implementation in Haskell is just pattern matching on tokens with some
+conditions and applying edits to the text in response.
+For example, the following is an excerpt of the pattern match for opening parenthesis:
+
+```haskell
+-- Ensure exactly one space after the first open paren of a line
+go :: FormatState -> [Token] -> [FormattingEdit]
+go s (Token "(" line col : tks)
+  | precededBySingleCharOnly && spacesAfter /= 1 && not isLastNonSpaceChar
+    = FormattingEdit line (col + 1) line (spaceColcol + 1 + spacesAfter) " "
+    : go (incParensDepth s) tks
+  -- Remove extra spaces if it's *not* the first open paren on a new line
+  | not precededBySingleCharOnly && spacesAfter > 0
+    = FormattingEdit line (col + 1) line (col + 1 + spacesAfter) ""
+    : go (incParensDepth s) tks
+  | otherwise = go (incParensDepth s) tks
+  where
+    -- Redacted for brevity
+```
+where `go` is a helper function containing the core logic used inside `format :: String -> String`.
+It simply operates on the current state and the matched token(s) and returns a list
+of edits to be applied to the text. The `FormattingEdit` structure provides compatibility
+with LSP (for the VS Code extension) and is easy to reuse for the CLI.
+Additionally, we try to make sure that these edits perform the minimal changes
+needed so that nothing else is affected by mistake (such as comments).
+
+While this implementation is more error-prone than an auto-generated pretty-printer,
+it is much more flexible and unit tests have been added that verify the various
+rules to make sure no functionality gets broken by mistake. In fact, the unit tests
+helped catch a bug with the formatter right as we thought we were ready to release it.
+
+### Implemented rules
+
+The formatter currently implements the following features:
+
+- Add a space after the opening parenthesis to help with the code tree structure
+- Put the definition conclusion (type, starting with `:`) and construction (body, starting with `:=`) on new lines
+- Add a space after the `\\` of a lambda term and around binary operators (like `,` and `=`)
+- Move binary operators to the beginning of the next line if they appear at the end of the previous one
+- Replace common ASCII sequences with their Unicode equivalent
+- Removes extraneous spaces in some places like before `)` or between `#lang` and `rzk-1` (though it's not comprehensive yet)
+
+### Shortcomings
+
+No software is ever perfect, especially from the first release, and our formatter
+is definitely no exception 🙂.
+
+For example, if the source code contained tab characters, instead of throwing its
+hands up in the air and angrily shouting "I can't deal with this", the formatter
+decides to silently take revenge and mess up your code by deleting random characters
+around the tab character, but only every once in a while!
+(I really wish I was joking 😅)
+
+Thankfully, the accompanying [VS Code extension](https://marketplace.visualstudio.com/items?itemName=NikolaiKudasovfizruk.rzk-1-experimental-highlighting)
+configures some default settings to avoid this situation by always inserting spaces.
+
+Aside from bugs like this, there are also features that are simply not yet implemented.
+For example, these include [inserting line breaks](https://github.com/rzk-lang/rzk/issues/147)
+in very long lines and
+[inferring the correct indentation level](https://github.com/rzk-lang/rzk/issues/148) a line should have.
+For the latter, we decided that it might be too constraining to force a specific
+indentation on the users when the line might be more readable using a different indentation,
+and the formatter is not smart enough to decide when it should adjust indentation or leave it as is.
+
+Lastly, some desirable features have been identified as belonging more to the linter
+(which we do not have yet) rather than the formatter, such as detecting unnecessary
+parenthesis, deprecated syntax, and expressions that could be simplified.
+The linter is currently not on the roadmap yet, but we hope to have it sometime in the future.
+
+## How to use the formatter
+
+The formatter is shipped as part of the `rzk` executable under the subcommand `format`.
+It is also integrated in the language server and used by the VS Code extension.
+Additionally, the GitHub Action now has an option to check the formatting as well.
+
+### CLI
+
+The command looks like so:
+```bash
+rzk format [--check] [--write] [file_path]...
+```
+
+It follows the convention of other formatters (Prettier and [Fourmolu](https://hackage.haskell.org/package/fourmolu), to name a couple)
+in that usage with any flags (`rzk format file.rzk`) outputs the result to `stdout`,
+using `--write` will modify the files in place, and `--check` is only for verifying
+that the files are already well-formatted (and exits with a non-zero code if they aren't).
+Naturally, `--write` and `--check` are mutually exclusive and it would be an error to specify both.
+
+If no file paths are given, then they will be read from `rzk.yaml` if it exists (just like with `typecheck`).
+
+### VS Code
+
+The accompanying release of the VS Code extension integrated formatter support
+so you can use the built-in [`Format Document`](https://code.visualstudio.com/docs/editor/codebasics#_formatting)
+command to format Rzk and Literate Rzk Markdown files.
+It also comes with new default settings for Rzk file that ensure consistent indentation (2 spaces),
+enable formatting on save, and add a ruler at 80 chars to help users identify
+long lines since the formatter cannot wrap them just yet.
+
+The formatter can be turned off manually by setting the `rzk.format.enable` option to `false`.
+
+### GitHub Action
+
+A new option has been added to the [`rzk-action`](https://github.com/rzk-lang/rzk-action)
+called `check-formatting` that, well, _checks_ the _formatting_ of the input files.
+It is off by default, and cannot be used with `rzk-version` lower than `v0.7.1`.
+
+## Conclusion and Feedback
+
+This was an exciting milestone for the Rzk toolset and for me personally as I have
+never written a similar piece of code before.
+That was an interesting engineering and research challenge, and I enjoyed it all the way.
+
+Contributions have been made to popular Rzk formalization repos to autoformat
+their content, such as [sHoTT](https://github.com/rzk-lang/sHoTT/pull/142) and the [HoTT Book](https://github.com/rzk-lang/hottbook/pull/16).
+
+I encourage everyone to use the formatter and help us by reporting any possible bugs
+on our [GitHub issues](https://github.com/rzk-lang/rzk/issues) page (using the label `formatter`).
+There also exists a [Zulip thread](https://rzk-lang.zulipchat.com/#narrow/stream/407112-general/topic/rzk.20v0.2E7.2E*.20with.20autoformatter.20.28experimental.29)
+about the autoformatter, so feedback and discussions are welcome there as well.
+Many thanks to [Fredrik Bakke](https://github.com/fredrik-bakke) for his helpful feedback
+on the [sHoTT formatting PR](https://github.com/rzk-lang/sHoTT/pull/142)!
+
+See you in future blog posts :)