Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Vale's format support extensible #769

Open
jdkato opened this issue Feb 15, 2024 · 15 comments
Open

Make Vale's format support extensible #769

jdkato opened this issue Feb 15, 2024 · 15 comments

Comments

@jdkato
Copy link
Member

jdkato commented Feb 15, 2024

See #688 (comment) for a related comment.

@XVilka
Copy link

XVilka commented Mar 15, 2024

One way to do that is to work with Tree-Sitter grammars the way NeoVim does—it will allow you to analyze comments in every language, depending on the syntax or so-called "injections".

@weaversam8
Copy link

+1 for tree-sitter- there are grammars for nearly every language and tree-sitter makes it easy to design your own.

@jdkato
Copy link
Member Author

jdkato commented Jun 27, 2024

Many of Vale's supported programming languages now use Tree-sitter parsers.

I think an interesting way to make this "extensible" would be to expose the underlying queries, allowing users to decide exactly what parts of the file is linted.

@feasgal
Copy link

feasgal commented Sep 10, 2024

+1000 for this issue. Here's my use case.

Vale ignores indented blocks, fenced blocks, and code spans by default. (docs)

We use MkDocs to generate our documentation. One of the most common themes for MkDocs (and the one we use) is the Material theme, where line indentation can indicate many things other than a code block. Currently, vale ignores content in an admonition, a longer footnote, a definition list with more than one paragraph, a nested annotation, or content tabs, to name a few common examples.

We have a convention never to use indentation in the markdown to format a code block in the built HTML, but only use backtick fences. This matches Material's instructions, which don't even mention that indentation would work (presumably to avoid confusion with the many other ways it uses indentation). If code ignoring were not only configurable, but separately configurable for indented blocks, fenced blocks, and code spans, that would solve the problem completely.

But even if it were just possible to configure the whole code ignore as true or false, I could at least turn it off and then write a custom rule to ignore only the fences and/or code spans.

@geemus
Copy link

geemus commented Nov 14, 2024

I have maybe a slightly different angle of something I think I'd like here.

We use golden files for testing our CLI output and I'm interested in having the ability to lint these to help us enforce a style guide. The output doesn't really confirm to an existing format type, though it is vaguely markdown-like.

As a first pass, I've just mapped .golden to .md for vale, which broadly works but doesn't really provide the kind of scoping I would hope for. Unfortunately, something like treesitter isn't going to either since it isn't really something we can do in an existing format per se.

What I think I might want would be writing and integrating a script to convert from our format to a known-good/parseable format. I think for instance that this is more along the lines of what the asciidoc/asciidoctor format does (converting to HTML5 if I'm not mistaken). Is this an area of possible extension as well?

In lieu of that, I can do that conversion and linting through scripting, but it wasn't apparent to me how I would do that with lsp setup in my neovim. If there is some more direct way to do this already available I'd also be all ears.

@jdkato
Copy link
Member Author

jdkato commented Jan 7, 2025

I've put some work into a draft of support for JSON, YAML, and TOML -- a much requested feature (#383, #412, #911, #922).

As I've said in those issues, the challenge here is that having a syntactic-level understanding of the format isn't enough: we need to know which key(s) we want to lint and how to lint them.

For example, consider a simple OpenAPI file:

openapi: 3.0.0
info:
  title: Sample API
  description: Optional multiline or single-line description in [CommonMark](http://commonmark.org/help/) or HTML.
  version: 0.1.9

servers:
  - url: http://api.example.com/v1
    description: Optional server description, e.g. Main (production) serrver
  - url: http://staging-api.example.com
    description: Optional server description, e.g. Internal staging 

In such a file, we might only want to lint the title and description values while ignoring everything else. It's possible to achieve this with an external script that calls Vale at appropriate times, but maintaining location information is non-trivial.

My solution here is a concept with the working title of "blueprints." A blueprint is a YAML file that allows you specify an arbitrary subset of a file to lint using Dasel:

# OpenAPI.yml
select:
  - name: Metadata
    query: info.description

  - name: Servers
    query: servers.all().description

You then reference this file in your .vale.ini:

[formats]
# Treat the extracted text as Markdown.
yml = md

[*.yml]
BasedOnStyles = Vale

Blueprint = OpenAPI

The end result is the ability to lint any keys you want, while maintaining location information and support for nested markup content (commonly HTML or Markdown).

@jdkato jdkato pinned this issue Jan 7, 2025
@aireilly
Copy link

aireilly commented Jan 7, 2025

^^ That looks excellent and exactly the kind of thing I would imagine using while working with an engineering team and trying to lint these files.

@alecthegeek
Copy link
Contributor

alecthegeek commented Jan 7, 2025

Different files may need different blueprints.

For example I might have OpenAPI YAML file(s) and GitLab CI/Cd YAML file in my project.

Can these blueprints be applied at that level of granularity? e.g.

[formats]
# Treat the extracted text as Markdown.
yml = md

[/api/petstore.yml]
BasedOnStyles = Vale

Blueprint = OpenAPI

[/.gitlab-ci.yml]
BasedOnStyles = Vale

Blueprint = GitLab-CI

If we wanted to be greedy we might even ask for wildcard support (consider an OpenAPI spec that is in located in multiple files) as well as named multiple files. For example

[formats]
# Treat the extracted text as Markdown.
yml = md

[/api/**.yml]
; All yaml files under the API directory. Includes all subdirectories
BasedOnStyles = Vale

Blueprint = OpenAPI

[{/.gitlab-ci.yml,/.gitlab/gitlab-ci-production.yml}]
; Our CI/CD config is located in two specific files
BasedOnStyles = Vale

Blueprint = GitLab-CI

@jdkato
Copy link
Member Author

jdkato commented Jan 8, 2025

Yes, any form of typical .vale.ini configuration will be supported.

I've also changed the file format slightly, so that you can (1) assign a custom scope to any values you've extracted and (2) use tree-sitter queries on supported formats:

# OpenAPI.yml
engine: dasel
steps:
    # This allows us to, for example, write rules specifically
    # targeting titles. 
  - scope: title
    query: info.title
  - query: info.description
  - query: servers.all().description
# Python.yml
engine: tree-sitter
# An example that changes how Python files are linted.
steps:
  - query: (comment)+ @comment

@jdkato
Copy link
Member Author

jdkato commented Jan 8, 2025

@feasgal, @geemus: Both of those cases could be addressed by allowing the user to provide their own conversion script -- e.g., you could provide a Python script that uses PyMdown Extensions to ensure support with MkDocs.

I've avoided such a feature for two reasons: calling external scripts is slow and it means that packages would then be able to execute arbitrary code.

One idea would be to not allow such scripts to be included in a package, meaning users would have to install it independently (similar to how we handle rst2html and asciidoctor).

@geemus
Copy link

geemus commented Jan 9, 2025

@jdkato That concern about external makes good sense to me. Similarly, I think for my purposes having a way to in-effect plug in my own asciidoctor-like script would be super helpful. Keeping it as something that the user would need to explicitly/distinctly do and not a default also seems quite sensible to mitigate security concerns.

@infotexture
Copy link

@jdkato Thanks for investing the additional effort here. 🙇

The "blueprints" idea sounds very versatile, and should be able to solve many of the use cases in the linked issues. 🙏

From what you wrote above, I assume some sort of wildcard syntax could be used in blueprints to ignore the keys in each "key": "value" pair, and only run on the values as discussed in #922.

@infotexture
Copy link

The working title of “blueprints” evokes building something up based on a template, which sounds like an additive process.

If the idea is to extract something from whatever passes through it, you might consider calling these “filters”, since you'd effectively be filtering out a subset of the file to lint.

@jdkato
Copy link
Member Author

jdkato commented Jan 9, 2025

From what you wrote above, I assume some sort of wildcard syntax could be used in blueprints to ignore the keys in each "key": "value" pair, and only run on the values as discussed in #922.

You'll be able to use wildcard syntax in your .vale.ini to assign a blueprint to file(s).

The concept itself involves writing queries that use keys to select specific values to lint. The keys themselves aren't part of the linting process though.

It's similar to using jq, for those that are familiar.

The working title of “blueprints” evokes building something up based on a template, which sounds like an additive process.

If the idea is to extract something from whatever passes through it, you might consider calling these “filters”, since you'd effectively be filtering out a subset of the file to lint.

I agree mostly. Part of the challenge with naming is not reusing terms from Vale's ever growing list of features. In this case, "filters" are used to describe expressions that allow you to report a subset of your configuration,

@infotexture
Copy link

The concept itself involves writing queries that use keys to select specific values to lint.
The keys themselves aren't part of the linting process though.

That should work well, as long as there's a way to use some sort of wildcard pattern in the blueprint to lint the values for all the keys without having to specify each key by name (which can change over time and may not be known).

"filters" are used to describe expressions that allow you to report a subset of your configuration

Got it. Hadn't used those yet, so wasn't aware that term is already "taken". Makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants