Adding a New Language to Combobulate

Mickey Petersen edited this page Sep 27, 2024 · 2 revisions

So you've decided to extend Combobulate so it supports another language? That's great. Thank you for helping out. I really appreciate it!

Contribution Rules

Before I explain how to go about doing this, you should keep a few things in mind, particularly if you want to contribute it to Combobulate's core repo. (I am working on getting Combobulate onto a package-managed service to make it easier to have external dependencies.)

The Rules:

  1. You should ensure support is nigh-complete. It's easy to miss odds and ends; that's fine. I am sure I have, too, still, with some of the existing languages.

    However, defun, logical, sibling and hierarchical navigation must work cleanly and properly everywhere. Adding support for sequences and sexp is optional, as not every language needs it.

  2. Sibling navigation is critical. It is used in far more contexts than you probably assume it is. It is a foundational concept in Combobulate. Read Combobulate: Intuitive, Structured Navigation with Tree-Sitter for more information.

  3. You will be the primary contact if there are bug reports and issues with the language. I will do what I can to help behind the scenes, and I will try to fix things on your behalf also, but it is not an ironclad guarantee that I will respond in time or deal with esoteric language requests.

  4. If your language has an official major mode in Emacs core (be it Emacs 29 or a later version), you should build your support in Combobulate with that grammar version in mind. Grammars are loosely (and poorly) versioned and it is very difficult to tell them apart as the compiled grammars don't export a version tag for stultifying reasons I cannot fathom.

    If there is no such mode in core, you would do well to reach out to any third-party tree-sitter major mode author(s) and ensure there's some sort of convergence or agreement on the version you wish to use.

    Regardless, you'll have to make a note of the version you are building against and remember it.

  5. You are required to write tests that test most of the functionality you are asked to implement as per #2 above.

    Not to worry: this is actually super easy and requires little in the way of actual coding, as I will explain soon.

Background Info

Extending Combobulate is relatively easy, provided you know a little bit about the procedure DSL it uses, and a general idea of how Combobulate is supposed to work.

Do read Combobulate: Intuitive, Structured Navigation with Tree-Sitter first. It lays some very important groundwork and it explains why you have to use the procedure DSL in the first place.

Unlike a lot of other "structured editing" packages, Combobulate does not require hand-tuned code to work with your language. Hand-tuned code goes against the philosophy of Combobulate:

Every language must declaratively describe its relationship with the underlying concrete syntax tree using a simple DSL. Combobulate has no affordances for shims or quick hacks to make certain features work better with a language.

Its design is set up in such a way that, provided you can correctly declare the procedural rules (see below), your language will work with Combobulate's editing and movement facilities with no further work on your part.

But first, let's talk about how the grammars are built. It'll help you understand why Combobulate has to do what it does.

A Quick Overview of How the Grammars are Built

Tree-sitter grammars are built principally in C, but driven by a JavaScript 'grammar specifier' that works a little bit like EBNF. If you're not familiar with Extended Backus-Naur Form, it's a formal description of a language --- or it would be, if most languages actually kept up-to-date EBNFs of their syntax. Most don't. Most languages' EBNFs miss out on a lot of subtlety and language design complexity that you can only really discover by looking at the official compiler/interpreter for that language.

One of the bigger problems with writing a TS grammar is that information on language syntax is often sparse or reverse-engineered. Some languages defy all formalization because some aspects of their language are unspecifiable: the preprocessor in C is a good example. To truly parse C you must run the preprocessor code to derive the actual C code your compiler sees. TS does not do that, so it, ironically enough, is useless when you're using it with Emacs's own source code.

The Concrete Syntax Tree (CST) that TS yields is a compromise. As a grammar writer you have to make certain trade-offs when you decide on the final structure of your tree. You will most likely have to do all sorts of contortions to ensure that gnarly precedence-rules are properly captured if there are ambiguities in your language. That (along with simply having lots of production rules) will result in an unwieldy and memory-hungry tree.

Some production rules are pruned or merged into inlined rules and supertypes. Both typically start with _, such as _statement or _expression. That is a brief summary of what happens.

This is very much a tastemaker thing: a grammar writer makes a reasonable guess at what a language's CST should or should not have, and hopes that is good enough: specific enough to parse actual, real-life prose or code with a level of clarity and detail you can actually use; and generic enough that it's possible to write coherent tree-sitter queries for syntax highlighting or what have you.

Which then brings me to an annoying part of the tree-sitter grammars' development cycle: they often move stuff around or outright change node structure in point releases, as is their wont, I suppose.

Annoyingly -- very much so -- a single invalid pattern breaks the whole tree-sitter query. It's common to write large queries using the alternation clause [ Q_1 ... Q_N ] to capture as many 'like' queries as possible in one go: it's more efficient that way. But if one of those queries fails, the whole query fails.

Emacs's poor tree-sitter font lock engine bears some of the responsibility here: if it encounters a bad query it seizes up and stops font locking altogether instead of recovering somewhat gracefully by continuing with the other rules.

So. You will have to deal with broken queries and changing tree structure at some point. There is no formal or best guidance -- yet -- on how (collectively) that should work.

When it's all said and done, you're left with a bunch of auto-generated files. This includes the parser.c that was generated from your JavaScript 'grammar' (and the optional, handwritten scanner.c) and the two files that are germane to Combobulate: the grammar.json and node-types.json files.

The Node Types and Grammar JSON Files

The node-types.json and grammar.json files contain bowdlerized versions of the grammar you wrote. It is not possible to truly reconstruct the parser using the "grammar" found in grammar.json, but it does contain a rough outline of the production rules: terminals and non-terminals and how they fit together.

Presently, it is not possible to ask tree-sitter (and therefore the grammar) what node types it has. Tree-sitter cannot tell you its node types nor its relationships directly, despite having some vague notion of them in the query engine.

Imagine if we could ask a tree-sitter grammar library to give us its grammar.json and node-types.json file? Think of the hardships we could avoid around shifting syntax structures; broken font lock queries; and manual parsing.

Yeah, well, we can't.

To work around this, the build-relationships.py script in Combobulate's repo parses the node-types.json file of every grammar listed in sources.ini.

A bunch of node information, including parental relationships between nodes, is extracted and a combobulate-rules.el file is generated covering all the grammars listed in sources.ini.

This information is very important and Combobulate cannot function without it. The query builder (C-c o B q) uses it for syntax highlighting and code completion; and the procedure system uses it to match node types.

(Note: It's a long-standing task of mine to rewrite build-relationships.py as an elisp script to remove the Python dependency.)

Open up combobulate-rules.el and you'll see that it's just a bunch of nodes and their parental relationships and the fields (if any) that are associated with them.

Right. Let's start adding support for a language.

Configuring Combobulate

First things first. Clone Combobulate and use the development branch. It's often ahead of master.

Next, you're going to want to load some extra modules not loaded by default. Here's an example elisp snippet. By all means customize it to suit your needs:

(use-package combobulate
    :config
    (require 'combobulate-test-prelude)
    (require 'combobulate-debug)
    :load-path ("~/.emacs.d/packages/combobulate/"
                "~/.emacs.d/packages/combobulate/tests/"))

You'll need to load combobulate-test-prelude.el (found in tests/) and also combobulate-debug.el.

Rebuilding the Rules

Next, you'll need to build the relationships for your language. Open sources.ini and add the respective raw github URLs to the version files.

Now run, with Python 3.11+, build-relationships.py to download and compile the new combobulate-rules.el file.

Once that's working you can safely re-evaluate combobulate-rules.el and now you can use C-c o B q to interactively build queries in your language. If it doesn't work, restart your Emacs and try again.

Default Definitions and Shorthands

Open combobulate-setup.el and look for the defconst form called combobulate-default-definitions.

That is the list of definitions that makes up a language. Not all definitions have to be changed or even set for every language. Some must be set by you. There is no way of seeing which by looking at the definitions list, unfortunately, but I will outline the ones that matter most here.

This defconst form is used to generate the variables for your language. The association keys you see in that form (procedures-defun, etc.) are called shorthands. Open up any Combobulate-supported buffer and type M-: (combobulate-read SHORTHAND) to read the value of any shorthand definition.

You do not need to edit anything here, but it is useful to know where Combobulate sources its definitions from.

Later, when we generate the language definition with define-combobulate-language, this default definition profile plus your own definitions are used to generate a large tract of defvar and defcustom forms that make up your language's definitions.

Combobulate didn't always use this system. It used an awful lot of buffer-local variables which made overriding and customizing things (for end users) difficult and not very maintainable. This new system generates a handful of customize-friendly settings (see M-x customize-group RET combobulate RET.)

Definitions and What They Mean

These are the definitions you must know about. You won't always have to override all of them, but most procedures should be set.

procedures-defun

This is a list of procedures that determine what a defun is. In most languages it's functions and classes; in things like Markdown, it might be the headers. C-M-a and C-M-e are two commands that rely on it, along with mark-defun, which is bound to C-M-h.

procedures-hierarchy

This is a list of procedures that determine the parent-child relationship between nodes. It is used by C-M-d and C-M-u specifically. You will need this to ensure smooth navigation between major elements of the language. For instance, between nested HTML elements; moving from a function definition into its actual code; and so on.

procedures-default

This is the default procedure to apply if there are no other applicable procedures. This defaults to all possible node types. This is a sensible default to start with.

procedures-sequence

Sequences are things that are not siblings but still tangibly related. For instance, the tag name in HTML or JSX tags. They are not next to each other (indeed, the beginning and end tags may be far apart with lots of other things between them) and yet they are connected.

What constitutes a sequence is debatable. HTML tags do; but perhaps other things do as well. It's OK to leave this blank.

Sequence navigation uses M-n and M-p.

display-ignored-node-types

Combobulate shows a tree outline in the echo area. Some node types are undesirable and must be hidden. This is most often the case when you pretty-print a parent node so it includes information sourced from a child node.

Leaving it blank is usually fine.

plausible-separators

Combobulate does not yet use grammar.json to recall the relationship between nodes, particularly when it comes to terminals (also called "anonymous nodes" in tree-sitter) that are frequently used to separate items --- like , often is in a list of array elements.

You should enter the most plausible separators used in your language here. Usually that's ,.

procedure-discard-rules

List of rules that Combobulate must automatically flag as discarded when it encounters them during a procedure search. This is nearly always comment, if comment is a line-based comment. Line-based comments occupy a weird place in a CST, and they can muck up Combobulate.

indent-after-edit

Do not set.

indent-calculate-function

Function that calculates the baseline indentation for a given position. This is of interest only to whitespace-sensitive languages like Python.

envelope-indent-region-function

The function to use to indent a region. Defaults to indent-region which is fine if you're not using a whitespace-sensitive language.

envelope-deindent-function

As above, but for deindentation of whitespace-sensitive languages.

envelope-procedure-shorthand-alist

An envelope is a code template. Envelopes use the procedure system to determine how to expand a template such that it captures (wraps) the right node at point if there is no region active. For instance, C-c o e t wraps an HTML tag in another tag, and the procedure rule is set so that it only works inside HTML tags.

Many envelopes use the same expansion rules; this shorthand alist maps a symbol (the key) to a procedure (the value).

envelope-procedure-shorthand-default-alist

Default shorthands for envelopes to always make available in languages.

envelope-default-list

Envelopes to create in each language by default. There's usually only one such envelope: M-).

envelope-list

List of envelopes that this language supports. See other languages' definitions for examples.

highlight-queries-default

Combobulate can highlight queries (by installing new font lock rules, essentially) by default.

You can easily leave this blank.

context-nodes

Context nodes is a list of node types that are contextual in your language. For example, identifier is typically a leaf node that holds the name of a variable or the name of a function. That makes it contextual: it holds important semantic information about something.

Put all the node types that host important semantic information, such as property names, constant values, and so on, here. It's rarely more than a dozen nodes; in some languages it's just one or two.

pretty-print-node-name-function

You can pretty print the display of node names in many places in Combobulate. Use your own function here to do this.

procedures-logical

Logical navigation is bound to M-a and M-e. These commands move to the next logical node after or before point. It defaults to all possible node types, and this is usually the right default.

procedures-sibling

Sibling navigation really means picking the right siblings as point will often intersect many nodes, each having its own siblings. Sibling navigation is essential to get right and it must work consistently and everywhere. You navigate by siblings with C-M-n and C-M-p.

default-procedures and default-nodes

Internal variables. Do not set.

procedures-edit

Do not set.

Building your language template

Create a combobulate-<LANGUAGE>.el file. Ensure you follow the usual GNU Emacs header rules for elisp or crib one from one of the other language files.

Your skeleton should look a little bit like this:

(eval-and-compile
  (defconst combobulate-LANGUAGE-definitions
    '(;; ... DEFINITIONS ...
      )))

(define-combobulate-language
  :name NAME
  :language LANGUAGE
  :major-modes (LANGUAGE-mode LANGUAGE-ts-mode)
  :custom combobulate-LANGUAGE-definitions
  :extra-defcustoms EXTRA-DEFCUSTOMS
  :setup-fn combobulate-LANGUAGE-setup)

(defun combobulate-LANGUAGE-setup (_))

Where LANGUAGE, of course, is the lowercased name of your language, such as javascript.

The keyword :major-modes is a list of major modes where Combobulate can safely enable this language. Combobulate works in "tree-sitter major modes" and classic "non-tree-sitter major modes". So if you only have a classic major mode and no TS major mode? No problem.

Some languages have more than one competing major mode. By all means add more than two if you need to. Try and beat the ridiculous number of JavaScript modes!

The :name field is the lowercased name of the language to use in the variable names. Usually it's the same as :language.

The :custom field is the form of definitions you're going to define going forward, and :extra-defcustoms is a special definition (defined much like combobulate-LANGUAGE-definitions) but it's for defcustom forms unique to your language. You only need this if you have built specialized features for the language, so it's often just left blank. See combobulate-jsx-extra-defcustoms for an example.

Finally, :setup-fn is a symbol for the function to call when Combobulate initializes your language in each buffer.

This blank canvas should be enough for Combobulate to activate when you run M-x combobulate-mode. Please do try it out with nothing more than an empty definitions list. If it works, you'll see a © in your mode line, and C-c o o (or whatever your keybinding prefix is) should also work.

Let's talk about how to build the procedures.

Procedures & Rules

A procedure is a declarative mini-DSL that tells Combobulate how to pick nodes at, or near, a given point in the buffer.

A rule is another little mini language that retrieves node types from combobulate-rules.el and then filters these generic node types based on simple set theory (or Emacs's version of set theory, which is nowhere near as rigorous as it is in discrete mathematics.)

The rules inform the procedural system, so we will start with them, but before we do that, you'll want to ensure you know how to quickly evaluate elisp. I'm going to use M-x ielm to demonstrate this, and I recommend you do the same.

Configuring IELM

Run M-x ielm and C-c C-b to set its buffer to a buffer with Combobulate set up with your language. The basic, empty template is all you need here.

I recommend you eval this:

(defalias 'er 'combobulate-procedure-expand-rules)

For the purpose of experimenting with the rules system.

Rules

A rule, when executed with combobulate-procedure-expand-rules (henceforth, er), is expanded to a flat list of node types.

You can, if you so desire (but please don't), write out all the node types you want a procedure to use by hand. The reason I don't do this is that some languages have hundreds of node types, and one node type is often used in many places, as dictated by the language.

The docstring in er has a complete list of the rules to help guide you, but I'll walk you through the ideas behind it.

er takes as an argument a list of RULES. To make life a little bit easier, it will also accept a string, and the atom t.

er is also recursive. Many of its forms take RULES as input, so you can build rather complex queries.

When you start looking at the trees of code in any detail, patterns quickly emerge. Languages are generally built on a number of shared assumptions around what can go where. Some languages have statements and expressions, and where you can use a statement and an expression are often different enough that you can consider a statement to be one class of code and expressions another.

So, for example, if you want to build the sibling navigation code in Python so you can move between "lines of code" as a human would think of them, you might write a def foo(): ... function and use a tool like treesit-explore-mode to have a look at what it generates.

You might realize, from experience, that you can put a with_statement, an if_statement, a for_statement etc. inside a function and you might start enumerating all these things out by hand: ("for_statement" "with_statement" ...).

That'd work. I guess. Probably. But you might -- will -- miss things. You'll also miss things if new features are added to the language at a later date.

What you want is a way of declaring, roughly, what you want, without getting too specific, unless you absolutely have to.

But you can also ask er to tell you all nodes that can take these statements:

ELISP> (er '(irule "if_statement"))
("module" "block" "_compound_statement")

By asking er what the inverse rule of if_statement is, we're told it has module, block and _compound_statement.

So if we went about this naively, we might miss that, yes, you can put an if statement at the module level; and that they also go inside something called a block; and also this weird meta-node (it begins with _, meaning it is not shown in the actual tree and as such, is not a real node) called _compound_statement.

What can we infer from these simple things?

Well, let's ask er to tell us what goes inside a module:

ELISP> (er '(rule "module"))
("if_statement" "class_definition" "try_statement" "global_statement"
 "decorated_definition" "match_statement" "type_alias_statement"
 "function_definition" "pass_statement" "with_statement"
 "print_statement" "assert_statement" "import_statement"
 "break_statement" "continue_statement" "raise_statement"
 "nonlocal_statement" "return_statement" "exec_statement"
 "for_statement" "import_from_statement" "delete_statement"
 "while_statement" "future_import_statement" "expression_statement")

And we see a range of statements, including some you'd probably miss if you did this by hand, even if you are a deft hand at Python. (You'd for sure miss the useful and underused special statement nonlocal.)

Put another way: irule finds all parents that can take that particular node type, and rule returns all node types that have it as a parent.
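To make the relationship between rule and irule concrete, here is a minimal sketch in plain Emacs Lisp. It does not call Combobulate at all: toy-grammar, toy-rule and toy-irule are hypothetical names, and the table is a hand-written stand-in for the data in combobulate-rules.el, not real grammar output.

```elisp
;; Toy model only. This is NOT Combobulate's implementation and the
;; table below is made up for illustration.
(require 'seq)

(defvar toy-grammar
  '(("module" "if_statement" "for_statement" "expression_statement")
    ("block"  "if_statement" "return_statement" "expression_statement"))
  "Hypothetical alist mapping a parent node type to its child node types.")

(defun toy-rule (parent)
  "Return the node types that can appear inside PARENT."
  (cdr (assoc parent toy-grammar)))

(defun toy-irule (child)
  "Return the parent node types that can contain CHILD."
  (mapcar #'car
          (seq-filter (lambda (entry) (member child (cdr entry)))
                      toy-grammar)))

(toy-rule "block")              ; the children of block
(toy-irule "if_statement")      ; => ("module" "block")
(toy-irule "return_statement")  ; => ("block")
```

The two functions are simply the same table read in opposite directions, which is why combining them lets you hop from a node to its parents and back down to its peers.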

As it turns out, with these two commands alone, you can often pick the right types of nodes you want in just a few simple rules.

The er command can take multiple rules at the same time. Because er operates on the notion of sets, the rules are union'ed together (and duplicates removed):

ELISP> (er '((rule "list") (rule "tuple") (rule "set")))
("list_splat" "not_operator" "conditional_expression" "lambda"
 "unary_operator" "dictionary" "string" "tuple" "identifier" "true"
 "integer" "float" "list_comprehension" "false" "set_comprehension"
 "ellipsis" "none" "set" "await" "binary_operator"
 "generator_expression" "call" "attribute" "concatenated_string"
 "parenthesized_expression" "subscript" "list"
 "dictionary_comprehension" "comparison_operator" "boolean_operator"
 "as_pattern" "named_expression" "parenthesized_list_splat" "yield")

Sometimes, you just want everything:

ELISP> (er t)
("_compound_statement" "_simple_statement"  ... )

This is the same as passing (all):

ELISP> (er '((all)))
("_compound_statement" "_simple_statement"  ... )

That does occasionally come in handy when you want everything except a couple of things.

Let's say you want to do stuff with the nodes that can go inside a list, but you don't want anything that has expression as a rule parent:

ELISP> (er '(rule "expression"))
("not_operator" "conditional_expression" "lambda" "unary_operator"
 "dictionary" "string" "tuple" "identifier" "true" "integer" "float"
 "list_comprehension" "false" "list_splat" "set_comprehension"
 "ellipsis" "none" "set" "await" "binary_operator"
 "generator_expression" "call" "attribute" "concatenated_string"
 "parenthesized_expression" "subscript" "list"
 "dictionary_comprehension" "comparison_operator" "boolean_operator"
 "as_pattern" "named_expression")

ELISP> (er '(exclude (rule "list") (rule "expression")))
("parenthesized_list_splat" "yield")

The (exclude INCLUSIONS EXCLUSIONS) form does just that. It subtracts the EXCLUSIONS from INCLUSIONS and returns the remainder. That means order matters. It will not return the same results if you switch the inputs around.

Note that both INCLUSIONS and EXCLUSIONS accept RULES, so you can use nested and recursive expressions.
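The set behaviour described here (union of several rules, and order-sensitive subtraction with exclude) can be sketched with seq.el alone. This is an illustration under stated assumptions, not how er is implemented: toy-union and toy-exclude are hypothetical helpers and the node lists are shortened by hand.

```elisp
;; Toy model of the set operations only; it does not call Combobulate.
(require 'seq)

(defun toy-union (&rest sets)
  "Union SETS of node-type strings, dropping duplicates."
  (seq-uniq (apply #'append sets)))

(defun toy-exclude (inclusions exclusions)
  "Return INCLUSIONS minus every member of EXCLUSIONS."
  (seq-difference inclusions exclusions))

(let ((list-children '("identifier" "string" "yield"))
      (expr-children '("identifier" "string" "lambda")))
  (list (toy-exclude list-children expr-children)   ; => ("yield")
        (toy-exclude expr-children list-children)   ; => ("lambda")
        (toy-union list-children expr-children)))   ; => ("identifier" "string" "yield" "lambda")
```

Swapping the two arguments of toy-exclude yields a different remainder, which is the same reason the order of INCLUSIONS and EXCLUSIONS matters.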

Having said that, it is highly unlikely you need to do anything beyond the bare minimum, unless your language's grammar is truly atrocious.

That is more or less all there is to know. The purpose of the rules engine is to make it easy to retrieve and filter collections of nodes without having to spell out every single one of them manually.

To summarize:

  1. Instead of typing out all possible node types that you want to navigate by, it's often easier to use their common parent node and ask Combobulate to give you all the node types that can appear in it:

    (rule "block")
    

    Instead of with_statement, if_statement, etc.

    This is what the rules system is here to do.

  2. Occasionally, the parent node you want to use has too much in it, perhaps because it is shared in multiple places, but you only want to match some of them. Using (exclude ...) is helpful in this case.

Let's move on to the final piece of the puzzle.

Procedures

When you ask Combobulate to navigate to the next sibling, or move up or down the hierarchy of nodes, it must make an informed choice: which of the nodes at (or near) point is the right one to choose when deciding what the next sibling is; what the parent or child is; or indeed any other movement and editing command in Combobulate.

One way is the way it's done now in Combobulate: using a simple set of rules, we can tell Combobulate how to do sibling navigation so that it picks the right node(s).

A procedure is made up of two parts: a set of activation nodes and the selector.

This is the detailed description copied from the docstring of combobulate-procedure-apply:

PROCEDURE is a form matching the following pattern:

  (:activation-nodes (ACTIVATION-NODE-RULES ...)
   :selector SELECTOR-RULES)

Where ACTIVATION-NODE-RULES is a list of activation nodes, each
of which is a form matching the following pattern:

   (:nodes RULES
    [:position POSITION-RULE]
    [:has-parent HAS-PARENT-RULE]
    [:has-fields FIELDS]
    [:has-ancestor HAS-ANCESTOR-RULE])

Where RULES is one or more rules outlined in
‘combobulate-procedure-expand-rules’, and POSITION-RULE is one
of:

  ‘any’, meaning point is anywhere in the node, and is the default;
  ‘in’, that it is *not* at the beginning;
  and ‘at’, that it must be at the exact beginning of the node.

HAS-PARENT-RULE and HAS-ANCESTOR-RULE are optional, though at
most one can be used in an activation node rule. They each take a
list of RULES.

HAS-PARENT-RULE checks if the immediate parent of the action node
matches the rules, while HAS-ANCESTOR-RULE checks if any ancestor
of the action node matches the rules.

HAS-FIELD-RULE is a list of fields the action node must be
considered to be inside of.

SELECTOR-RULES is a form matching the following pattern:

   (:choose <CHOICE>
    <MATCHER PROPERTY>)

Where <MATCHER PROPERTY> is one of:

    :match-query <QUERY-MATCHER>
    :match-children <NODE-MATCHER>
    :match-siblings <NODE-MATCHER>

And CHOICE is either ‘node’ or ‘parent’ (the default), indicating whether the
selector should operate on the action node or its matching
parent.

NODE-MATCHER must be either ‘t’, indicating all node types (but
not anonymous nodes); or a form of the following pattern:

   (:match-rules <RULES|t>
    :discard-rules <RULES|t>
    [:anonymous <nil|t>]
    [:default-mark <@match|@discard>])

Only one of ‘:match-rules’ or ‘:discard-rules’ can be used in
NODE-MATCHER. ‘:anonymous’ is a boolean and defaults to nil, and
‘:default-mark’ is a symbol and defaults to
‘@match’. ‘:default-mark’ is the tie-breaker when a node is not
matched by ‘:match-rules’ or ‘:discard-rules’.

QUERY-MATCHER must be of the form:

   (:query <QUERY>
    [:engine <combobulate|treesitter>]
    [:discard-rules <RULES>])

Note that QUERY can be either Combobulate’s internal query
language to non-recursively match against NODE, or a regular
tree-sitter query that recursively matches against NODE and any
children of NODE. The ‘:engine’ flag determines which, and it
defaults to ‘combobulate’. ‘:discard-rules’ is a list of rules
(or ‘t’ indicating match everything, but not anonymous nodes).

Each procedure -- such as procedures-sibling or procedures-hierarchy -- can, and probably will, have more than one distinct procedure. You will most likely need 10-12 for siblings and probably half that for hierarchy. The rest? One to three. Usually.

When you use just about any command in Combobulate a procedure is likely triggered. How exactly it chooses nodes is an implementation detail and may change. However, in broad terms, it'll look at all possible nodes on or near point and iterate through them to try and find one that matches a procedure. Try it: M-: (combobulate-get-parents (combobulate-node-at-point)) to see just how many there might be at any given point.

Each procedure will usually cover one part of the grammar that is superficially similar enough that you can get away with one procedure.

Here's an example of what a sibling procedure looks like in Go:

(:activation-nodes
  ((:nodes ((rule "block") (rule "source_file"))
    :position at
    :has-parent ("block" "source_file")))
  :selector (:choose parent :match-children t))

Activation Nodes

The :activation-nodes takes a list of activation node specifiers.

Each specifier must have a :nodes keyword. It takes a rule (in the form of a string or list) made up of RULES as per the rules section above.

The other keywords are optional. The :position keyword mandates where point must be in relation to the activating node. Here it says at, meaning your point must be at the start of the node: not inside it; not at the end; but at the start. :position can also be in, meaning point must be inside (but not at!) a node; or any, meaning either at or in.

Believe it or not, positioning is important. For example, in JSX/HTML, if you're at an HTML element, Combobulate will go to the child element with C-M-d, but if you're inside the element itself, it'll go to the attributes instead.

For most things, you can use any (the default if :position is not specified).
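The three :position values can be modelled as a small predicate. This is only a sketch: toy-position-match-p is a hypothetical function, and treating the node as start-inclusive and end-exclusive is an assumption of this example, not something the docstring states.

```elisp
;; Toy model of :position. Not Combobulate code.
(defun toy-position-match-p (position point node-start node-end)
  "Return non-nil if POINT satisfies POSITION for a node.
The node is assumed to span NODE-START (inclusive) to
NODE-END (exclusive)."
  (let ((inside (and (>= point node-start) (< point node-end))))
    (pcase position
      ('at  (= point node-start))                 ; exactly at the start
      ('in  (and inside (/= point node-start)))   ; inside, but not at the start
      ('any inside)                               ; either of the above
      (_ (error "Unknown position: %S" position)))))

(toy-position-match-p 'at 10 10 20)   ; => t
(toy-position-match-p 'in 10 10 20)   ; => nil
(toy-position-match-p 'in 15 10 20)   ; => t
(toy-position-match-p 'any 10 10 20)  ; => t
```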

Finally, there are the optional :has-parent and :has-ancestor keywords. Sometimes, you want to do something with a common node type but only if the parent or ancestor matches one or more RULES.

The difference between :has-parent and :has-ancestor is that the former checks only the immediate parent of the node, whereas :has-ancestor scans as far back as it has to for a match.
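A rough way to picture the distinction, assuming you already have the chain of ancestor node types ordered nearest-first. Both helpers and the example chain are hypothetical and exist only for this illustration:

```elisp
;; Toy model of :has-parent versus :has-ancestor. Not Combobulate code.
(require 'seq)

(defun toy-has-parent-p (ancestors types)
  "Non-nil if the nearest entry of ANCESTORS is one of TYPES."
  (and (member (car ancestors) types) t))

(defun toy-has-ancestor-p (ancestors types)
  "Non-nil if any entry of ANCESTORS is one of TYPES."
  (and (seq-some (lambda (type) (member type types)) ancestors) t))

;; Hypothetical ancestor chain for a node, nearest parent first.
(let ((ancestors '("expression_statement" "block" "function_declaration")))
  (list (toy-has-parent-p ancestors '("block"))      ; => nil
        (toy-has-ancestor-p ancestors '("block"))))  ; => t
```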

You can have more than one specifier in :activation-nodes. It takes a list of specifiers as I said before. So, if you have diverse trigger checks, each requiring different parents or positioning, you can combine them.

If no activation node specifier matches, Combobulate will try the next procedure in the list. Combobulate always starts at the beginning of the list and stops when the first match happens.

Therefore: put the most generic rules at the end of the list!
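The first-match-wins behaviour is why ordering matters. Here is a small sketch of the idea, using simplified plists with hypothetical :name and :nodes keys rather than real procedure forms:

```elisp
;; Toy model of "first matching procedure wins". Not Combobulate code;
;; the :name and :nodes keys are invented for this example.
(require 'seq)

(defvar toy-procedures
  '((:name specific :nodes ("tag_name"))
    (:name generic  :nodes ("tag_name" "identifier" "string")))
  "Hypothetical procedure list, most specific entry first.")

(defun toy-first-matching-procedure (node-type procedures)
  "Return the first entry of PROCEDURES whose :nodes include NODE-TYPE."
  (seq-find (lambda (procedure)
              (member node-type (plist-get procedure :nodes)))
            procedures))

(plist-get (toy-first-matching-procedure "tag_name" toy-procedures) :name) ; => specific
(plist-get (toy-first-matching-procedure "string" toy-procedures) :name)   ; => generic
```

If the two entries were swapped, the generic one would also claim tag_name and the specific one would never be reached.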

Selectors

The selector is only ever executed if an activation node rule matched. You can tell it did, because Combobulate's combobulate-procedure-result struct, and more on those in a bit, will say it has.

The purpose of a selector is to return nodes given a starting node. Think about it: the only way I can get the children, the siblings, or the result of a tree-sitter query is with a starting node. Everything operates from a node, even if that node is the root node of the tree.

Selectors operate from a given vantage. That is controlled by :choose, which can be either node or parent.

Here node means the node that activated the procedure. For example, that might be tag_name in an HTML start_tag. So if tag_name is the node you wish to find the children of, then :choose node is the choice for you.

Often, you want to use a parent node; this is doubly true if you use :has-parent or :has-ancestor. But it is not a requirement. You can mandate that a parent matches a rule without using the matched parent as the vantage point.

Once you have a chosen node, that node is given directly to the actual selector: :match-query, :match-children, or :match-siblings. You can only have one match selector.

These commands do two things. They first get the nodes directly from the CST or a query engine:

  • :match-children is equal to calling (combobulate-node-children NODE) where NODE is the chosen node.
  • :match-siblings is equal to (combobulate-linear-siblings NODE).
  • :match-query is equal to either (combobulate-query-capture NODE QUERY) (if you use the tree-sitter engine) or (combobulate-query-search NODE QUERY) if you use Combobulate's internal query engine.

Next, a filter is applied, as per the documentation and type of matcher:

NODE-MATCHER must be either ‘t’, indicating all node types (but
not anonymous nodes); or a form of the following pattern:

   (:match-rules <RULES|t>
    :discard-rules <RULES|t>
    [:anonymous <nil|t>]
    [:default-mark <@match|@discard>])

Only one of ‘:match-rules’ or ‘:discard-rules’ can be used in
NODE-MATCHER. ‘:anonymous’ is a boolean and defaults to nil, and
‘:default-mark’ is a symbol and defaults to
‘@match’. ‘:default-mark’ is the tie-breaker when a node is not
matched by ‘:match-rules’ or ‘:discard-rules’.

QUERY-MATCHER must be of the form:

   (:query <QUERY>
    [:engine <combobulate|treesitter>]
    [:discard-rules <RULES>])

Note that QUERY can be either Combobulate’s internal query
language to non-recursively match against NODE, or a regular
tree-sitter query that recursively matches against NODE and any
children of NODE. The ‘:engine’ flag determines which, and it
defaults to ‘combobulate’. ‘:discard-rules’ is a list of rules
(or ‘t’ indicating match everything, but not anonymous nodes).

A common pattern is to simply pick every child or sibling. In that case, you can use :match-children t for example.

If you want to keep only certain node types, or discard certain node types, you must use the form above.
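Here is a small sketch of the three shapes a node matcher can take inside a selector. The variable name and the node types are hypothetical. The explicit :default-mark @discard in the second entry is my own cautious reading of the documentation above, so that anything not matched by :match-rules is dropped; it may well be unnecessary.

```elisp
;; Hypothetical variable; node types are illustrative.
(defvar my-foo-selector-examples
  '(;; Take every named child of the activation node.
    (:choose node :match-children t)
    ;; Keep only "pair" children of the parent; everything else is
    ;; explicitly marked as discarded.
    (:choose parent
     :match-children (:match-rules ("pair")
                      :default-mark @discard))
    ;; Take every named sibling except comments.
    (:choose node
     :match-siblings (:discard-rules ("comment"))))
  "Illustrative selectors using `t', :match-rules and :discard-rules.")
```

Each entry is the value you would give to :selector in a procedure.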

For queries it works much the same. You must choose an engine and maybe apply optional discard rules to those matches.

Writing a query is a matter of practice. But how do you test and experiment? Well, that's easy enough now that I've added some helper functions.

Testing / Experimenting with Procedures

The simplest way is to call the procedure function directly.

But first, to cut down on typing during testing:

(defalias 'cps 'combobulate-procedure-start)

Now you can feed it a list of procedures and a starting node. Make sure you run this command in a buffer that has a valid tree-sitter parser set up for the language you are writing. The most common choice of starting node is the smallest named node at point, which combobulate-node-at-point returns. Combobulate will expand the search as required.

Combobulate may return multiple procedure results. After all, there may well be more than one that correctly matches. How multiple matches are used is an implementation detail. I recommend you make your procedures as unambiguous as possible.

It's a bit hard to read and understand the output. To make life easier, you can use this helper function instead:

(combobulate-test-mark-according-to-procedure BUF PROCEDURES &optional PT)

Give it the string name of a buffer to run the procedures on. Ensure your point is in the place you're testing, and give it the procedures you've written. Do not mix and match different types of procedures: if you're doing siblings, just do those.
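A minimal sketch of such a call, following the signature shown above. The procedure list, its node types and the buffer name dictionary.foo are all hypothetical, and the sketch assumes the helper is already loaded in your Emacs session; the fboundp guard simply makes the form harmless if it is not.

```elisp
;; Hypothetical sibling procedures; node types are illustrative.
(defvar my-foo-test-procedures
  '((:activation-nodes
     ((:nodes ("pair") :has-parent ("dictionary")))
     :selector (:choose parent :match-children t)))
  "Illustrative sibling procedures to try out.")

;; Only call the helper if it is actually defined in this session.
(when (fboundp 'combobulate-test-mark-according-to-procedure)
  (combobulate-test-mark-according-to-procedure
   "dictionary.foo"           ; string name of the buffer under test
   my-foo-test-procedures))   ; the optional PT argument is omitted
```

Evaluate it with M-: or from a scratch buffer while the buffer you named is open and point is where you want to test.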

You'll see a buffer pop up with a pretty-printed set of procedure results in an outline mode buffer.

Make a note of the hyperlink "Place Text Fixture Overlays."

The Selected Nodes headline is the interesting one. It tells you the matched and discarded node types it encountered. The Matched Nodes headline is just like the selected nodes, but with the discarded nodes filtered out and the cons cells replaced with a flat list of matches.

Verify that you are happy with the procedure.

When you are, it's time to write some actual integration tests. It's really quite easy.

Writing Tests

So here's the best way to write a functional procedure and a test for it, step-by-step.

  1. Open tests/fixtures/ and make a note of the directory structure. They're more or less named after the features Combobulate has. You only need to worry about:
    • down for C-M-d hierarchy procedures.
    • sequence for specialized sequence procedures, if you require them.
    • sibling for the all-important sibling procedures.

  2. Go into one of the directories and create a new file with a sensible filename and extension. For example, dictionary.foo if you want to test sibling navigation in your foo language.

  3. Write out some basic code for the language and the feature you're trying to demonstrate.

    One language feature, one file.

  4. Save the file. Now, run the procedure result helper from before.

  5. Find the result that matches and press the hyperlink "Place Text Fixture Overlays."

    Combobulate will activate combobulate-test-fixture-mode in the file and prime it with file-local variables. It then inserts numbered overlays: 1 for the first match, 2 for the second, and so on up to 9.

    These numbers and their placements are VERY important.

    Combobulate's test harness will:

    1. Open the file;
    2. Jump to the first number;
    3. Execute a predefined function, such as combobulate-navigate-next, and validate that point is now on the next number (from 1 -> 2, for example), throwing an error if it is not;
    4. Repeat this procedure until it runs out of numbers.
  6. Provided the numbers are placed correctly, you can save the file. The test fixture is now done.

The next step is to generate the new tests.

Re-generating the tests

To do so, run make build-tests. You'll end up with a bunch of regenerated test files. Now use make run-tests to run the tests.

(You can also run these inside Docker using the supplied Dockerfile. There are makefile targets to help with this also.)

It works like this: writing movement tests by hand is cumbersome (hence the overlay magic), and testing the changes that editing commands make to code is even more annoying. Combobulate generates most of these tests from the fixture files and the resultant fixture-deltas directories.

When you build the tests, the Makefile deletes every single code-generated fixture delta and code-generated elisp file. They are then recreated.

Why? Because:

  • If a command has changed and now creates slightly different output (whitespace only, perhaps, or maybe something more significant) then Git will complain and tell you that certain fixture deltas have changed.

    That's a warning sign to me that maybe my code refactoring has broken something.

  • You may end up with new, uncommitted fixture deltas. That is usually OK. For example, if you add a sibling fixture file, you will end up with many fixture deltas for Combobulate's drag command tests, because -- surprise -- the sibling commands are used to drag.

Carefully review that the delta files look correct. You can always open up your fixture and try the drag commands yourself.

When you are happy that everything is working, you will be left with a bunch of new fixtures and fixture deltas alongside your new language definition.

All that's left is to use it for a while and ensure it's up to scratch, and then raise a PR and we can talk about the finer points.