docs: add more information on supertype nodes for grammars and queries

2025-09-14 05:40:19 -04:00 · 3a911d578c
commit 3a911d578c
parent 63f48afaeb
3 changed files with 89 additions and 18 deletions
--- a/docs/src/creating-parsers/2-the-grammar-dsl.md
+++ b/docs/src/creating-parsers/2-the-grammar-dsl.md
@@ -129,8 +129,11 @@ than globally. Can only be used with parse precedence, not lexical precedence.
 - **`word`** — the name of a token that will match keywords to the
 [keyword extraction][keyword-extraction] optimization.

-- **`supertypes`** — an array of hidden rule names which should be considered to be 'supertypes' in the generated
-[*node types* file][static-node-types].
+- **`supertypes`** — an array of rule names which should be considered to be 'supertypes' in the generated
+[*node types* file][static-node-types-supertypes]. Supertype rules are automatically hidden from the parse tree, regardless
+of whether their names start with an underscore. The main use case for supertypes is to group together multiple different
+kinds of nodes under a single abstract category, such as "expression" or "declaration". See the section on [using supertypes][supertypes]
+for more details.

 - **`reserved`** — similar in structure to the main `rules` property, an object of reserved word sets associated with an
 array of reserved rules. The reserved rule in the array must be a terminal token meaning it must be a string, regex, token,
@@ -144,11 +147,13 @@ empty array, signifying *no* keywords are reserved.
 [bison-dprec]: https://www.gnu.org/software/bison/manual/html_node/Generalized-LR-Parsing.html
 [ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
 [external-scanners]: ./4-external-scanners.md
-[extras]: ./3-writing-the-grammar.html#using-extras
+[extras]: ./3-writing-the-grammar.md#using-extras
 [keyword-extraction]: ./3-writing-the-grammar.md#keyword-extraction
 [lexical vs parse]: ./3-writing-the-grammar.md#lexical-precedence-vs-parse-precedence
 [lr-conflict]: https://en.wikipedia.org/wiki/LR_parser#Conflicts_in_the_constructed_tables
 [named-vs-anonymous-nodes]: ../using-parsers/2-basic-parsing.md#named-vs-anonymous-nodes
 [rust regex]: https://docs.rs/regex/1.1.8/regex/#grouping-and-flags
 [static-node-types]: ../using-parsers/6-static-node-types.md
+[static-node-types-supertypes]: ../using-parsers/6-static-node-types.md#supertype-nodes
+[supertypes]: ./3-writing-the-grammar.md#using-supertypes
 [yacc-prec]: https://docs.oracle.com/cd/E19504-01/802-5880/6i9k05dh3/index.html
--- a/docs/src/creating-parsers/3-writing-the-grammar.md
+++ b/docs/src/creating-parsers/3-writing-the-grammar.md
@@ -74,11 +74,11 @@ you might start with something like this:

    return_statement: $ => seq(
      'return',
-      $._expression,
+      $.expression,
      ';'
    ),

-    _expression: $ => choice(
+    expression: $ => choice(
      $.identifier,
      $.number
      // TODO: other kinds of expressions
@@ -202,7 +202,7 @@ To produce a readable syntax tree, we'd like to model JavaScript expressions usi
 {
  // ...

-  _expression: $ => choice(
+  expression: $ => choice(
    $.identifier,
    $.unary_expression,
    $.binary_expression,
@@ -210,14 +210,14 @@ To produce a readable syntax tree, we'd like to model JavaScript expressions usi
  ),

  unary_expression: $ => choice(
-    seq('-', $._expression),
-    seq('!', $._expression),
+    seq('-', $.expression),
+    seq('!', $.expression),
    // ...
  ),

  binary_expression: $ => choice(
-    seq($._expression, '*', $._expression),
-    seq($._expression, '+', $._expression),
+    seq($.expression, '*', $.expression),
+    seq($.expression, '+', $.expression),
    // ...
  ),
 }
@@ -252,7 +252,7 @@ ambiguity.
 For an expression like `-a * b`, it's not clear whether the `-` operator applies to the `a * b` or just to the `a`. This
 is where the `prec` function [described in the previous page][grammar dsl] comes into play. By wrapping a rule with `prec`,
 we can indicate that certain sequence of symbols should _bind to each other more tightly_ than others. For example, the
-`'-', $._expression` sequence in `unary_expression` should bind more tightly than the `$._expression, '+', $._expression`
+`'-', $.expression` sequence in `unary_expression` should bind more tightly than the `$.expression, '+', $.expression`
 sequence in `binary_expression`:

 ```js
@@ -263,8 +263,8 @@ sequence in `binary_expression`:
    prec(
      2,
      choice(
-        seq("-", $._expression),
-        seq("!", $._expression),
+        seq("-", $.expression),
+        seq("!", $.expression),
        // ...
      ),
    );
@@ -299,8 +299,8 @@ This is where `prec.left` and `prec.right` come into use. We want to select the
  // ...

  binary_expression: $ => choice(
-    prec.left(2, seq($._expression, '*', $._expression)),
-    prec.left(1, seq($._expression, '+', $._expression)),
+    prec.left(2, seq($.expression, '*', $.expression)),
+    prec.left(1, seq($.expression, '+', $.expression)),
    // ...
  ),
 }
@@ -476,6 +476,51 @@ typically in ways that don't affect the meaning of the pattern. For example, `\w
 to `[ \t\n\r]`, and `\d` to `[0-9]`. If you need more complex behavior, you can always use a more explicit regex.
 ```

+## Using Supertypes
+
+Some rules in your grammar will represent abstract categories of syntax nodes, such as "expression", "type", or "declaration".
+These rules are often defined as simple choices between several other rules. For example, in the JavaScript grammar, the
+`expression` rule is defined as a choice between many different kinds of expressions:
+
+```js
+expression: $ => choice(
+  $.identifier,
+  $.unary_expression,
+  $.binary_expression,
+  $.call_expression,
+  $.member_expression,
+  // ...
+),
+```
+
+By default, Tree-sitter will generate a visible node type for each of these abstract category rules, which can lead to
+unnecessarily deep and complex syntax trees. To avoid this, you can add these abstract category rules to the grammar's `supertypes`
+definition. Tree-sitter will then treat these rules as _supertypes_, and will not generate visible node types for them in
+the syntax tree.
+
+```js
+module.exports = grammar({
+  name: "javascript",
+
+  supertypes: $ => [
+    $.expression,
+  ],
+
+  rules: {
+    expression: $ => choice(
+      $.identifier,
+      // ...
+    ),
+
+    // ...
+  },
+});
+```
+
+Although supertype rules are hidden from the syntax tree, they can still be used in queries. See the chapter on
+[Query Syntax][query syntax] for more information.
+
 # Lexical Analysis

 Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and [lexing][lexing] — the
@@ -554,7 +599,7 @@ grammar({
  word: $ => $.identifier,

  rules: {
-    _expression: $ =>
+    expression: $ =>
      choice(
        $.identifier,
        $.unary_expression,
@@ -564,13 +609,13 @@ grammar({

    binary_expression: $ =>
      choice(
-        prec.left(1, seq($._expression, "instanceof", $._expression)),
+        prec.left(1, seq($.expression, "instanceof", $.expression)),
        // ...
      ),

    unary_expression: $ =>
      choice(
-        prec.left(2, seq("typeof", $._expression)),
+        prec.left(2, seq("typeof", $.expression)),
        // ...
      ),

@@ -607,5 +652,6 @@ rule that's called something else, you should just alias the word token instead,
 [field-names-section]: ../using-parsers/2-basic-parsing.md#node-field-names
 [non-terminal]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols
 [peg]: https://en.wikipedia.org/wiki/Parsing_expression_grammar
+[query syntax]: ../using-parsers/queries/1-syntax.md#supertype-nodes
 [tree-sitter-javascript]: https://github.com/tree-sitter/tree-sitter-javascript
 [yacc]: https://en.wikipedia.org/wiki/Yacc
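
Note on the "Using Supertypes" section added above: the generated node types file records each supertype as an entry with a `subtypes` array. The sketch below is only an illustration of how a consumer might expand such an entry into the concrete node types it groups. The `nodeTypes` data and the `concreteSubtypes` helper are hypothetical and are not part of Tree-sitter or of this commit.

```javascript
// Hypothetical excerpt shaped like entries in a generated node-types.json file.
// The rule names mirror the docs example; the data is illustrative only.
const nodeTypes = [
  {
    type: "expression",
    named: true,
    subtypes: [
      { type: "identifier", named: true },
      { type: "unary_expression", named: true },
      { type: "binary_expression", named: true },
    ],
  },
  { type: "identifier", named: true },
  { type: "unary_expression", named: true },
  { type: "binary_expression", named: true },
];

// Return the names of the concrete node types grouped under `name`.
// Entries with a `subtypes` array are treated as supertypes and expanded
// recursively; `seen` guards against visiting the same name twice.
function concreteSubtypes(types, name, seen = new Set()) {
  if (seen.has(name)) return [];
  seen.add(name);
  const entry = types.find((t) => t.type === name);
  if (!entry || !Array.isArray(entry.subtypes)) return [name];
  return entry.subtypes.flatMap((s) => concreteSubtypes(types, s.type, seen));
}

console.log(concreteSubtypes(nodeTypes, "expression"));
```

A name that has no `subtypes` array is returned unchanged, so the helper can be called on supertypes and ordinary node types alike.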
--- a/docs/src/using-parsers/queries/1-syntax.md
+++ b/docs/src/using-parsers/queries/1-syntax.md
@@ -96,6 +96,26 @@ by `(ERROR)` queries. Specific missing node types can also be queried:
 (MISSING ";") @missing-semicolon
 ```

+### Supertype Nodes
+
+Some node types are marked as _supertypes_ in a grammar. A supertype is a node type that contains multiple
+subtypes. For example, in the [JavaScript grammar example][grammar], `expression` is a supertype that can represent any kind
+of expression, such as a `binary_expression`, `call_expression`, or `identifier`. You can use supertypes in queries to match
+any of their subtypes, rather than having to list out each subtype individually. For example, this pattern would match any
+kind of expression, even though `expression` itself is not a visible node in the syntax tree:
+
+```query
+(expression) @any-expression
+```
+
+To query specific subtypes of a supertype, you can use the syntax `supertype/subtype`. For example, this pattern would
+match a `binary_expression` only if it is a child of `expression`:
+
+```query
+(expression/binary_expression) @binary-expression
+```
+
+[grammar]: ../../creating-parsers/3-writing-the-grammar.md#structuring-rules-well
 [node-field-names]: ../2-basic-parsing.md#node-field-names
 [named-vs-anonymous-nodes]: ../2-basic-parsing.md#named-vs-anonymous-nodes
 [s-exp]: https://en.wikipedia.org/wiki/S-expression
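
Note on the "Supertype Nodes" section added above: the two pattern forms, `(expression)` and `(expression/binary_expression)`, can be modeled as a check on node type names. The sketch below is a deliberately simplified model, not Tree-sitter's query engine. It looks only at type names and ignores where a node sits in the tree. The `supertypes` table and the `headMatches` helper are assumptions made for illustration.

```javascript
// Hypothetical table: supertype name -> names of its direct subtypes.
const supertypes = new Map([
  ["expression", ["identifier", "binary_expression", "call_expression"]],
]);

// Decide whether a node of type `nodeType` satisfies a simplified pattern
// head such as "expression" or "expression/binary_expression".
function headMatches(head, nodeType) {
  const [first, second] = head.split("/");
  const subtypes = supertypes.get(first);
  if (second !== undefined) {
    // supertype/subtype form: the node must be exactly `second`,
    // and `second` must be listed under the supertype `first`.
    return subtypes !== undefined && subtypes.includes(second) && nodeType === second;
  }
  // Plain form: a supertype matches any of its subtypes,
  // any other name matches only itself.
  return subtypes !== undefined ? subtypes.includes(nodeType) : nodeType === first;
}

console.log(headMatches("expression", "identifier"));
console.log(headMatches("expression/binary_expression", "identifier"));
```

In this model the first call reports a match, because `identifier` is listed under `expression`, and the second does not, because the `supertype/subtype` form narrows the match to `binary_expression`.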