From 3a911d578c91edbd78edbcadd689a9ddde871b91 Mon Sep 17 00:00:00 2001 From: Amaan Qureshi Date: Sun, 14 Sep 2025 05:40:19 -0400 Subject: [PATCH] docs: add more information on supertype nodes for grammars and queries --- .../src/creating-parsers/2-the-grammar-dsl.md | 11 ++- .../creating-parsers/3-writing-the-grammar.md | 76 +++++++++++++++---- docs/src/using-parsers/queries/1-syntax.md | 20 +++++ 3 files changed, 89 insertions(+), 18 deletions(-) diff --git a/docs/src/creating-parsers/2-the-grammar-dsl.md b/docs/src/creating-parsers/2-the-grammar-dsl.md index 24495b15..d210619b 100644 --- a/docs/src/creating-parsers/2-the-grammar-dsl.md +++ b/docs/src/creating-parsers/2-the-grammar-dsl.md @@ -129,8 +129,11 @@ than globally. Can only be used with parse precedence, not lexical precedence. - **`word`** — the name of a token that will match keywords to the [keyword extraction][keyword-extraction] optimization. -- **`supertypes`** — an array of hidden rule names which should be considered to be 'supertypes' in the generated -[*node types* file][static-node-types]. +- **`supertypes`** — an array of rule names which should be considered to be 'supertypes' in the generated +[*node types* file][static-node-types-supertypes]. Supertype rules are automatically hidden from the parse tree, regardless +of whether their names start with an underscore. The main use case for supertypes is to group together multiple different +kinds of nodes under a single abstract category, such as "expression" or "declaration". See the section on [`using supertypes`][supertypes] +for more details. - **`reserved`** — similar in structure to the main `rules` property, an object of reserved word sets associated with an array of reserved rules. The reserved rule in the array must be a terminal token meaning it must be a string, regex, token, @@ -144,11 +147,13 @@ empty array, signifying *no* keywords are reserved. 
[bison-dprec]: https://www.gnu.org/software/bison/manual/html_node/Generalized-LR-Parsing.html [ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form [external-scanners]: ./4-external-scanners.md -[extras]: ./3-writing-the-grammar.html#using-extras +[extras]: ./3-writing-the-grammar.md#using-extras [keyword-extraction]: ./3-writing-the-grammar.md#keyword-extraction [lexical vs parse]: ./3-writing-the-grammar.md#lexical-precedence-vs-parse-precedence [lr-conflict]: https://en.wikipedia.org/wiki/LR_parser#Conflicts_in_the_constructed_tables [named-vs-anonymous-nodes]: ../using-parsers/2-basic-parsing.md#named-vs-anonymous-nodes [rust regex]: https://docs.rs/regex/1.1.8/regex/#grouping-and-flags [static-node-types]: ../using-parsers/6-static-node-types.md +[static-node-types-supertypes]: ../using-parsers/6-static-node-types.md#supertype-nodes +[supertypes]: ./3-writing-the-grammar.md#using-supertypes [yacc-prec]: https://docs.oracle.com/cd/E19504-01/802-5880/6i9k05dh3/index.html diff --git a/docs/src/creating-parsers/3-writing-the-grammar.md b/docs/src/creating-parsers/3-writing-the-grammar.md index 3c7c40c5..0052198b 100644 --- a/docs/src/creating-parsers/3-writing-the-grammar.md +++ b/docs/src/creating-parsers/3-writing-the-grammar.md @@ -74,11 +74,11 @@ you might start with something like this: return_statement: $ => seq( 'return', - $._expression, + $.expression, ';' ), - _expression: $ => choice( + expression: $ => choice( $.identifier, $.number // TODO: other kinds of expressions @@ -202,7 +202,7 @@ To produce a readable syntax tree, we'd like to model JavaScript expressions usi { // ... - _expression: $ => choice( + expression: $ => choice( $.identifier, $.unary_expression, $.binary_expression, @@ -210,14 +210,14 @@ To produce a readable syntax tree, we'd like to model JavaScript expressions usi ), unary_expression: $ => choice( - seq('-', $._expression), - seq('!', $._expression), + seq('-', $.expression), + seq('!', $.expression), // ... 
), binary_expression: $ => choice( - seq($._expression, '*', $._expression), - seq($._expression, '+', $._expression), + seq($.expression, '*', $.expression), + seq($.expression, '+', $.expression), // ... ), } @@ -252,7 +252,7 @@ ambiguity. For an expression like `-a * b`, it's not clear whether the `-` operator applies to the `a * b` or just to the `a`. This is where the `prec` function [described in the previous page][grammar dsl] comes into play. By wrapping a rule with `prec`, we can indicate that certain sequence of symbols should _bind to each other more tightly_ than others. For example, the -`'-', $._expression` sequence in `unary_expression` should bind more tightly than the `$._expression, '+', $._expression` +`'-', $.expression` sequence in `unary_expression` should bind more tightly than the `$.expression, '+', $.expression` sequence in `binary_expression`: ```js @@ -263,8 +263,8 @@ sequence in `binary_expression`: prec( 2, choice( - seq("-", $._expression), - seq("!", $._expression), + seq("-", $.expression), + seq("!", $.expression), // ... ), ); @@ -299,8 +299,8 @@ This is where `prec.left` and `prec.right` come into use. We want to select the // ... binary_expression: $ => choice( - prec.left(2, seq($._expression, '*', $._expression)), - prec.left(1, seq($._expression, '+', $._expression)), + prec.left(2, seq($.expression, '*', $.expression)), + prec.left(1, seq($.expression, '+', $.expression)), // ... ), } @@ -476,6 +476,51 @@ typically in ways that don't affect the meaning of the pattern. For example, `\w to `[ \t\n\r]`, and `\d` to `[0-9]`. If you need more complex behavior, you can always use a more explicit regex. ``` +## Using Supertypes + +Some rules in your grammar will represent abstract categories of syntax nodes, such as "expression", "type", or "declaration". +These rules are often defined as simple choices between several other rules. 
For example, in the JavaScript grammar, the +`expression` rule is defined as a choice between many different kinds of expressions: + +```js +expression: $ => choice( + $.identifier, + $.unary_expression, + $.binary_expression, + $.call_expression, + $.member_expression, + // ... +), +``` + +By default, Tree-sitter will generate a visible node type for each of these abstract category rules, which can lead to +unnecessarily deep and complex syntax trees. To avoid this, you can add these abstract category rules to the grammar's `supertypes` +definition. Tree-sitter will then treat these rules as _supertypes_, and will not generate visible node types for them in +the syntax tree. + +```js +module.exports = grammar({ + name: "javascript", + + supertypes: $ => [ + $.expression, + ], + + rules: { + expression: $ => choice( + $.identifier, + // ... + ), + + // ... + }, +}); + +``` + +Although supertype rules are hidden from the syntax tree, they can still be used in queries. See the chapter on +[Query Syntax][query syntax] for more information. + # Lexical Analysis Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and [lexing][lexing] — the @@ -554,7 +599,7 @@ grammar({ word: $ => $.identifier, rules: { - _expression: $ => + expression: $ => choice( $.identifier, $.unary_expression, @@ -564,13 +609,13 @@ grammar({ binary_expression: $ => choice( - prec.left(1, seq($._expression, "instanceof", $._expression)), + prec.left(1, seq($.expression, "instanceof", $.expression)), // ... ), unary_expression: $ => choice( - prec.left(2, seq("typeof", $._expression)), + prec.left(2, seq("typeof", $.expression)), // ...
), @@ -607,5 +652,6 @@ rule that's called something else, you should just alias the word token instead, [field-names-section]: ../using-parsers/2-basic-parsing.md#node-field-names [non-terminal]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols [peg]: https://en.wikipedia.org/wiki/Parsing_expression_grammar +[query syntax]: ../using-parsers/queries/1-syntax.md#supertype-nodes [tree-sitter-javascript]: https://github.com/tree-sitter/tree-sitter-javascript [yacc]: https://en.wikipedia.org/wiki/Yacc diff --git a/docs/src/using-parsers/queries/1-syntax.md b/docs/src/using-parsers/queries/1-syntax.md index 5edd0047..a12cec70 100644 --- a/docs/src/using-parsers/queries/1-syntax.md +++ b/docs/src/using-parsers/queries/1-syntax.md @@ -96,6 +96,26 @@ by `(ERROR)` queries. Specific missing node types can also be queried: (MISSING ";") @missing-semicolon ``` +### Supertype Nodes + +Some node types are marked as _supertypes_ in a grammar. A supertype is a node type that contains multiple +subtypes. For example, in the [JavaScript grammar example][grammar], `expression` is a supertype that can represent any kind +of expression, such as a `binary_expression`, `call_expression`, or `identifier`. You can use supertypes in queries to match +any of their subtypes, rather than having to list out each subtype individually. For example, this pattern would match any +kind of expression, even though it's not a visible node in the syntax tree: + +```query +(expression) @any-expression +``` + +To query specific subtypes of a supertype, you can use the syntax `supertype/subtype`. 
For example, this pattern would +match a `binary_expression` only where it occurs as a subtype of `expression`: + +```query +(expression/binary_expression) @binary-expression +``` + +[grammar]: ../../creating-parsers/3-writing-the-grammar.md#structuring-rules-well [node-field-names]: ../2-basic-parsing.md#node-field-names [named-vs-anonymous-nodes]: ../2-basic-parsing.md#named-vs-anonymous-nodes [s-exp]: https://en.wikipedia.org/wiki/S-expression