docs: explain extras in a bit more detail

Amaan Qureshi 2025-09-14 05:32:26 -04:00 committed by Amaan Qureshi
parent ac39aed7c5
commit 63f48afaeb
2 changed files with 83 additions and 1 deletion


@@ -107,7 +107,7 @@ grammar rules themselves. These fields are:
- **`extras`** — an array of tokens that may appear *anywhere* in the language. This is often used for whitespace and
comments. The default value of `extras` is to accept whitespace. To control whitespace explicitly, specify
- `extras: $ => []` in your grammar.
+ `extras: $ => []` in your grammar. See the section on [using extras][extras] for more details.
- **`inline`** — an array of rule names that should be automatically *removed* from the grammar by replacing all of their
usages with a copy of their definition. This is useful for rules that are used in multiple places but for which you *don't*
@@ -144,6 +144,7 @@ empty array, signifying *no* keywords are reserved.
[bison-dprec]: https://www.gnu.org/software/bison/manual/html_node/Generalized-LR-Parsing.html
[ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
[external-scanners]: ./4-external-scanners.md
[extras]: ./3-writing-the-grammar.html#using-extras
[keyword-extraction]: ./3-writing-the-grammar.md#keyword-extraction
[lexical vs parse]: ./3-writing-the-grammar.md#lexical-precedence-vs-parse-precedence
[lr-conflict]: https://en.wikipedia.org/wiki/LR_parser#Conflicts_in_the_constructed_tables


@@ -395,6 +395,87 @@ function_definition: $ =>
Adding fields like this allows you to retrieve nodes using the [field APIs][field-names-section].
## Using Extras
Extras are tokens that can appear anywhere in the grammar, without being explicitly mentioned in a rule. This is useful
for things like whitespace and comments, which can appear between any two tokens in most programming languages. To define
an extra, add it to the `extras` field:
```js
module.exports = grammar({
  name: "my_language",
  extras: ($) => [
    /\s/, // whitespace
    $.comment,
  ],
  rules: {
    comment: ($) =>
      token(
        choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
      ),
  },
});
```
```admonish warning
When adding more complicated tokens to `extras`, it's preferable to associate the pattern
with a named rule. This way, you avoid the lexer inlining the pattern in many places,
which can dramatically reduce the parser size.
```
For example, instead of defining the `comment` token inline in `extras`:
```js
// ❌ Less preferable
const comment = token(
  choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
);

module.exports = grammar({
  name: "my_language",
  extras: ($) => [
    /\s/, // whitespace
    comment,
  ],
  rules: {
    // ...
  },
});
```
We can define it as a rule and then reference it in `extras`:
```js
// ✅ More preferable
module.exports = grammar({
  name: "my_language",
  extras: ($) => [
    /\s/, // whitespace
    $.comment,
  ],
  rules: {
    // ...
    comment: ($) =>
      token(
        choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
      ),
  },
});
```
```admonish note
Tree-sitter intentionally simplifies some common regex patterns, both as a performance optimization and for simplicity,
typically in ways that don't affect the meaning of the pattern. For example, `\w` is simplified to `[a-zA-Z0-9_]`, `\s`
to `[ \t\n\r]`, and `\d` to `[0-9]`. If you need more complex behavior, you can always use a more explicit regex.
```
# Lexical Analysis
Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and [lexing][lexing] — the