docs: explain extras in a bit more detail

Amaan Qureshi 2025-09-14 05:32:26 -04:00 committed by Amaan Qureshi
parent ac39aed7c5
commit 63f48afaeb
2 changed files with 83 additions and 1 deletion


@@ -107,7 +107,7 @@ grammar rules themselves. These fields are:
- **`extras`** — an array of tokens that may appear *anywhere* in the language. This is often used for whitespace and
comments. The default value of `extras` is to accept whitespace. To control whitespace explicitly, specify
- `extras: $ => []` in your grammar.
+ `extras: $ => []` in your grammar. See the section on [using extras][extras] for more details.
- **`inline`** — an array of rule names that should be automatically *removed* from the grammar by replacing all of their
usages with a copy of their definition. This is useful for rules that are used in multiple places but for which you *don't*
@@ -144,6 +144,7 @@ empty array, signifying *no* keywords are reserved.
[bison-dprec]: https://www.gnu.org/software/bison/manual/html_node/Generalized-LR-Parsing.html
[ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
[external-scanners]: ./4-external-scanners.md
[extras]: ./3-writing-the-grammar.html#using-extras
[keyword-extraction]: ./3-writing-the-grammar.md#keyword-extraction
[lexical vs parse]: ./3-writing-the-grammar.md#lexical-precedence-vs-parse-precedence
[lr-conflict]: https://en.wikipedia.org/wiki/LR_parser#Conflicts_in_the_constructed_tables


@@ -395,6 +395,87 @@ function_definition: $ =>
Adding fields like this allows you to retrieve nodes using the [field APIs][field-names-section].
## Using Extras
Extras are tokens that can appear anywhere in the grammar, without being explicitly mentioned in a rule. This is useful
for things like whitespace and comments, which can appear between any two tokens in most programming languages. To define
an extra, add it to the `extras` field:
```js
module.exports = grammar({
  name: "my_language",
  extras: ($) => [
    /\s/, // whitespace
    $.comment,
  ],
  rules: {
    comment: ($) =>
      token(
        choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
      ),
  },
});
```
```admonish warning
When adding more complicated tokens to `extras`, it's preferable to associate the pattern
with a named rule. This way, you avoid the lexer inlining the pattern in many places,
which can dramatically reduce the parser size.
```
For example, instead of defining the `comment` token inline in `extras`:
```js
// ❌ Less preferable
const comment = token(
  choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
);

module.exports = grammar({
  name: "my_language",
  extras: ($) => [
    /\s/, // whitespace
    comment,
  ],
  rules: {
    // ...
  },
});
```
We can define it as a rule and then reference it in `extras`:
```js
// ✅ More preferable
module.exports = grammar({
  name: "my_language",
  extras: ($) => [
    /\s/, // whitespace
    $.comment,
  ],
  rules: {
    // ...
    comment: ($) =>
      token(
        choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
      ),
  },
});
```
```admonish note
Tree-sitter intentionally simplifies some common regex patterns, both as a performance optimization and for simplicity,
typically in ways that don't affect the meaning of the pattern. For example, `\w` is simplified to `[a-zA-Z0-9_]`, `\s`
to `[ \t\n\r]`, and `\d` to `[0-9]`. If you need more complex behavior, you can always use a more explicit regex.
```
# Lexical Analysis
Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and [lexing][lexing] — the