diff --git a/docs/src/creating-parsers/2-the-grammar-dsl.md b/docs/src/creating-parsers/2-the-grammar-dsl.md
index 55c59f68..24495b15 100644
--- a/docs/src/creating-parsers/2-the-grammar-dsl.md
+++ b/docs/src/creating-parsers/2-the-grammar-dsl.md
@@ -107,7 +107,7 @@ grammar rules themselves. These fields are:
 
 - **`extras`** — an array of tokens that may appear *anywhere* in the language. This is often used for whitespace and
 comments. The default value of `extras` is to accept whitespace. To control whitespace explicitly, specify
-`extras: $ => []` in your grammar.
+`extras: $ => []` in your grammar. See the section on [using extras][extras] for more details.
 
 - **`inline`** — an array of rule names that should be automatically *removed* from the grammar by replacing all of their
 usages with a copy of their definition. This is useful for rules that are used in multiple places but for which you *don't*
@@ -144,6 +144,7 @@ empty array, signifying *no* keywords are reserved.
 [bison-dprec]: https://www.gnu.org/software/bison/manual/html_node/Generalized-LR-Parsing.html
 [ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
 [external-scanners]: ./4-external-scanners.md
+[extras]: ./3-writing-the-grammar.html#using-extras
 [keyword-extraction]: ./3-writing-the-grammar.md#keyword-extraction
 [lexical vs parse]: ./3-writing-the-grammar.md#lexical-precedence-vs-parse-precedence
 [lr-conflict]: https://en.wikipedia.org/wiki/LR_parser#Conflicts_in_the_constructed_tables
diff --git a/docs/src/creating-parsers/3-writing-the-grammar.md b/docs/src/creating-parsers/3-writing-the-grammar.md
index 047e6e92..3c7c40c5 100644
--- a/docs/src/creating-parsers/3-writing-the-grammar.md
+++ b/docs/src/creating-parsers/3-writing-the-grammar.md
@@ -395,6 +395,87 @@ function_definition: $ =>
 
 Adding fields like this allows you to retrieve nodes using the [field APIs][field-names-section].
 
+## Using Extras
+
+Extras are tokens that can appear anywhere in the grammar, without being explicitly mentioned in a rule. This is useful
+for things like whitespace and comments, which can appear between any two tokens in most programming languages. To define
+an extra, you can use the `extras` function:
+
+```js
+module.exports = grammar({
+  name: "my_language",
+
+  extras: ($) => [
+    /\s/, // whitespace
+    $.comment,
+  ],
+
+  rules: {
+    comment: ($) =>
+      token(
+        choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
+      ),
+  },
+});
+```
+
+```admonish warning
+When adding more complicated tokens to `extras`, it's preferable to associate the pattern
+with a rule. This way, you avoid the lexer inlining this pattern in a bunch of spots,
+which can dramatically reduce the parser size.
+```
+
+For example, instead of defining the `comment` token inline in `extras`:
+
+```js
+// ❌ Less preferable
+
+const comment = token(
+  choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
+);
+
+module.exports = grammar({
+  name: "my_language",
+  extras: ($) => [
+    /\s/, // whitespace
+    comment,
+  ],
+  rules: {
+    // ...
+  },
+});
+```
+
+We can define it as a rule and then reference it in `extras`:
+
+```js
+// ✅ More preferable
+
+module.exports = grammar({
+  name: "my_language",
+
+  extras: ($) => [
+    /\s/, // whitespace
+    $.comment,
+  ],
+
+  rules: {
+    // ...
+
+    comment: ($) =>
+      token(
+        choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
+      ),
+  },
+});
+```
+
+```admonish note
+Tree-sitter intentionally simplifies some common regex patterns, both as a performance optimization and for simplicity,
+typically in ways that don't affect the meaning of the pattern. For example, `\w` is simplified to `[a-zA-Z0-9_]`, `\s`
+to `[ \t\n\r]`, and `\d` to `[0-9]`. If you need more complex behavior, you can always use a more explicit regex.
+```
+
 # Lexical Analysis
 
 Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and [lexing][lexing] — the