Merge remote-tracking branch 'upstream/master' into fix/wasm32-malloc

Trim21 2026-01-16 01:06:59 +08:00
commit 5fbb1b1ebd
30 changed files with 146 additions and 118 deletions

@ -7,7 +7,8 @@
[npmjs.com]: https://www.npmjs.org/package/tree-sitter-cli
[npmjs.com badge]: https://img.shields.io/npm/v/tree-sitter-cli.svg?color=%23BF4A4A
The Tree-sitter CLI allows you to develop, test, and use Tree-sitter grammars from the command line. It works on `MacOS`, `Linux`, and `Windows`.
The Tree-sitter CLI allows you to develop, test, and use Tree-sitter grammars from the command line. It works on `MacOS`,
`Linux`, and `Windows`.
### Installation
@ -34,9 +35,11 @@ The `tree-sitter` binary itself has no dependencies, but specific commands have
### Commands
* `generate` - The `tree-sitter generate` command will generate a Tree-sitter parser based on the grammar in the current working directory. See [the documentation] for more information.
* `generate` - The `tree-sitter generate` command will generate a Tree-sitter parser based on the grammar in the current
working directory. See [the documentation] for more information.
* `test` - The `tree-sitter test` command will run the unit tests for the Tree-sitter parser in the current working directory. See [the documentation] for more information.
* `test` - The `tree-sitter test` command will run the unit tests for the Tree-sitter parser in the current working directory.
See [the documentation] for more information.
* `parse` - The `tree-sitter parse` command will parse a file (or list of files) using Tree-sitter parsers.

@ -14,7 +14,6 @@ extern void tree_sitter_debug_message(const char *, size_t);
#define PAGESIZE 0x10000
#define MAX_HEAP_SIZE (4 * 1024 * 1024)
#define MIN(a, b) ((a) < (b) ? (a) : (b))
typedef struct {
size_t size;
@ -151,7 +150,7 @@ void *realloc(void *ptr, size_t new_size) {
return NULL;
}
size_t copy_size = MIN(region->size, new_size);
size_t copy_size = (region->size < new_size) ? region->size : new_size;
memcpy(result, &region->data, copy_size);
free(ptr);
return result;
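
Both the removed `MIN` macro and the inline conditional that replaces it enforce the same bound: `realloc` copies no more than the smaller of the old region size and the requested size. The following is a minimal standalone sketch of that bound. The names `bounded_copy_size` and `resize_buffer` are hypothetical, and the sketch uses libc `malloc`/`free` rather than the Wasm stdlib's own heap and region headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not in the source): number of bytes that may be
 * copied when a region of old_size bytes is resized to new_size bytes. */
static size_t bounded_copy_size(size_t old_size, size_t new_size) {
  return (old_size < new_size) ? old_size : new_size;
}

/* Illustrative resize built on libc malloc/free. The stdlib in the diff
 * manages its own heap; this only demonstrates the copy bound. */
static void *resize_buffer(void *old, size_t old_size, size_t new_size) {
  assert(new_size > 0);
  void *result = malloc(new_size);
  if (result == NULL) {
    return NULL; /* the old buffer is left untouched on failure */
  }
  if (old != NULL) {
    /* Never read past the old region, never write past the new one. */
    memcpy(result, old, bounded_copy_size(old_size, new_size));
    free(old);
  }
  return result;
}
```

Growing a buffer copies all of the old bytes; shrinking it copies only the bytes that still fit.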

@ -73,9 +73,8 @@ The behaviors of these three files are described in the next section.
## Queries
Tree-sitter's syntax highlighting system is based on *tree queries*, which are a general system for pattern-matching on Tree-sitter's
syntax trees. See [this section][pattern matching] of the documentation for more information
about tree queries.
Tree-sitter's syntax highlighting system is based on *tree queries*, which are a general system for pattern-matching on
Tree-sitter's syntax trees. See [this section][pattern matching] of the documentation for more information about tree queries.
Syntax highlighting is controlled by *three* different types of query files that are usually included in the `queries` folder.
The default names for the query files use the `.scm` file extension. We chose this extension because it is commonly used for files written

@ -3,7 +3,8 @@
Tree-sitter can be used in conjunction with its [query language][query language] as a part of code navigation systems.
An example of such a system can be seen in the `tree-sitter tags` command, which emits a textual dump of the interesting
syntactic nodes in its file argument. A notable application of this is GitHub's support for [search-based code navigation][gh search].
This document exists to describe how to integrate with such systems, and how to extend this functionality to any language with a Tree-sitter grammar.
This document exists to describe how to integrate with such systems, and how to extend this functionality to any language
with a Tree-sitter grammar.
## Tagging and captures
@ -12,9 +13,9 @@ entities. Having found them, you use a syntax capture to label the entity and it
The essence of a given tag lies in two pieces of data: the _role_ of the entity that is matched
(i.e. whether it is a definition or a reference) and the _kind_ of that entity, which describes how the entity is used
(i.e. whether it's a class definition, function call, variable reference, and so on). Our convention is to use a syntax capture
following the `@role.kind` capture name format, and another inner capture, always called `@name`, that pulls out the name
of a given identifier.
(i.e. whether it's a class definition, function call, variable reference, and so on). Our convention is to use a syntax
capture following the `@role.kind` capture name format, and another inner capture, always called `@name`, that pulls out
the name of a given identifier.
You may optionally include a capture named `@doc` to bind a docstring. For convenience, the tagging system provides
two built-in functions, `#select-adjacent!` and `#strip!`, that help remove comment syntax from a docstring.

@ -93,7 +93,8 @@ cargo xtask build-wasm-stdlib
This command looks for the [Wasi SDK][wasi_sdk] indicated by the `TREE_SITTER_WASI_SDK_PATH`
environment variable. If you don't have the binary, it can be downloaded from wasi-sdk's [releases][wasi-sdk-releases]
page.
page. Note that any changes to `crates/language/wasm/**` require rebuilding the tree-sitter Wasm stdlib via
`cargo xtask build-wasm-stdlib`.
### Debugging

@ -19,8 +19,8 @@ will attempt to build the parser in the current working directory.
### `-w/--wasm`
Compile the parser as a Wasm module. This command looks for the [Wasi SDK][wasi_sdk] indicated by the `TREE_SITTER_WASI_SDK_PATH`
environment variable. If you don't have the binary, the CLI will attempt to download it for you to `<CACHE_DIR>/tree-sitter/wasi-sdk/`, where
`<CACHE_DIR>` is resolved according to the [XDG base directory][XDG] or Windows' [Known_Folder_Locations][Known_Folder].
environment variable. If you don't have the binary, the CLI will attempt to download it for you to `<CACHE_DIR>/tree-sitter/wasi-sdk/`,
where `<CACHE_DIR>` is resolved according to the [XDG base directory][XDG] or Windows' [Known_Folder_Locations][Known_Folder].
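
As a rough illustration of the XDG half of that rule, the sketch below builds the download path from `$XDG_CACHE_HOME`, falling back to `$HOME/.cache` as the XDG Base Directory specification describes. The helper name `wasi_sdk_cache_dir` is hypothetical. The sketch is not the CLI's actual implementation, does not cover the Windows Known Folder case, and does not reject relative `XDG_CACHE_HOME` values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: writes "<CACHE_DIR>/tree-sitter/wasi-sdk/" into out.
 * <CACHE_DIR> is $XDG_CACHE_HOME if set and non-empty, otherwise
 * $HOME/.cache. Returns 0 on success, -1 if neither variable is usable
 * or the buffer is too small. */
static int wasi_sdk_cache_dir(char *out, size_t out_len) {
  assert(out != NULL && out_len > 0);
  const char *xdg = getenv("XDG_CACHE_HOME");
  const char *home = getenv("HOME");
  int written;
  if (xdg != NULL && xdg[0] != '\0') {
    written = snprintf(out, out_len, "%s/tree-sitter/wasi-sdk/", xdg);
  } else if (home != NULL && home[0] != '\0') {
    written = snprintf(out, out_len, "%s/.cache/tree-sitter/wasi-sdk/", home);
  } else {
    return -1;
  }
  /* snprintf reports the length it wanted; treat truncation as failure. */
  return (written < 0 || (size_t)written >= out_len) ? -1 : 0;
}
```

Setting `TREE_SITTER_WASI_SDK_PATH` explicitly, as described above, sidesteps this lookup entirely.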
### `-o/--output`
@ -37,7 +37,8 @@ in the external scanner does so using their allocator.
### `-0/--debug`
Compile the parser with debug flags enabled. This is useful when debugging issues that require a debugger like `gdb` or `lldb`.
Compile the parser with debug flags enabled. This is useful when debugging issues that require a debugger like `gdb` or
`lldb`.
[Known_Folder]: https://learn.microsoft.com/en-us/windows/win32/shell/knownfolderid
[wasi_sdk]: https://github.com/WebAssembly/wasi-sdk

@ -1,6 +1,8 @@
# `tree-sitter dump-languages`
The `dump-languages` command prints out a list of all the languages that the CLI knows about. This can be useful for debugging purposes, or for scripting. The paths to search come from the config file's [`parser-directories`][parser-directories] object.
The `dump-languages` command prints out a list of all the languages that the CLI knows about. This can be useful for debugging
purposes, or for scripting. The paths to search come from the config file's [`parser-directories`][parser-directories]
object.
```bash
tree-sitter dump-languages [OPTIONS] # Aliases: langs
@ -10,6 +12,7 @@ tree-sitter dump-languages [OPTIONS] # Aliases: langs
### `--config-path`
The path to the configuration file. Ordinarily, the CLI will use the default location as explained in the [init-config](./init-config.md) command. This flag allows you to explicitly override that default, and use a config defined elsewhere.
The path to the configuration file. Ordinarily, the CLI will use the default location as explained in the [init-config](./init-config.md)
command. This flag allows you to explicitly override that default, and use a config defined elsewhere.
[parser-directories]: ./init-config.md#parser-directories

@ -1,6 +1,7 @@
# `tree-sitter generate`
The most important command for grammar development is `tree-sitter generate`, which reads the grammar in structured form and outputs C files that can be compiled into a shared or static library (e.g., using the [`build`](./build.md) command).
The most important command for grammar development is `tree-sitter generate`, which reads the grammar in structured form
and outputs C files that can be compiled into a shared or static library (e.g., using the [`build`](./build.md) command).
```bash
tree-sitter generate [OPTIONS] [GRAMMAR_PATH] # Aliases: gen, g
@ -8,7 +9,8 @@ tree-sitter generate [OPTIONS] [GRAMMAR_PATH] # Aliases: gen, g
The optional `GRAMMAR_PATH` argument should point to the structured grammar, in one of two forms:
- `grammar.js` an (ESM or CJS) JavaScript file; if the argument is omitted, it defaults to `./grammar.js`.
- `grammar.json` a structured representation of the grammar that is created as a byproduct of `generate`; this can be used to regenerate a missing `parser.c` without requiring a JavaScript runtime (useful when distributing parsers to consumers).
- `grammar.json` a structured representation of the grammar that is created as a byproduct of `generate`; this can be used
to regenerate a missing `parser.c` without requiring a JavaScript runtime (useful when distributing parsers to consumers).
If there is an ambiguity or *local ambiguity* in your grammar, Tree-sitter will detect it during parser generation, and
it will exit with an `Unresolved conflict` error message. To learn more about conflicts and how to handle them, see
@ -21,7 +23,8 @@ in the user guide.
- `src/tree_sitter/parser.h` provides basic C definitions that are used in the generated `parser.c` file.
- `src/tree_sitter/alloc.h` provides memory allocation macros that can be used in an external scanner.
- `src/tree_sitter/array.h` provides array macros that can be used in an external scanner.
- `src/grammar.json` contains a structured representation of the grammar; can be used to regenerate the parser without having to re-evaluate the `grammar.js`.
- `src/grammar.json` contains a structured representation of the grammar; can be used to regenerate the parser without having
to re-evaluate the `grammar.js`.
- `src/node-types.json` provides type information about individual syntax nodes; see the section on [`Static Node Types`](../using-parsers/6-static-node-types.md).
@ -29,8 +32,8 @@ in the user guide.
### `-l/--log`
Print the log of the parser generation process. This includes information such as what tokens are included in the error recovery state,
what keywords were extracted, what states were split and why, and the entry point state.
Print the log of the parser generation process. This includes information such as what tokens are included in the error
recovery state, what keywords were extracted, what states were split and why, and the entry point state.
### `--abi <VERSION>`
@ -60,7 +63,8 @@ The path to the JavaScript runtime executable to use when generating the parser.
Note that you can also set this with `TREE_SITTER_JS_RUNTIME`. Starting from version 0.26, you can
also pass in `native` to use the experimental native QuickJS runtime that comes bundled with the CLI.
This avoids the dependency on a JavaScript runtime entirely. The native QuickJS runtime is compatible
with ESM as well as with CommonJS in strict mode. If your grammar depends on `npm` to install dependencies such as base grammars, the native runtime can be used *after* running `npm install`.
with ESM as well as with CommonJS in strict mode. If your grammar depends on `npm` to install dependencies such as base
grammars, the native runtime can be used *after* running `npm install`.
### `--disable-optimization`

@ -52,7 +52,8 @@ The path to the directory containing the grammar.
### `--config-path <CONFIG_PATH>`
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more information.
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more
information.
### `-n/--test-number <TEST_NUMBER>`

@ -1,6 +1,7 @@
# CLI Overview
The `tree-sitter` command-line interface is used to create, manage, test, and build tree-sitter parsers. It is controlled by
The `tree-sitter` command-line interface is used to create, manage, test, and build tree-sitter parsers. It is controlled
by
- a personal `tree-sitter/config.json` config file generated by [`tree-sitter init-config`](./init-config.md)
- a parser `tree-sitter.json` config file generated by [`tree-sitter init`](./init.md).

@ -14,8 +14,11 @@ tree-sitter init [OPTIONS] # Aliases: i
The following required files are always created if missing:
- `tree-sitter.json` - The main configuration file that determines how `tree-sitter` interacts with the grammar. If missing, the `init` command will prompt the user for the required fields. See [below](./init.md#structure-of-tree-sitterjson) for the full documentation of the structure of this file.
- `package.json` - The `npm` manifest for the parser. This file is required for some `tree-sitter` subcommands, and if the grammar has dependencies (e.g., another published base grammar that this grammar extends).
- `tree-sitter.json` - The main configuration file that determines how `tree-sitter` interacts with the grammar. If missing,
the `init` command will prompt the user for the required fields. See [below](./init.md#structure-of-tree-sitterjson) for
the full documentation of the structure of this file.
- `package.json` - The `npm` manifest for the parser. This file is required for some `tree-sitter` subcommands, and if the
grammar has dependencies (e.g., another published base grammar that this grammar extends).
- `grammar.js` - An empty template for the main grammar file; see [the section on creating parsers](../2-creating-parser).
### Language bindings
@ -130,8 +133,8 @@ be picked up by the cli.
These keys help to decide whether the language applies to a given file:
- `file-types` — An array of filename suffix strings (not including the dot). The grammar will be used for files whose names end with one of
these suffixes. Note that the suffix may match an *entire* filename.
- `file-types` — An array of filename suffix strings (not including the dot). The grammar will be used for files whose names
end with one of these suffixes. Note that the suffix may match an *entire* filename.
- `first-line-regex` — A regex pattern that will be tested against the first line of a file
to determine whether this language applies to the file. If present, this regex will be used for any file whose
@ -188,7 +191,8 @@ Each key is a language name, and the value is a boolean.
Update outdated generated files, if possible.
**Note:** Existing files that may have been edited manually are _not_ updated in general. To force an update to such files, remove them and call `tree-sitter init -u` again.
**Note:** Existing files that may have been edited manually are _not_ updated in general. To force an update to such files,
remove them and call `tree-sitter init -u` again.
### `-p/--grammar-path <PATH>`

@ -78,7 +78,8 @@ Suppress main output.
### `--edits <EDITS>...`
Apply edits after parsing the file. Edits are in the form of `row,col|position delcount insert_text` where row and col, or position are 0-indexed.
Apply edits after parsing the file. Edits are in the form of `row,col|position delcount insert_text` where row and col,
or position are 0-indexed.
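
To make the `position delcount insert_text` form concrete, here is a minimal sketch of what such an edit does to a source buffer. It assumes the edit removes `delcount` bytes at the 0-indexed byte offset `position` and then inserts `insert_text` there. The helper `apply_edit` and its clamping of out-of-range values are hypothetical and are not taken from the CLI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of text in which
 * delcount bytes starting at byte offset position (0-indexed) are replaced
 * by insert_text. The caller frees the result. */
static char *apply_edit(const char *text, size_t position, size_t delcount,
                        const char *insert_text) {
  assert(text != NULL && insert_text != NULL);
  size_t len = strlen(text);
  /* Assumption: out-of-range values are clamped to the end of the text. */
  if (position > len) position = len;
  if (delcount > len - position) delcount = len - position;
  size_t ins_len = strlen(insert_text);
  char *result = malloc(len - delcount + ins_len + 1);
  if (result == NULL) return NULL;
  memcpy(result, text, position);                  /* unchanged prefix */
  memcpy(result + position, insert_text, ins_len); /* inserted text */
  memcpy(result + position + ins_len,              /* suffix plus NUL */
         text + position + delcount, len - position - delcount + 1);
  return result;
}
```

Under that assumption, the edit `4 1 value` applied to `let x = 1;` yields `let value = 1;`.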
### `--encoding <ENCODING>`
@ -95,7 +96,8 @@ Output parsing results in a JSON format.
### `--config-path <CONFIG_PATH>`
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more information.
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more
information.
### `-n/--test-number <TEST_NUMBER>`

@ -7,8 +7,8 @@ tree-sitter playground [OPTIONS] # Aliases: play, pg, web-ui
```
```admonish note
For this to work, you must have already built the parser as a Wasm module. This can be done with the [`build`](./build.md) subcommand
(`tree-sitter build --wasm`).
For this to work, you must have already built the parser as a Wasm module. This can be done with the [`build`](./build.md)
subcommand (`tree-sitter build --wasm`).
```
## Options

@ -47,8 +47,8 @@ The range of rows in which the query will be executed. The format is `start_row:
### `--containing-row-range <ROW_RANGE>`
The range of rows in which the query will be executed. Only the matches that are fully contained within the provided row range
will be returned.
The range of rows in which the query will be executed. Only the matches that are fully contained within the provided row
range will be returned.
### `--scope <SCOPE>`
@ -64,7 +64,8 @@ Whether to run query tests or not.
### `--config-path <CONFIG_PATH>`
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more information.
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more
information.
### `-n/--test-number <TEST_NUMBER>`

@ -31,7 +31,8 @@ The path to the directory containing the grammar.
### `--config-path <CONFIG_PATH>`
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more information.
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more
information.
### `-n/--test-number <TEST_NUMBER>`

@ -63,7 +63,8 @@ When using the `--debug-graph` option, open the log file in the default browser.
### `--config-path <CONFIG_PATH>`
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more information.
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more
information.
### `--show-fields`

@ -25,11 +25,9 @@ tree-sitter version --bump minor # minor bump
tree-sitter version --bump major # major bump
```
As a grammar author, you should keep the version of your grammar in sync across
different bindings. However, doing so manually is error-prone and tedious, so
this command takes care of the burden. If you are using a version control system,
it is recommended to commit the changes made by this command, and to tag the
commit with the new version.
As a grammar author, you should keep the version of your grammar in sync across different bindings. However, doing so manually
is error-prone and tedious, so this command takes care of the burden. If you are using a version control system, it is recommended
to commit the changes made by this command, and to tag the commit with the new version.
To print the current version without bumping it, use:

@ -17,8 +17,8 @@ DSL through the `RustRegex` class. Simply pass your regex pattern as a string:
```
Unlike JavaScript's builtin `RegExp` class, which takes a pattern and flags as separate arguments, `RustRegex` only
accepts a single pattern string. While it doesn't support separate flags, you can use inline flags within the pattern itself.
For more details about Rust's regex syntax and capabilities, check out the [Rust regex documentation][rust regex].
accepts a single pattern string. While it doesn't support separate flags, you can use inline flags within the pattern
itself. For more details about Rust's regex syntax and capabilities, check out the [Rust regex documentation][rust regex].
```admonish note
Only a subset of the Regex engine is actually supported. This is due to certain features like lookahead and lookaround
@ -50,10 +50,10 @@ The previous `repeat` rule is implemented in `repeat1` but is included because i
- **Options : `optional(rule)`** — This function creates a rule that matches *zero or one* occurrence of a given rule.
It is analogous to the `[x]` (square bracket) syntax in EBNF notation.
- **Precedence : `prec(number, rule)`** — This function marks the given rule with a numerical precedence, which will be used
to resolve [*LR(1) Conflicts*][lr-conflict] at parser-generation time. When two rules overlap in a way that represents either
a true ambiguity or a *local* ambiguity given one token of lookahead, Tree-sitter will try to resolve the conflict by matching
the rule with the higher precedence. The default precedence of all rules is zero. This works similarly to the
- **Precedence : `prec(number, rule)`** — This function marks the given rule with a numerical precedence, which will be
used to resolve [*LR(1) Conflicts*][lr-conflict] at parser-generation time. When two rules overlap in a way that represents
either a true ambiguity or a *local* ambiguity given one token of lookahead, Tree-sitter will try to resolve the conflict
by matching the rule with the higher precedence. The default precedence of all rules is zero. This works similarly to the
[precedence directives][yacc-prec] in Yacc grammars.
This function can also be used to assign lexical precedence to a given
@ -115,8 +115,8 @@ want to create syntax tree nodes at runtime.
- **`conflicts`** — an array of arrays of rule names. Each inner array represents a set of rules that's involved in an
*LR(1) conflict* that is *intended to exist* in the grammar. When these conflicts occur at runtime, Tree-sitter will use
the GLR algorithm to explore all the possible interpretations. If *multiple* parses end up succeeding, Tree-sitter will pick
the subtree whose corresponding rule has the highest total *dynamic precedence*.
the GLR algorithm to explore all the possible interpretations. If *multiple* parses end up succeeding, Tree-sitter will
pick the subtree whose corresponding rule has the highest total *dynamic precedence*.
- **`externals`** — an array of token names which can be returned by an
[*external scanner*][external-scanners]. External scanners allow you to write custom C code which runs during the lexing
@ -139,10 +139,10 @@ for more details.
array of reserved rules. The reserved rule in the array must be a terminal token meaning it must be a string, regex, token,
or terminal rule. The reserved rule must also exist and be used in the grammar, specifying arbitrary tokens will not work.
The *first* reserved word set in the object is the global word set, meaning it applies to every rule in every parse state.
However, certain keywords are contextual, depending on the rule. For example, in JavaScript, keywords are typically not allowed
as ordinary variables, however, they *can* be used as a property name. In this situation, the `reserved` function would be used,
and the word set to pass in would be the name of the word set that is declared in the `reserved` object that corresponds to an
empty array, signifying *no* keywords are reserved.
However, certain keywords are contextual, depending on the rule. For example, in JavaScript, keywords are typically not
allowed as ordinary variables, however, they *can* be used as a property name. In this situation, the `reserved` function
would be used, and the word set to pass in would be the name of the word set that is declared in the `reserved` object that
corresponds to an empty array, signifying *no* keywords are reserved.
[bison-dprec]: https://www.gnu.org/software/bison/manual/html_node/Generalized-LR-Parsing.html
[ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form

@ -1,7 +1,7 @@
# Writing the Grammar
Writing a grammar requires creativity. There are an infinite number of CFGs (context-free grammars) that can be used to describe
any given language. To produce a good Tree-sitter parser, you need to create a grammar with two important properties:
Writing a grammar requires creativity. There are an infinite number of CFGs (context-free grammars) that can be used to
describe any given language. To produce a good Tree-sitter parser, you need to create a grammar with two important properties:
1. **An intuitive structure** — Tree-sitter's output is a [concrete syntax tree][cst]; each node in the tree corresponds
directly to a [terminal or non-terminal symbol][non-terminal] in the grammar. So to produce an easy-to-analyze tree, there
@ -139,8 +139,8 @@ instead. It's often useful to check your progress by trying to parse some real c
## Structuring Rules Well
Imagine that you were just starting work on the [Tree-sitter JavaScript parser][tree-sitter-javascript]. Naively, you might
try to directly mirror the structure of the [ECMAScript Language Spec][ecmascript-spec]. To illustrate the problem with this
approach, consider the following line of code:
try to directly mirror the structure of the [ECMAScript Language Spec][ecmascript-spec]. To illustrate the problem with
this approach, consider the following line of code:
```js
return x + y;
@ -181,16 +181,17 @@ which are unrelated to the actual code.
## Standard Rule Names
Tree-sitter places no restrictions on how to name the rules of your grammar. It can be helpful, however, to follow certain conventions
used by many other established grammars in the ecosystem. Some of these well-established patterns are listed below:
Tree-sitter places no restrictions on how to name the rules of your grammar. It can be helpful, however, to follow certain
conventions used by many other established grammars in the ecosystem. Some of these well-established patterns are listed
below:
- `source_file`: Represents an entire source file; this rule is commonly used as the root node for a grammar.
- `expression`/`statement`: Used to represent statements and expressions for a given language. Commonly defined as a choice between several
more specific sub-expression/sub-statement rules.
- `expression`/`statement`: Used to represent statements and expressions for a given language. Commonly defined as a choice
between several more specific sub-expression/sub-statement rules.
- `block`: Used as the parent node for block scopes, with its children representing the block's contents.
- `type`: Represents the types of a language such as `int`, `char`, and `void`.
- `identifier`: Used for constructs like variable names, function arguments, and object fields; this rule is commonly used as the `word`
token in grammars.
- `identifier`: Used for constructs like variable names, function arguments, and object fields; this rule is commonly used
as the `word` token in grammars.
- `string`: Used to represent `"string literals"`.
- `comment`: Used to represent comments, this rule is commonly used as an `extra`.
@ -308,9 +309,9 @@ This is where `prec.left` and `prec.right` come into use. We want to select the
## Using Conflicts
Sometimes, conflicts are actually desirable. In our JavaScript grammar, expressions and patterns can create intentional ambiguity.
A construct like `[x, y]` could be legitimately parsed as both an array literal (like in `let a = [x, y]`) or as a destructuring
pattern (like in `let [x, y] = arr`).
Sometimes, conflicts are actually desirable. In our JavaScript grammar, expressions and patterns can create intentional
ambiguity. A construct like `[x, y]` could be legitimately parsed as both an array literal (like in `let a = [x, y]`) or
as a destructuring pattern (like in `let [x, y] = arr`).
```js
export default grammar({
@ -564,8 +565,8 @@ as mentioned in the previous page, is `token(prec(N, ...))`.
## Keywords
Many languages have a set of _keyword_ tokens (e.g. `if`, `for`, `return`), as well as a more general token (e.g. `identifier`)
that matches any word, including many of the keyword strings. For example, JavaScript has a keyword `instanceof`, which is
used as a binary operator, like this:
that matches any word, including many of the keyword strings. For example, JavaScript has a keyword `instanceof`, which
is used as a binary operator, like this:
```js
if (a instanceof Something) b();

@ -143,10 +143,10 @@ the second argument, the current character will be treated as whitespace; whites
associated with tokens emitted by the external scanner.
- **`void (*mark_end)(TSLexer *)`** — A function for marking the end of the recognized token. This allows matching tokens
that require multiple characters of lookahead. By default, (if you don't call `mark_end`), any character that you moved past
using the `advance` function will be included in the size of the token. But once you call `mark_end`, then any later calls
to `advance` will _not_ increase the size of the returned token. You can call `mark_end` multiple times to increase the size
of the token.
that require multiple characters of lookahead. By default, (if you don't call `mark_end`), any character that you moved
past using the `advance` function will be included in the size of the token. But once you call `mark_end`, then any later
calls to `advance` will _not_ increase the size of the returned token. You can call `mark_end` multiple times to increase
the size of the token.
- **`uint32_t (*get_column)(TSLexer *)`** — A function for querying the current column position of the lexer. It returns
the number of codepoints since the start of the current line. The codepoint position is recalculated on every call to this
@ -185,9 +185,9 @@ if (valid_symbols[INDENT] || valid_symbols[DEDENT]) {
### Allocator
Instead of using libc's `malloc`, `calloc`, `realloc`, and `free`, you should use the versions prefixed with `ts_` from `tree_sitter/alloc.h`.
These macros can allow a potential consumer to override the default allocator with their own implementation, but by default
will use the libc functions.
Instead of using libc's `malloc`, `calloc`, `realloc`, and `free`, you should use the versions prefixed with `ts_` from
`tree_sitter/alloc.h`. These macros can allow a potential consumer to override the default allocator with their own implementation,
but by default will use the libc functions.
As a consumer of the tree-sitter core library as well as any parser libraries that might use allocations, you can enable
overriding the default allocator and have it use the same one as the library allocator, which you can set with `ts_set_allocator`.
@ -195,7 +195,8 @@ To enable this overriding in scanners, you must compile them with the `TREE_SITT
the library must be linked into your final app dynamically, since it needs to resolve the internal functions at runtime.
If you are compiling an executable binary that uses the core library, but want to load parsers dynamically at runtime, then
you will have to use a special linker flag on Unix. For non-Darwin systems, that would be `--dynamic-list` and for Darwin
systems, that would be `-exported_symbols_list`. The CLI does exactly this, so you can use it as a reference (check out `cli/build.rs`).
systems, that would be `-exported_symbols_list`. The CLI does exactly this, so you can use it as a reference (check out
`cli/build.rs`).
For example, assuming you wanted to allocate 100 bytes for your scanner, you'd do so like the following example:
@ -293,9 +294,10 @@ bool tree_sitter_my_language_external_scanner_scan(
## Other External Scanner Details
External scanners have priority over Tree-sitter's normal lexing process. When a token listed in the externals array is valid
at a given position, the external scanner is called first. This makes external scanners a powerful way to override Tree-sitter's
default lexing behavior, especially for cases that can't be handled with regular lexical rules, parsing, or dynamic precedence.
External scanners have priority over Tree-sitter's normal lexing process. When a token listed in the externals array is
valid at a given position, the external scanner is called first. This makes external scanners a powerful way to override
Tree-sitter's default lexing behavior, especially for cases that can't be handled with regular lexical rules, parsing, or
dynamic precedence.
During error recovery, Tree-sitter's first step is to call the external scanner's scan function with all tokens marked as
valid. Your scanner should detect and handle this case appropriately. One simple approach is to add an unused "sentinel"

View file

@ -39,8 +39,8 @@ It only shows the *named* nodes, as described in [this section][named-vs-anonymo
```
The expected output section can also *optionally* show the [*field names*][node-field-names] associated with each child
node. To include field names in your tests, you write a node's field name followed by a colon, before the node itself in
the S-expression:
node. To include field names in your tests, you write a node's field name followed by a colon, before the node itself
in the S-expression:
```query
(source_file
@ -104,8 +104,8 @@ you can repeat the attribute on a new line.
The following attributes are available:
* `:cst` - This attribute specifies that the expected output should be in the form of a CST instead of the normal S-expression. This
CST matches the format given by `parse --cst`.
* `:cst` - This attribute specifies that the expected output should be in the form of a CST instead of the normal S-expression.
This CST matches the format given by `parse --cst`.
* `:error` — This attribute will assert that the parse tree contains an error. It's useful to just validate that a certain
input is invalid without displaying the whole parse tree; in that case, omit the parse tree below the `---` line.
* `:fail-fast` — This attribute will stop the testing of additional cases if the test marked with this attribute fails.

View file

@ -1,4 +1,4 @@
# Creating parsers
Developing Tree-sitter grammars can have a difficult learning curve, but once you get the hang of it, it can be fun and even
zen-like. This document will help you to get started and to develop a useful mental model.
Developing Tree-sitter grammars can have a difficult learning curve, but once you get the hang of it, it can be fun and
even zen-like. This document will help you to get started and to develop a useful mental model.

View file

@ -10,7 +10,8 @@ file and efficiently update the syntax tree as the source file is edited. Tree-s
- **General** enough to parse any programming language
- **Fast** enough to parse on every keystroke in a text editor
- **Robust** enough to provide useful results even in the presence of syntax errors
- **Dependency-free** so that the runtime library (which is written in pure [C11](https://github.com/tree-sitter/tree-sitter/tree/master/lib)) can be embedded in any application
- **Dependency-free** so that the runtime library (which is written in pure [C11](https://github.com/tree-sitter/tree-sitter/tree/master/lib))
can be embedded in any application
## Language Bindings

View file

@ -2,7 +2,8 @@
## Providing the Code
In the example on the previous page, we parsed source code stored in a simple string using the `ts_parser_parse_string` function:
In the example on the previous page, we parsed source code stored in a simple string using the `ts_parser_parse_string`
function:
```c
TSTree *ts_parser_parse_string(
@ -135,10 +136,10 @@ Consider a grammar rule like this:
if_statement: $ => seq("if", "(", $._expression, ")", $._statement);
```
A syntax node representing an `if_statement` in this language would have 5 children: the condition expression, the body statement,
as well as the `if`, `(`, and `)` tokens. The expression and the statement would be marked as _named_ nodes, because they
have been given explicit names in the grammar. But the `if`, `(`, and `)` nodes would _not_ be named nodes, because they
are represented in the grammar as simple strings.
A syntax node representing an `if_statement` in this language would have 5 children: the condition expression, the body
statement, as well as the `if`, `(`, and `)` tokens. The expression and the statement would be marked as _named_ nodes,
because they have been given explicit names in the grammar. But the `if`, `(`, and `)` nodes would _not_ be named nodes,
because they are represented in the grammar as simple strings.
You can check whether any given node is named:

View file

@ -19,8 +19,8 @@ typedef struct {
void ts_tree_edit(TSTree *, const TSInputEdit *);
```
Then, you can call `ts_parser_parse` again, passing in the old tree. This will create a new tree that internally shares structure
with the old tree.
Then, you can call `ts_parser_parse` again, passing in the old tree. This will create a new tree that internally shares
structure with the old tree.
When you edit a syntax tree, the positions of its nodes will change. If you have stored any `TSNode` instances outside of
the `TSTree`, you must update their positions separately, using the same `TSInputEdit` value, in order to update their

View file

@ -108,9 +108,9 @@ In Tree-sitter grammars, there are usually certain rules that represent abstract
"type", "declaration"). In the `grammar.js` file, these are often written as [hidden rules][hidden rules]
whose definition is a simple [`choice`][grammar dsl] where each member is just a single symbol.
Normally, hidden rules are not mentioned in the node types file, since they don't appear in the syntax tree. But if you add
a hidden rule to the grammar's [`supertypes` list][grammar dsl], then it _will_ show up in the node
types file, with the following special entry:
Normally, hidden rules are not mentioned in the node types file, since they don't appear in the syntax tree. But if you
add a hidden rule to the grammar's [`supertypes` list][grammar dsl], then it _will_ show up in the node types file, with
the following special entry:
- `"subtypes"` — An array of objects that specify the _types_ of nodes that this 'supertype' node can wrap.

View file

@ -15,8 +15,11 @@ A given version of the tree-sitter library is only able to load parsers generate
| >=0.20.3, <=0.24 | 13 | 14 |
| >=0.25 | 13 | 15 |
By default, the tree-sitter CLI will generate parsers using the latest available ABI for that version, but an older ABI (supported by the CLI) can be selected by passing the [`--abi` option][abi_option] to the `generate` command.
By default, the tree-sitter CLI will generate parsers using the latest available ABI for that version, but an older ABI
(supported by the CLI) can be selected by passing the [`--abi` option][abi_option] to the `generate` command.
Note that the ABI version range supported by the CLI can be smaller than for the library: When a new ABI version is released, older versions will be phased out over a deprecation period, which starts with no longer being able to generate parsers with the oldest ABI version.
Note that the ABI version range supported by the CLI can be smaller than for the library: When a new ABI version is released,
older versions will be phased out over a deprecation period, which starts with no longer being able to generate parsers
with the oldest ABI version.
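
The compatibility rule in the table can be sketched as a simple range check. This is only an illustration: `AbiRange` and `abi_is_loadable` are hypothetical names, not part of the Tree-sitter API, and the sketch assumes the two numeric columns of the table are the minimum and maximum parser ABI versions that a given library version can load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper type: the inclusive range of parser ABI versions
// that one build of the library is able to load.
typedef struct {
  uint32_t min_abi;
  uint32_t max_abi;
} AbiRange;

// Values copied from the last two rows of the table above
// (assumed to be the minimum and maximum parser ABI columns).
static const AbiRange LIB_0_20_3_TO_0_24 = {13, 14};
static const AbiRange LIB_0_25_PLUS = {13, 15};

// Returns true when a parser generated with `parser_abi` falls inside
// the range supported by the library.
static bool abi_is_loadable(AbiRange range, uint32_t parser_abi) {
  assert(range.min_abi <= range.max_abi);
  return parser_abi >= range.min_abi && parser_abi <= range.max_abi;
}
```

Under these assumptions, a parser generated with ABI 15 is accepted by `LIB_0_25_PLUS` but rejected by `LIB_0_20_3_TO_0_24`, which is the situation where generating with an older ABI through the `--abi` option becomes useful.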
[abi_option]: ../cli/generate.md#--abi-version

View file

@ -6,8 +6,8 @@ the core concepts remain the same.
Tree-sitter's parsing functionality is implemented through its C API, with all functions documented in the [tree_sitter/api.h][api.h]
header file, but if you're working in another language, you can use one of the following bindings found [here](../index.md#language-bindings),
each providing idiomatic access to Tree-sitter's functionality. Of these bindings, the official ones have their own API docs
hosted online at the following pages:
each providing idiomatic access to Tree-sitter's functionality. Of these bindings, the official ones have their own API
docs hosted online at the following pages:
- [Go][go]
- [Java]

View file

@ -1,9 +1,9 @@
# Query Syntax
A _query_ consists of one or more _patterns_, where each pattern is an [S-expression][s-exp] that matches a certain set of
nodes in a syntax tree. The expression to match a given node consists of a pair of parentheses containing two things: the
node's type, and optionally, a series of other S-expressions that match the node's children. For example, this pattern would
match any `binary_expression` node whose children are both `number_literal` nodes:
A _query_ consists of one or more _patterns_, where each pattern is an [S-expression][s-exp] that matches a certain set
of nodes in a syntax tree. The expression to match a given node consists of a pair of parentheses containing two things:
the node's type, and optionally, a series of other S-expressions that match the node's children. For example, this pattern
would match any `binary_expression` node whose children are both `number_literal` nodes:
```query
(binary_expression (number_literal) (number_literal))
@ -99,10 +99,10 @@ by `(ERROR)` queries. Specific missing node types can also be queried:
### Supertype Nodes
Some node types are marked as _supertypes_ in a grammar. A supertype is a node type that contains multiple
subtypes. For example, in the [JavaScript grammar example][grammar], `expression` is a supertype that can represent any kind
of expression, such as a `binary_expression`, `call_expression`, or `identifier`. You can use supertypes in queries to match
any of their subtypes, rather than having to list out each subtype individually. For example, this pattern would match any
kind of expression, even though it's not a visible node in the syntax tree:
subtypes. For example, in the [JavaScript grammar example][grammar], `expression` is a supertype that can represent any
kind of expression, such as a `binary_expression`, `call_expression`, or `identifier`. You can use supertypes in queries
to match any of their subtypes, rather than having to list out each subtype individually. For example, this pattern would
match any kind of expression, even though it's not a visible node in the syntax tree:
```query
(expression) @any-expression

View file

@ -128,15 +128,15 @@ This pattern would match any builtin variable that is not a local variable, beca
# Directives
Similar to predicates, directives are a way to associate arbitrary metadata with a pattern. The only difference between predicates
and directives is that directives end in a `!` character instead of `?` character.
Similar to predicates, directives are a way to associate arbitrary metadata with a pattern. The only difference between
predicates and directives is that directives end in a `!` character instead of a `?` character.
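
The naming convention can be checked mechanically by looking at the last character of a name. The following is a minimal sketch only; `NameKind` and `classify_name` are hypothetical helpers for illustration and do not exist in the Tree-sitter library.

```c
#include <assert.h>
#include <string.h>

// Hypothetical classification of a query predicate or directive name.
typedef enum { KIND_PREDICATE, KIND_DIRECTIVE, KIND_UNKNOWN } NameKind;

// Classifies a name such as "#eq?" or "#set!" by its final character:
// a trailing '?' marks a predicate, a trailing '!' marks a directive.
static NameKind classify_name(const char *name) {
  assert(name != NULL);
  size_t len = strlen(name);
  if (len == 0) return KIND_UNKNOWN;
  switch (name[len - 1]) {
    case '?': return KIND_PREDICATE;
    case '!': return KIND_DIRECTIVE;
    default:  return KIND_UNKNOWN;
  }
}
```

A host application that implements its own predicates and directives could use a check like this to decide whether a name filters matches or only attaches metadata.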
Tree-sitter's CLI supports the following directives by default:
## The `set!` directive
This directive allows you to associate key-value pairs with a pattern. The key and value can be any arbitrary text that you
see fit.
This directive allows you to associate key-value pairs with a pattern. The key and value can be any arbitrary text that
you see fit.
```query
((comment) @injection.content
@ -156,8 +156,8 @@ another capture are preserved. It takes two arguments, both of which are capture
### The `#strip!` directive
The `#strip!` directive allows you to remove text from a capture. It takes two arguments: the first is the capture to strip
text from, and the second is a regular expression to match against the text. Any text matched by the regular expression will
be removed from the text associated with the capture.
text from, and the second is a regular expression to match against the text. Any text matched by the regular expression
will be removed from the text associated with the capture.
For an example of the `#select-adjacent!` and `#strip!` directives,
view the [code navigation](../../4-code-navigation.md#examples) documentation.