Merge remote-tracking branch 'upstream/master' into fix/wasm32-malloc
commit 5fbb1b1ebd
30 changed files with 146 additions and 118 deletions
|
|
@ -7,7 +7,8 @@
|
|||
[npmjs.com]: https://www.npmjs.org/package/tree-sitter-cli
|
||||
[npmjs.com badge]: https://img.shields.io/npm/v/tree-sitter-cli.svg?color=%23BF4A4A
|
||||
|
||||
The Tree-sitter CLI allows you to develop, test, and use Tree-sitter grammars from the command line. It works on `MacOS`, `Linux`, and `Windows`.
|
||||
The Tree-sitter CLI allows you to develop, test, and use Tree-sitter grammars from the command line. It works on `MacOS`,
|
||||
`Linux`, and `Windows`.
|
||||
|
||||
### Installation
|
||||
|
||||
|
|
@ -34,9 +35,11 @@ The `tree-sitter` binary itself has no dependencies, but specific commands have
|
|||
|
||||
### Commands
|
||||
|
||||
* `generate` - The `tree-sitter generate` command will generate a Tree-sitter parser based on the grammar in the current working directory. See [the documentation] for more information.
|
||||
* `generate` - The `tree-sitter generate` command will generate a Tree-sitter parser based on the grammar in the current
|
||||
working directory. See [the documentation] for more information.
|
||||
|
||||
* `test` - The `tree-sitter test` command will run the unit tests for the Tree-sitter parser in the current working directory. See [the documentation] for more information.
|
||||
* `test` - The `tree-sitter test` command will run the unit tests for the Tree-sitter parser in the current working directory.
|
||||
See [the documentation] for more information.
|
||||
|
||||
* `parse` - The `tree-sitter parse` command will parse a file (or list of files) using Tree-sitter parsers.
|
||||
|
||||
|
|
|
|||
|
|
@ -14,7 +14,6 @@ extern void tree_sitter_debug_message(const char *, size_t);
|
|||
|
||||
#define PAGESIZE 0x10000
|
||||
#define MAX_HEAP_SIZE (4 * 1024 * 1024)
|
||||
#define MIN(a, b) ((a) < (b) ? (a) : (b))
|
||||
|
||||
typedef struct {
|
||||
size_t size;
|
||||
|
|
@ -151,7 +150,7 @@ void *realloc(void *ptr, size_t new_size) {
|
|||
return NULL;
|
||||
}
|
||||
|
||||
size_t copy_size = MIN(region->size, new_size);
|
||||
size_t copy_size = (region->size < new_size) ? region->size : new_size;
|
||||
memcpy(result, &region->data, copy_size);
|
||||
free(ptr);
|
||||
return result;
|
||||
|
|
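Note on the hunk above: the inline ternary that replaces `MIN` still copies the smaller of the old region size and the requested size, so a shrinking `realloc` never reads past the new buffer and a growing one never reads past the old one. The following is a minimal sketch of that pattern, not the code from this file. The `Region` layout and the `region_alloc`, `region_of`, `region_free`, and `region_realloc` helpers are hypothetical names introduced only for illustration, and the sketch sits on top of libc `malloc` rather than the Wasm heap used in the real file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical region header: the payload size, followed by the payload. */
typedef struct {
  size_t size;
  char data[];
} Region;

/* Hypothetical helper: allocate a region with `size` payload bytes. */
static void *region_alloc(size_t size) {
  Region *region = malloc(sizeof(Region) + size);
  if (!region) return NULL;
  region->size = size;
  return region->data;
}

/* Recover the header from a payload pointer returned by region_alloc. */
static Region *region_of(void *ptr) {
  assert(ptr != NULL);
  return (Region *)((char *)ptr - offsetof(Region, data));
}

static void region_free(void *ptr) {
  if (ptr) free(region_of(ptr));
}

/* Allocate a new region, copy the smaller of the two sizes, free the old one. */
static void *region_realloc(void *ptr, size_t new_size) {
  if (!ptr) return region_alloc(new_size);

  Region *region = region_of(ptr);
  void *result = region_alloc(new_size);
  if (!result) return NULL;

  /* Same expression as the added line in the hunk, without the MIN macro. */
  size_t copy_size = (region->size < new_size) ? region->size : new_size;
  memcpy(result, region->data, copy_size);
  region_free(ptr);
  return result;
}
```

Writing the comparison inline lets the file drop the `MIN` macro entirely, which matches the removal in the first hunk of the same file.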
|
|||
|
|
@ -73,9 +73,8 @@ The behaviors of these three files are described in the next section.
|
|||
|
||||
## Queries
|
||||
|
||||
Tree-sitter's syntax highlighting system is based on *tree queries*, which are a general system for pattern-matching on Tree-sitter's
|
||||
syntax trees. See [this section][pattern matching] of the documentation for more information
|
||||
about tree queries.
|
||||
Tree-sitter's syntax highlighting system is based on *tree queries*, which are a general system for pattern-matching on
|
||||
Tree-sitter's syntax trees. See [this section][pattern matching] of the documentation for more information about tree queries.
|
||||
|
||||
Syntax highlighting is controlled by *three* different types of query files that are usually included in the `queries` folder.
|
||||
The default names for the query files use the `.scm` file extension. We chose this extension because it is commonly used for files written
|
||||
|
|
|
|||
|
|
@ -3,7 +3,8 @@
|
|||
Tree-sitter can be used in conjunction with its [query language][query language] as a part of code navigation systems.
|
||||
An example of such a system can be seen in the `tree-sitter tags` command, which emits a textual dump of the interesting
|
||||
syntactic nodes in its file argument. A notable application of this is GitHub's support for [search-based code navigation][gh search].
|
||||
This document exists to describe how to integrate with such systems, and how to extend this functionality to any language with a Tree-sitter grammar.
|
||||
This document exists to describe how to integrate with such systems, and how to extend this functionality to any language
|
||||
with a Tree-sitter grammar.
|
||||
|
||||
## Tagging and captures
|
||||
|
||||
|
|
@ -12,9 +13,9 @@ entities. Having found them, you use a syntax capture to label the entity and it
|
|||
|
||||
The essence of a given tag lies in two pieces of data: the _role_ of the entity that is matched
|
||||
(i.e. whether it is a definition or a reference) and the _kind_ of that entity, which describes how the entity is used
|
||||
(i.e. whether it's a class definition, function call, variable reference, and so on). Our convention is to use a syntax capture
|
||||
following the `@role.kind` capture name format, and another inner capture, always called `@name`, that pulls out the name
|
||||
of a given identifier.
|
||||
(i.e. whether it's a class definition, function call, variable reference, and so on). Our convention is to use a syntax
|
||||
capture following the `@role.kind` capture name format, and another inner capture, always called `@name`, that pulls out
|
||||
the name of a given identifier.
|
||||
|
||||
You may optionally include a capture named `@doc` to bind a docstring. For convenience purposes, the tagging system provides
|
||||
two built-in functions, `#select-adjacent!` and `#strip!` that are convenient for removing comment syntax from a docstring.
|
||||
|
|
|
|||
|
|
@ -93,7 +93,8 @@ cargo xtask build-wasm-stdlib
|
|||
|
||||
This command looks for the [Wasi SDK][wasi_sdk] indicated by the `TREE_SITTER_WASI_SDK_PATH`
|
||||
environment variable. If you don't have the binary, it can be downloaded from wasi-sdk's [releases][wasi-sdk-releases]
|
||||
page.
|
||||
page. Note that any changes to `crates/language/wasm/**` requires rebuilding the tree-sitter Wasm stdlib via
|
||||
`cargo xtask build-wasm-stdlib`.
|
||||
|
||||
### Debugging
|
||||
|
||||
|
|
|
|||
|
|
@ -19,8 +19,8 @@ will attempt to build the parser in the current working directory.
|
|||
### `-w/--wasm`
|
||||
|
||||
Compile the parser as a Wasm module. This command looks for the [Wasi SDK][wasi_sdk] indicated by the `TREE_SITTER_WASI_SDK_PATH`
|
||||
environment variable. If you don't have the binary, the CLI will attempt to download it for you to `<CACHE_DIR>/tree-sitter/wasi-sdk/`, where
|
||||
`<CACHE_DIR>` is resolved according to the [XDG base directory][XDG] or Window's [Known_Folder_Locations][Known_Folder].
|
||||
environment variable. If you don't have the binary, the CLI will attempt to download it for you to `<CACHE_DIR>/tree-sitter/wasi-sdk/`,
|
||||
where `<CACHE_DIR>` is resolved according to the [XDG base directory][XDG] or Window's [Known_Folder_Locations][Known_Folder].
|
||||
|
||||
### `-o/--output`
|
||||
|
||||
|
|
@ -37,7 +37,8 @@ in the external scanner does so using their allocator.
|
|||
|
||||
### `-0/--debug`
|
||||
|
||||
Compile the parser with debug flags enabled. This is useful when debugging issues that require a debugger like `gdb` or `lldb`.
|
||||
Compile the parser with debug flags enabled. This is useful when debugging issues that require a debugger like `gdb` or
|
||||
`lldb`.
|
||||
|
||||
[Known_Folder]: https://learn.microsoft.com/en-us/windows/win32/shell/knownfolderid
|
||||
[wasi_sdk]: https://github.com/WebAssembly/wasi-sdk
|
||||
|
|
|
|||
|
|
@ -1,6 +1,8 @@
|
|||
# `tree-sitter dump-languages`
|
||||
|
||||
The `dump-languages` command prints out a list of all the languages that the CLI knows about. This can be useful for debugging purposes, or for scripting. The paths to search comes from the config file's [`parser-directories`][parser-directories] object.
|
||||
The `dump-languages` command prints out a list of all the languages that the CLI knows about. This can be useful for debugging
|
||||
purposes, or for scripting. The paths to search comes from the config file's [`parser-directories`][parser-directories]
|
||||
object.
|
||||
|
||||
```bash
|
||||
tree-sitter dump-languages [OPTIONS] # Aliases: langs
|
||||
|
|
@ -10,6 +12,7 @@ tree-sitter dump-languages [OPTIONS] # Aliases: langs
|
|||
|
||||
### `--config-path`
|
||||
|
||||
The path to the configuration file. Ordinarily, the CLI will use the default location as explained in the [init-config](./init-config.md) command. This flag allows you to explicitly override that default, and use a config defined elsewhere.
|
||||
The path to the configuration file. Ordinarily, the CLI will use the default location as explained in the [init-config](./init-config.md)
|
||||
command. This flag allows you to explicitly override that default, and use a config defined elsewhere.
|
||||
|
||||
[parser-directories]: ./init-config.md#parser-directories
|
||||
|
|
|
|||
|
|
@ -1,6 +1,7 @@
|
|||
# `tree-sitter generate`
|
||||
|
||||
The most important command for grammar development is `tree-sitter generate`, which reads the grammar in structured form and outputs C files that can be compiled into a shared or static library (e.g., using the [`build`](./build.md) command).
|
||||
The most important command for grammar development is `tree-sitter generate`, which reads the grammar in structured form
|
||||
and outputs C files that can be compiled into a shared or static library (e.g., using the [`build`](./build.md) command).
|
||||
|
||||
```bash
|
||||
tree-sitter generate [OPTIONS] [GRAMMAR_PATH] # Aliases: gen, g
|
||||
|
|
@ -8,7 +9,8 @@ tree-sitter generate [OPTIONS] [GRAMMAR_PATH] # Aliases: gen, g
|
|||
|
||||
The optional `GRAMMAR_PATH` argument should point to the structured grammar, in one of two forms:
|
||||
- `grammar.js` a (ESM or CJS) JavaScript file; if the argument is omitted, it defaults to `./grammar.js`.
|
||||
- `grammar.json` a structured representation of the grammar that is created as a byproduct of `generate`; this can be used to regenerate a missing `parser.c` without requiring a JavaScript runtime (useful when distributing parsers to consumers).
|
||||
- `grammar.json` a structured representation of the grammar that is created as a byproduct of `generate`; this can be used
|
||||
to regenerate a missing `parser.c` without requiring a JavaScript runtime (useful when distributing parsers to consumers).
|
||||
|
||||
If there is an ambiguity or *local ambiguity* in your grammar, Tree-sitter will detect it during parser generation, and
|
||||
it will exit with an `Unresolved conflict` error message. To learn more about conflicts and how to handle them, see
|
||||
|
|
@ -21,7 +23,8 @@ in the user guide.
|
|||
- `src/tree_sitter/parser.h` provides basic C definitions that are used in the generated `parser.c` file.
|
||||
- `src/tree_sitter/alloc.h` provides memory allocation macros that can be used in an external scanner.
|
||||
- `src/tree_sitter/array.h` provides array macros that can be used in an external scanner.
|
||||
- `src/grammar.json` contains a structured representation of the grammar; can be used to regenerate the parser without having to re-evaluate the `grammar.js`.
|
||||
- `src/grammar.json` contains a structured representation of the grammar; can be used to regenerate the parser without having
|
||||
to re-evaluate the `grammar.js`.
|
||||
- `src/node-types.json` provides type information about individual syntax nodes; see the section on [`Static Node Types`](../using-parsers/6-static-node-types.md).
|
||||
|
||||
|
||||
|
|
@ -29,8 +32,8 @@ in the user guide.
|
|||
|
||||
### `-l/--log`
|
||||
|
||||
Print the log of the parser generation process. This includes information such as what tokens are included in the error recovery state,
|
||||
what keywords were extracted, what states were split and why, and the entry point state.
|
||||
Print the log of the parser generation process. This includes information such as what tokens are included in the error
|
||||
recovery state, what keywords were extracted, what states were split and why, and the entry point state.
|
||||
|
||||
### `--abi <VERSION>`
|
||||
|
||||
|
|
@ -60,7 +63,8 @@ The path to the JavaScript runtime executable to use when generating the parser.
|
|||
Note that you can also set this with `TREE_SITTER_JS_RUNTIME`. Starting from version 0.26, you can
|
||||
also pass in `native` to use the experimental native QuickJS runtime that comes bundled with the CLI.
|
||||
This avoids the dependency on a JavaScript runtime entirely. The native QuickJS runtime is compatible
|
||||
with ESM as well as with CommonJS in strict mode. If your grammar depends on `npm` to install dependencies such as base grammars, the native runtime can be used *after* running `npm install`.
|
||||
with ESM as well as with CommonJS in strict mode. If your grammar depends on `npm` to install dependencies such as base
|
||||
grammars, the native runtime can be used *after* running `npm install`.
|
||||
|
||||
### `--disable-optimization`
|
||||
|
||||
|
|
|
|||
|
|
@ -52,7 +52,8 @@ The path to the directory containing the grammar.
|
|||
|
||||
### `--config-path <CONFIG_PATH>`
|
||||
|
||||
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more information.
|
||||
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more
|
||||
information.
|
||||
|
||||
### `-n/--test-number <TEST_NUMBER>`
|
||||
|
||||
|
|
|
|||
|
|
@ -1,6 +1,7 @@
|
|||
# CLI Overview
|
||||
|
||||
The `tree-sitter` command-line interface is used to create, manage, test, and build tree-sitter parsers. It is controlled by
|
||||
The `tree-sitter` command-line interface is used to create, manage, test, and build tree-sitter parsers. It is controlled
|
||||
by
|
||||
|
||||
- a personal `tree-sitter/config.json` config file generated by [`tree-sitter init-config`](./init-config.md)
|
||||
- a parser `tree-sitter.json` config file generated by [`tree-sitter init`](./init.md).
|
||||
|
|
|
|||
|
|
@ -14,8 +14,11 @@ tree-sitter init [OPTIONS] # Aliases: i
|
|||
|
||||
The following required files are always created if missing:
|
||||
|
||||
- `tree-sitter.json` - The main configuration file that determines how `tree-sitter` interacts with the grammar. If missing, the `init` command will prompt the user for the required fields. See [below](./init.md#structure-of-tree-sitterjson) for the full documentation of the structure of this file.
|
||||
- `package.json` - The `npm` manifest for the parser. This file is required for some `tree-sitter` subcommands, and if the grammar has dependencies (e.g., another published base grammar that this grammar extends).
|
||||
- `tree-sitter.json` - The main configuration file that determines how `tree-sitter` interacts with the grammar. If missing,
|
||||
the `init` command will prompt the user for the required fields. See [below](./init.md#structure-of-tree-sitterjson) for
|
||||
the full documentation of the structure of this file.
|
||||
- `package.json` - The `npm` manifest for the parser. This file is required for some `tree-sitter` subcommands, and if the
|
||||
grammar has dependencies (e.g., another published base grammar that this grammar extends).
|
||||
- `grammar.js` - An empty template for the main grammar file; see [the section on creating parsers](../2-creating-parser).
|
||||
|
||||
### Language bindings
|
||||
|
|
@ -130,8 +133,8 @@ be picked up by the cli.
|
|||
|
||||
These keys help to decide whether the language applies to a given file:
|
||||
|
||||
- `file-types` — An array of filename suffix strings (not including the dot). The grammar will be used for files whose names end with one of
|
||||
these suffixes. Note that the suffix may match an *entire* filename.
|
||||
- `file-types` — An array of filename suffix strings (not including the dot). The grammar will be used for files whose names
|
||||
end with one of these suffixes. Note that the suffix may match an *entire* filename.
|
||||
|
||||
- `first-line-regex` — A regex pattern that will be tested against the first line of a file
|
||||
to determine whether this language applies to the file. If present, this regex will be used for any file whose
|
||||
|
|
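Note on the `file-types` description above: the suffix rule can be read as "the file name ends with the suffix, or the suffix is the whole file name". The sketch below illustrates one plausible reading and is not the loader's actual implementation. The function name `matches_file_type` is hypothetical, and requiring a dot immediately before a partial match is an assumption drawn from the phrase "not including the dot".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Assumed rule: a file matches when its name equals the suffix, or when it
 * ends with "." followed by the suffix.
 */
static bool matches_file_type(const char *filename, const char *suffix) {
  assert(filename != NULL && suffix != NULL);

  size_t name_len = strlen(filename);
  size_t suffix_len = strlen(suffix);
  if (suffix_len == 0 || suffix_len > name_len) return false;

  /* The name must end with the suffix. */
  if (strcmp(filename + (name_len - suffix_len), suffix) != 0) return false;

  /* The suffix may match the entire filename, e.g. "Makefile". */
  if (suffix_len == name_len) return true;

  /* Otherwise the character just before the suffix must be the dot. */
  return filename[name_len - suffix_len - 1] == '.';
}
```

Under this reading, `"rs"` matches `main.rs` but not `cars`, and `"Makefile"` matches a file named exactly `Makefile`.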
@ -188,7 +191,8 @@ Each key is a language name, and the value is a boolean.
|
|||
|
||||
Update outdated generated files, if possible.
|
||||
|
||||
**Note:** Existing files that may have been edited manually are _not_ updated in general. To force an update to such files, remove them and call `tree-sitter init -u` again.
|
||||
**Note:** Existing files that may have been edited manually are _not_ updated in general. To force an update to such files,
|
||||
remove them and call `tree-sitter init -u` again.
|
||||
|
||||
### `-p/--grammar-path <PATH>`
|
||||
|
||||
|
|
|
|||
|
|
@ -78,7 +78,8 @@ Suppress main output.
|
|||
|
||||
### `--edits <EDITS>...`
|
||||
|
||||
Apply edits after parsing the file. Edits are in the form of `row,col|position delcount insert_text` where row and col, or position are 0-indexed.
|
||||
Apply edits after parsing the file. Edits are in the form of `row,col|position delcount insert_text` where row and col,
|
||||
or position are 0-indexed.
|
||||
|
||||
### `--encoding <ENCODING>`
|
||||
|
||||
|
|
@ -95,7 +96,8 @@ Output parsing results in a JSON format.
|
|||
|
||||
### `--config-path <CONFIG_PATH>`
|
||||
|
||||
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more information.
|
||||
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more
|
||||
information.
|
||||
|
||||
### `-n/--test-number <TEST_NUMBER>`
|
||||
|
||||
|
|
|
|||
|
|
@ -7,8 +7,8 @@ tree-sitter playground [OPTIONS] # Aliases: play, pg, web-ui
|
|||
```
|
||||
|
||||
```admonish note
|
||||
For this to work, you must have already built the parser as a Wasm module. This can be done with the [`build`](./build.md) subcommand
|
||||
(`tree-sitter build --wasm`).
|
||||
For this to work, you must have already built the parser as a Wasm module. This can be done with the [`build`](./build.md)
|
||||
subcommand (`tree-sitter build --wasm`).
|
||||
```
|
||||
|
||||
## Options
|
||||
|
|
|
|||
|
|
@ -47,8 +47,8 @@ The range of rows in which the query will be executed. The format is `start_row:
|
|||
|
||||
### `--containing-row-range <ROW_RANGE>`
|
||||
|
||||
The range of rows in which the query will be executed. Only the matches that are fully contained within the provided row range
|
||||
will be returned.
|
||||
The range of rows in which the query will be executed. Only the matches that are fully contained within the provided row
|
||||
range will be returned.
|
||||
|
||||
### `--scope <SCOPE>`
|
||||
|
||||
|
|
@ -64,7 +64,8 @@ Whether to run query tests or not.
|
|||
|
||||
### `--config-path <CONFIG_PATH>`
|
||||
|
||||
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more information.
|
||||
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more
|
||||
information.
|
||||
|
||||
### `-n/--test-number <TEST_NUMBER>`
|
||||
|
||||
|
|
|
|||
|
|
@ -31,7 +31,8 @@ The path to the directory containing the grammar.
|
|||
|
||||
### `--config-path <CONFIG_PATH>`
|
||||
|
||||
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more information.
|
||||
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more
|
||||
information.
|
||||
|
||||
### `-n/--test-number <TEST_NUMBER>`
|
||||
|
||||
|
|
|
|||
|
|
@ -63,7 +63,8 @@ When using the `--debug-graph` option, open the log file in the default browser.
|
|||
|
||||
### `--config-path <CONFIG_PATH>`
|
||||
|
||||
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more information.
|
||||
The path to an alternative configuration (`config.json`) file. See [the init-config command](./init-config.md) for more
|
||||
information.
|
||||
|
||||
### `--show-fields`
|
||||
|
||||
|
|
|
|||
|
|
@ -25,11 +25,9 @@ tree-sitter version --bump minor # minor bump
|
|||
tree-sitter version --bump major # major bump
|
||||
```
|
||||
|
||||
As a grammar author, you should keep the version of your grammar in sync across
|
||||
different bindings. However, doing so manually is error-prone and tedious, so
|
||||
this command takes care of the burden. If you are using a version control system,
|
||||
it is recommended to commit the changes made by this command, and to tag the
|
||||
commit with the new version.
|
||||
As a grammar author, you should keep the version of your grammar in sync across different bindings. However, doing so manually
|
||||
is error-prone and tedious, so this command takes care of the burden. If you are using a version control system, it is recommended
|
||||
to commit the changes made by this command, and to tag the commit with the new version.
|
||||
|
||||
To print the current version without bumping it, use:
|
||||
|
||||
|
|
|
|||
|
|
@ -17,8 +17,8 @@ DSL through the `RustRegex` class. Simply pass your regex pattern as a string:
|
|||
```
|
||||
|
||||
Unlike JavaScript's builtin `RegExp` class, which takes a pattern and flags as separate arguments, `RustRegex` only
|
||||
accepts a single pattern string. While it doesn't support separate flags, you can use inline flags within the pattern itself.
|
||||
For more details about Rust's regex syntax and capabilities, check out the [Rust regex documentation][rust regex].
|
||||
accepts a single pattern string. While it doesn't support separate flags, you can use inline flags within the pattern
|
||||
itself. For more details about Rust's regex syntax and capabilities, check out the [Rust regex documentation][rust regex].
|
||||
|
||||
```admonish note
|
||||
Only a subset of the Regex engine is actually supported. This is due to certain features like lookahead and lookaround
|
||||
|
|
@ -50,10 +50,10 @@ The previous `repeat` rule is implemented in `repeat1` but is included because i
|
|||
- **Options : `optional(rule)`** — This function creates a rule that matches *zero or one* occurrence of a given rule.
|
||||
It is analogous to the `[x]` (square bracket) syntax in EBNF notation.
|
||||
|
||||
- **Precedence : `prec(number, rule)`** — This function marks the given rule with a numerical precedence, which will be used
|
||||
to resolve [*LR(1) Conflicts*][lr-conflict] at parser-generation time. When two rules overlap in a way that represents either
|
||||
a true ambiguity or a *local* ambiguity given one token of lookahead, Tree-sitter will try to resolve the conflict by matching
|
||||
the rule with the higher precedence. The default precedence of all rules is zero. This works similarly to the
|
||||
- **Precedence : `prec(number, rule)`** — This function marks the given rule with a numerical precedence, which will be
|
||||
used to resolve [*LR(1) Conflicts*][lr-conflict] at parser-generation time. When two rules overlap in a way that represents
|
||||
either a true ambiguity or a *local* ambiguity given one token of lookahead, Tree-sitter will try to resolve the conflict
|
||||
by matching the rule with the higher precedence. The default precedence of all rules is zero. This works similarly to the
|
||||
[precedence directives][yacc-prec] in Yacc grammars.
|
||||
|
||||
This function can also be used to assign lexical precedence to a given
|
||||
|
|
@ -115,8 +115,8 @@ want to create syntax tree nodes at runtime.
|
|||
|
||||
- **`conflicts`** — an array of arrays of rule names. Each inner array represents a set of rules that's involved in an
|
||||
*LR(1) conflict* that is *intended to exist* in the grammar. When these conflicts occur at runtime, Tree-sitter will use
|
||||
the GLR algorithm to explore all the possible interpretations. If *multiple* parses end up succeeding, Tree-sitter will pick
|
||||
the subtree whose corresponding rule has the highest total *dynamic precedence*.
|
||||
the GLR algorithm to explore all the possible interpretations. If *multiple* parses end up succeeding, Tree-sitter will
|
||||
pick the subtree whose corresponding rule has the highest total *dynamic precedence*.
|
||||
|
||||
- **`externals`** — an array of token names which can be returned by an
|
||||
[*external scanner*][external-scanners]. External scanners allow you to write custom C code which runs during the lexing
|
||||
|
|
@ -139,10 +139,10 @@ for more details.
|
|||
array of reserved rules. The reserved rule in the array must be a terminal token meaning it must be a string, regex, token,
|
||||
or terminal rule. The reserved rule must also exist and be used in the grammar, specifying arbitrary tokens will not work.
|
||||
The *first* reserved word set in the object is the global word set, meaning it applies to every rule in every parse state.
|
||||
However, certain keywords are contextual, depending on the rule. For example, in JavaScript, keywords are typically not allowed
|
||||
as ordinary variables, however, they *can* be used as a property name. In this situation, the `reserved` function would be used,
|
||||
and the word set to pass in would be the name of the word set that is declared in the `reserved` object that corresponds to an
|
||||
empty array, signifying *no* keywords are reserved.
|
||||
However, certain keywords are contextual, depending on the rule. For example, in JavaScript, keywords are typically not
|
||||
allowed as ordinary variables, however, they *can* be used as a property name. In this situation, the `reserved` function
|
||||
would be used, and the word set to pass in would be the name of the word set that is declared in the `reserved` object that
|
||||
corresponds to an empty array, signifying *no* keywords are reserved.
|
||||
|
||||
[bison-dprec]: https://www.gnu.org/software/bison/manual/html_node/Generalized-LR-Parsing.html
|
||||
[ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
|
||||
|
|
|
|||
|
|
@ -1,7 +1,7 @@
|
|||
# Writing the Grammar
|
||||
|
||||
Writing a grammar requires creativity. There are an infinite number of CFGs (context-free grammars) that can be used to describe
|
||||
any given language. To produce a good Tree-sitter parser, you need to create a grammar with two important properties:
|
||||
Writing a grammar requires creativity. There are an infinite number of CFGs (context-free grammars) that can be used to
|
||||
describe any given language. To produce a good Tree-sitter parser, you need to create a grammar with two important properties:
|
||||
|
||||
1. **An intuitive structure** — Tree-sitter's output is a [concrete syntax tree][cst]; each node in the tree corresponds
|
||||
directly to a [terminal or non-terminal symbol][non-terminal] in the grammar. So to produce an easy-to-analyze tree, there
|
||||
|
|
@ -139,8 +139,8 @@ instead. It's often useful to check your progress by trying to parse some real c
|
|||
## Structuring Rules Well
|
||||
|
||||
Imagine that you were just starting work on the [Tree-sitter JavaScript parser][tree-sitter-javascript]. Naively, you might
|
||||
try to directly mirror the structure of the [ECMAScript Language Spec][ecmascript-spec]. To illustrate the problem with this
|
||||
approach, consider the following line of code:
|
||||
try to directly mirror the structure of the [ECMAScript Language Spec][ecmascript-spec]. To illustrate the problem with
|
||||
this approach, consider the following line of code:
|
||||
|
||||
```js
|
||||
return x + y;
|
||||
|
|
@ -181,16 +181,17 @@ which are unrelated to the actual code.
|
|||
|
||||
## Standard Rule Names
|
||||
|
||||
Tree-sitter places no restrictions on how to name the rules of your grammar. It can be helpful, however, to follow certain conventions
|
||||
used by many other established grammars in the ecosystem. Some of these well-established patterns are listed below:
|
||||
Tree-sitter places no restrictions on how to name the rules of your grammar. It can be helpful, however, to follow certain
|
||||
conventions used by many other established grammars in the ecosystem. Some of these well-established patterns are listed
|
||||
below:
|
||||
|
||||
- `source_file`: Represents an entire source file; this rule is commonly used as the root node for a grammar.
|
||||
- `expression`/`statement`: Used to represent statements and expressions for a given language. Commonly defined as a choice between several
|
||||
more specific sub-expression/sub-statement rules.
|
||||
- `expression`/`statement`: Used to represent statements and expressions for a given language. Commonly defined as a choice
|
||||
between several more specific sub-expression/sub-statement rules.
|
||||
- `block`: Used as the parent node for block scopes, with its children representing the block's contents.
|
||||
- `type`: Represents the types of a language such as `int`, `char`, and `void`.
|
||||
- `identifier`: Used for constructs like variable names, function arguments, and object fields; this rule is commonly used as the `word`
|
||||
token in grammars.
|
||||
- `identifier`: Used for constructs like variable names, function arguments, and object fields; this rule is commonly used
|
||||
as the `word` token in grammars.
|
||||
- `string`: Used to represent `"string literals"`.
|
||||
- `comment`: Used to represent comments, this rule is commonly used as an `extra`.
|
||||
|
||||
|
|
@ -308,9 +309,9 @@ This is where `prec.left` and `prec.right` come into use. We want to select the
|
|||
|
||||
## Using Conflicts
|
||||
|
||||
Sometimes, conflicts are actually desirable. In our JavaScript grammar, expressions and patterns can create intentional ambiguity.
|
||||
A construct like `[x, y]` could be legitimately parsed as both an array literal (like in `let a = [x, y]`) or as a destructuring
|
||||
pattern (like in `let [x, y] = arr`).
|
||||
Sometimes, conflicts are actually desirable. In our JavaScript grammar, expressions and patterns can create intentional
|
||||
ambiguity. A construct like `[x, y]` could be legitimately parsed as both an array literal (like in `let a = [x, y]`) or
|
||||
as a destructuring pattern (like in `let [x, y] = arr`).
|
||||
|
||||
```js
|
||||
export default grammar({
|
||||
|
|
@ -564,8 +565,8 @@ as mentioned in the previous page, is `token(prec(N, ...))`.
|
|||
## Keywords
|
||||
|
||||
Many languages have a set of _keyword_ tokens (e.g. `if`, `for`, `return`), as well as a more general token (e.g. `identifier`)
|
||||
that matches any word, including many of the keyword strings. For example, JavaScript has a keyword `instanceof`, which is
|
||||
used as a binary operator, like this:
|
||||
that matches any word, including many of the keyword strings. For example, JavaScript has a keyword `instanceof`, which
|
||||
is used as a binary operator, like this:
|
||||
|
||||
```js
|
||||
if (a instanceof Something) b();
|
||||
|
|
|
|||
|
|
@ -143,10 +143,10 @@ the second argument, the current character will be treated as whitespace; whites
|
|||
associated with tokens emitted by the external scanner.
|
||||
|
||||
- **`void (*mark_end)(TSLexer *)`** — A function for marking the end of the recognized token. This allows matching tokens
|
||||
that require multiple characters of lookahead. By default, (if you don't call `mark_end`), any character that you moved past
|
||||
using the `advance` function will be included in the size of the token. But once you call `mark_end`, then any later calls
|
||||
to `advance` will _not_ increase the size of the returned token. You can call `mark_end` multiple times to increase the size
|
||||
of the token.
|
||||
that require multiple characters of lookahead. By default, (if you don't call `mark_end`), any character that you moved
|
||||
past using the `advance` function will be included in the size of the token. But once you call `mark_end`, then any later
|
||||
calls to `advance` will _not_ increase the size of the returned token. You can call `mark_end` multiple times to increase
|
||||
the size of the token.
|
||||
|
||||
- **`uint32_t (*get_column)(TSLexer *)`** — A function for querying the current column position of the lexer. It returns
|
||||
the number of codepoints since the start of the current line. The codepoint position is recalculated on every call to this
|
||||
|
|
@ -185,9 +185,9 @@ if (valid_symbols[INDENT] || valid_symbols[DEDENT]) {
|
|||
|
||||
### Allocator
|
||||
|
||||
Instead of using libc's `malloc`, `calloc`, `realloc`, and `free`, you should use the versions prefixed with `ts_` from `tree_sitter/alloc.h`.
|
||||
These macros can allow a potential consumer to override the default allocator with their own implementation, but by default
|
||||
will use the libc functions.
|
||||
Instead of using libc's `malloc`, `calloc`, `realloc`, and `free`, you should use the versions prefixed with `ts_` from
|
||||
`tree_sitter/alloc.h`. These macros can allow a potential consumer to override the default allocator with their own implementation,
|
||||
but by default will use the libc functions.
|
||||
|
||||
As a consumer of the tree-sitter core library as well as any parser libraries that might use allocations, you can enable
|
||||
overriding the default allocator and have it use the same one as the library allocator, which you can set with `ts_set_allocator`.
|
||||
|
|
@ -195,7 +195,8 @@ To enable this overriding in scanners, you must compile them with the `TREE_SITT
|
|||
the library must be linked into your final app dynamically, since it needs to resolve the internal functions at runtime.
|
||||
If you are compiling an executable binary that uses the core library, but want to load parsers dynamically at runtime, then
|
||||
you will have to use a special linker flag on Unix. For non-Darwin systems, that would be `--dynamic-list` and for Darwin
|
||||
systems, that would be `-exported_symbols_list`. The CLI does exactly this, so you can use it as a reference (check out `cli/build.rs`).
|
||||
systems, that would be `-exported_symbols_list`. The CLI does exactly this, so you can use it as a reference (check out
|
||||
`cli/build.rs`).
|
||||
|
||||
For example, assuming you wanted to allocate 100 bytes for your scanner, you'd do so like the following example:
|
||||
|
||||
|
|
@ -293,9 +294,10 @@ bool tree_sitter_my_language_external_scanner_scan(
|
|||
|
||||
## Other External Scanner Details
|
||||
|
||||
External scanners have priority over Tree-sitter's normal lexing process. When a token listed in the externals array is valid
|
||||
at a given position, the external scanner is called first. This makes external scanners a powerful way to override Tree-sitter's
|
||||
default lexing behavior, especially for cases that can't be handled with regular lexical rules, parsing, or dynamic precedence.
|
||||
External scanners have priority over Tree-sitter's normal lexing process. When a token listed in the externals array is
|
||||
valid at a given position, the external scanner is called first. This makes external scanners a powerful way to override
|
||||
Tree-sitter's default lexing behavior, especially for cases that can't be handled with regular lexical rules, parsing, or
|
||||
dynamic precedence.
|
||||
|
||||
During error recovery, Tree-sitter's first step is to call the external scanner's scan function with all tokens marked as
|
||||
valid. Your scanner should detect and handle this case appropriately. One simple approach is to add an unused "sentinel"
|
||||
|
|
|
|||
|
|
@ -39,8 +39,8 @@ It only shows the *named* nodes, as described in [this section][named-vs-anonymo
|
|||
```
|
||||
|
||||
The expected output section can also *optionally* show the [*field names*][node-field-names] associated with each child
|
||||
node. To include field names in your tests, you write a node's field name followed by a colon, before the node itself in
|
||||
the S-expression:
|
||||
node. To include field names in your tests, you write a node's field name followed by a colon, before the node itself
|
||||
in the S-expression:
|
||||
|
||||
```query
|
||||
(source_file
|
||||
|
|
@ -104,8 +104,8 @@ you can repeat the attribute on a new line.
|
|||
|
||||
The following attributes are available:
|
||||
|
||||
* `:cst` - This attribute specifies that the expected output should be in the form of a CST instead of the normal S-expression. This
|
||||
CST matches the format given by `parse --cst`.
|
||||
* `:cst` - This attribute specifies that the expected output should be in the form of a CST instead of the normal S-expression.
|
||||
This CST matches the format given by `parse --cst`.
|
||||
* `:error` — This attribute will assert that the parse tree contains an error. It's useful to just validate that a certain
|
||||
input is invalid without displaying the whole parse tree, as such you should omit the parse tree below the `---` line.
|
||||
* `:fail-fast` — This attribute will stop the testing of additional cases if the test marked with this attribute fails.
|
||||
|
|
|
|||
|
|
@ -1,4 +1,4 @@
|
|||
# Creating parsers
|
||||
|
||||
Developing Tree-sitter grammars can have a difficult learning curve, but once you get the hang of it, it can be fun and even
|
||||
zen-like. This document will help you to get started and to develop a useful mental model.
|
||||
Developing Tree-sitter grammars can have a difficult learning curve, but once you get the hang of it, it can be fun and
|
||||
even zen-like. This document will help you to get started and to develop a useful mental model.
|
||||
|
|
|
|||
|
|
@ -10,7 +10,8 @@ file and efficiently update the syntax tree as the source file is edited. Tree-s
|
|||
- **General** enough to parse any programming language
|
||||
- **Fast** enough to parse on every keystroke in a text editor
|
||||
- **Robust** enough to provide useful results even in the presence of syntax errors
|
||||
- **Dependency-free** so that the runtime library (which is written in pure [C11](https://github.com/tree-sitter/tree-sitter/tree/master/lib)) can be embedded in any application
|
||||
- **Dependency-free** so that the runtime library (which is written in pure [C11](https://github.com/tree-sitter/tree-sitter/tree/master/lib))
|
||||
can be embedded in any application
|
||||
|
||||
## Language Bindings
|
||||
|
||||
|
|
|
|||
|
|
@ -2,7 +2,8 @@
|
|||
|
||||
## Providing the Code
|
||||
|
||||
In the example on the previous page, we parsed source code stored in a simple string using the `ts_parser_parse_string` function:
|
||||
In the example on the previous page, we parsed source code stored in a simple string using the `ts_parser_parse_string`
|
||||
function:
|
||||
|
||||
```c
|
||||
TSTree *ts_parser_parse_string(
|
||||
|
|
@ -135,10 +136,10 @@ Consider a grammar rule like this:
|
|||
if_statement: $ => seq("if", "(", $._expression, ")", $._statement);
|
||||
```
|
||||
|
||||
A syntax node representing an `if_statement` in this language would have 5 children: the condition expression, the body statement,
|
||||
as well as the `if`, `(`, and `)` tokens. The expression and the statement would be marked as _named_ nodes, because they
|
||||
have been given explicit names in the grammar. But the `if`, `(`, and `)` nodes would _not_ be named nodes, because they
|
||||
are represented in the grammar as simple strings.
|
||||
A syntax node representing an `if_statement` in this language would have 5 children: the condition expression, the body
|
||||
statement, as well as the `if`, `(`, and `)` tokens. The expression and the statement would be marked as _named_ nodes,
|
||||
because they have been given explicit names in the grammar. But the `if`, `(`, and `)` nodes would _not_ be named nodes,
|
||||
because they are represented in the grammar as simple strings.
|
||||
|
||||
You can check whether any given node is named:
|
||||
|
||||
|
|
|
|||
|
|
@ -19,8 +19,8 @@ typedef struct {
|
|||
void ts_tree_edit(TSTree *, const TSInputEdit *);
|
||||
```
|
||||
|
||||
Then, you can call `ts_parser_parse` again, passing in the old tree. This will create a new tree that internally shares structure
|
||||
with the old tree.
|
||||
Then, you can call `ts_parser_parse` again, passing in the old tree. This will create a new tree that internally shares
|
||||
structure with the old tree.
|
||||
|
||||
When you edit a syntax tree, the positions of its nodes will change. If you have stored any `TSNode` instances outside of
|
||||
the `TSTree`, you must update their positions separately, using the same `TSInputEdit` value, in order to update their
|
||||
|
|
|
|||
|
|
@ -108,9 +108,9 @@ In Tree-sitter grammars, there are usually certain rules that represent abstract
|
|||
"type", "declaration"). In the `grammar.js` file, these are often written as [hidden rules][hidden rules]
|
||||
whose definition is a simple [`choice`][grammar dsl] where each member is just a single symbol.
|
||||
|
||||
Normally, hidden rules are not mentioned in the node types file, since they don't appear in the syntax tree. But if you add
|
||||
a hidden rule to the grammar's [`supertypes` list][grammar dsl], then it _will_ show up in the node
|
||||
types file, with the following special entry:
|
||||
Normally, hidden rules are not mentioned in the node types file, since they don't appear in the syntax tree. But if you
|
||||
add a hidden rule to the grammar's [`supertypes` list][grammar dsl], then it _will_ show up in the node types file, with
|
||||
the following special entry:
|
||||
|
||||
- `"subtypes"` — An array of objects that specify the _types_ of nodes that this 'supertype' node can wrap.
|
||||
|
||||
|
|
|
|||
|
|
@ -15,8 +15,11 @@ A given version of the tree-sitter library is only able to load parsers generate
|
|||
| >=0.20.3, <=0.24 | 13 | 14 |
|
||||
| >=0.25 | 13 | 15 |
|
||||
|
||||
By default, the tree-sitter CLI will generate parsers using the latest available ABI for that version, but an older ABI (supported by the CLI) can be selected by passing the [`--abi` option][abi_option] to the `generate` command.
|
||||
By default, the tree-sitter CLI will generate parsers using the latest available ABI for that version, but an older ABI
|
||||
(supported by the CLI) can be selected by passing the [`--abi` option][abi_option] to the `generate` command.
|
||||
|
||||
Note that the ABI version range supported by the CLI can be smaller than for the library: When a new ABI version is released, older versions will be phased out over a deprecation period, which starts with no longer being able to generate parsers with the oldest ABI version.
|
||||
Note that the ABI version range supported by the CLI can be smaller than for the library: When a new ABI version is released,
|
||||
older versions will be phased out over a deprecation period, which starts with no longer being able to generate parsers
|
||||
with the oldest ABI version.
|
||||
|
||||
[abi_option]: ../cli/generate.md#--abi-version
|
||||
|
|
|
|||
|
|
@ -6,8 +6,8 @@ the core concepts remain the same.
|
|||
|
||||
Tree-sitter's parsing functionality is implemented through its C API, with all functions documented in the [tree_sitter/api.h][api.h]
|
||||
header file, but if you're working in another language, you can use one of the following bindings found [here](../index.md#language-bindings),
|
||||
each providing idiomatic access to Tree-sitter's functionality. Of these bindings, the official ones have their own API docs
|
||||
hosted online at the following pages:
|
||||
each providing idiomatic access to Tree-sitter's functionality. Of these bindings, the official ones have their own API
|
||||
docs hosted online at the following pages:
|
||||
|
||||
- [Go][go]
|
||||
- [Java]
|
||||
|
|
|
|||
|
|
@ -1,9 +1,9 @@
|
|||
# Query Syntax
|
||||
|
||||
A _query_ consists of one or more _patterns_, where each pattern is an [S-expression][s-exp] that matches a certain set of
|
||||
nodes in a syntax tree. The expression to match a given node consists of a pair of parentheses containing two things: the
|
||||
node's type, and optionally, a series of other S-expressions that match the node's children. For example, this pattern would
|
||||
match any `binary_expression` node whose children are both `number_literal` nodes:
|
||||
A _query_ consists of one or more _patterns_, where each pattern is an [S-expression][s-exp] that matches a certain set
|
||||
of nodes in a syntax tree. The expression to match a given node consists of a pair of parentheses containing two things:
|
||||
the node's type, and optionally, a series of other S-expressions that match the node's children. For example, this pattern
|
||||
would match any `binary_expression` node whose children are both `number_literal` nodes:
|
||||
|
||||
```query
|
||||
(binary_expression (number_literal) (number_literal))
|
||||
|
|
@ -99,10 +99,10 @@ by `(ERROR)` queries. Specific missing node types can also be queried:
|
|||
### Supertype Nodes
|
||||
|
||||
Some node types are marked as _supertypes_ in a grammar. A supertype is a node type that contains multiple
|
||||
subtypes. For example, in the [JavaScript grammar example][grammar], `expression` is a supertype that can represent any kind
|
||||
of expression, such as a `binary_expression`, `call_expression`, or `identifier`. You can use supertypes in queries to match
|
||||
any of their subtypes, rather than having to list out each subtype individually. For example, this pattern would match any
|
||||
kind of expression, even though it's not a visible node in the syntax tree:
|
||||
subtypes. For example, in the [JavaScript grammar example][grammar], `expression` is a supertype that can represent any
|
||||
kind of expression, such as a `binary_expression`, `call_expression`, or `identifier`. You can use supertypes in queries
|
||||
to match any of their subtypes, rather than having to list out each subtype individually. For example, this pattern would
|
||||
match any kind of expression, even though it's not a visible node in the syntax tree:
|
||||
|
||||
```query
|
||||
(expression) @any-expression
|
||||
|
|
|
|||
|
|
@ -128,15 +128,15 @@ This pattern would match any builtin variable that is not a local variable, beca
|
|||
|
||||
# Directives
|
||||
|
||||
Similar to predicates, directives are a way to associate arbitrary metadata with a pattern. The only difference between predicates
|
||||
and directives is that directives end in a `!` character instead of `?` character.
|
||||
Similar to predicates, directives are a way to associate arbitrary metadata with a pattern. The only difference between
|
||||
predicates and directives is that directives end in a `!` character instead of `?` character.
|
||||
|
||||
Tree-sitter's CLI supports the following directives by default:
|
||||
|
||||
## The `set!` directive
|
||||
|
||||
This directive allows you to associate key-value pairs with a pattern. The key and value can be any arbitrary text that you
|
||||
see fit.
|
||||
This directive allows you to associate key-value pairs with a pattern. The key and value can be any arbitrary text that
|
||||
you see fit.
|
||||
|
||||
```query
|
||||
((comment) @injection.content
|
||||
|
|
@ -156,8 +156,8 @@ another capture are preserved. It takes two arguments, both of which are capture
|
|||
### The `#strip!` directive
|
||||
|
||||
The `#strip!` directive allows you to remove text from a capture. It takes two arguments: the first is the capture to strip
|
||||
text from, and the second is a regular expression to match against the text. Any text matched by the regular expression will
|
||||
be removed from the text associated with the capture.
|
||||
text from, and the second is a regular expression to match against the text. Any text matched by the regular expression
|
||||
will be removed from the text associated with the capture.
|
||||
|
||||
For an example on the `#select-adjacent!` and `#strip!` directives,
|
||||
view the [code navigation](../../4-code-navigation.md#examples) documentation.
|
||||
|
|
|
|||