tree-sitter/docs/src/5-implementation.md
2024-12-24 20:54:59 -05:00

60 lines
3.5 KiB
Markdown

# Implementation
Tree-sitter consists of two components: a C library (`libtree-sitter`), and a command-line tool (the `tree-sitter` CLI).
The library, `libtree-sitter`, is used in combination with the parsers
generated by the CLI, to produce syntax trees from source code and keep the
syntax trees up-to-date as the source code changes. `libtree-sitter` is designed to be embedded in applications. It is
written in plain C. Its interface is specified in the header file [`tree_sitter/api.h`][api.h].
The CLI is used to generate a parser for a language by supplying a [context-free grammar][cfg] describing the
language. The CLI is a build tool; it is no longer needed once a parser has been generated. It is written in Rust, and is
available on [crates.io][crates], [npm][npm], and as a pre-built binary [on GitHub][gh].
## The CLI
The `tree-sitter` CLI's most important feature is the `generate` command. This subcommand reads in a context-free grammar
from a file called `grammar.js` and outputs a parser as a C file called `parser.c`. The source files in the [`cli/src`][src]
directory all play a role in producing the code in `parser.c`. This section will describe some key parts of this process.
### Parsing a Grammar
First, Tree-sitter must evaluate the JavaScript code in `grammar.js` and convert the grammar to a JSON format. It does this
by shelling out to `node`. The format of the grammars is formally specified by the JSON schema in [grammar.schema.json][schema].
The parsing is implemented in [parse_grammar.rs][parse grammar].
### Grammar Rules
A Tree-sitter grammar is composed of a set of *rules* — objects that describe how syntax nodes can be composed of other
syntax nodes. There are several types of rules: symbols, strings, regexes, sequences, choices, repetitions, and a few others.
Internally, these are all represented using an [enum][enum] called [`Rule`][rules.rs].
### Preparing a Grammar
Once a grammar has been parsed, it must be transformed in several ways before it can be used to generate a parser. Each
transformation is implemented by a separate file in the [`prepare_grammar`][prepare grammar] directory, and the transformations
are ultimately composed together in `prepare_grammar/mod.rs`.
At the end of these transformations, the initial grammar is split into two grammars: a *syntax grammar* and a *lexical grammar*.
The syntax grammar describes how the language's [*non-terminal symbols*][symbols] are constructed from other grammar symbols,
and the lexical grammar describes how the grammar's *terminal symbols* (strings and regexes) can be
composed of individual characters.
### Building Parse Tables
## The Runtime
WIP
[api.h]: https://github.com/tree-sitter/tree-sitter/blob/master/lib/include/tree_sitter/api.h
[cfg]: https://en.wikipedia.org/wiki/Context-free_grammar
[crates]: https://crates.io
[npm]: https://npmjs.com
[gh]: https://github.com/tree-sitter/tree-sitter/releases/latest
[src]: https://github.com/tree-sitter/tree-sitter/tree/master/cli/src
[schema]: https://tree-sitter.github.io/tree-sitter/assets/schemas/grammar.schema.json
[parse grammar]: https://github.com/tree-sitter/tree-sitter/blob/master/cli/generate/src/parse_grammar.rs
[enum]: https://doc.rust-lang.org/book/ch06-01-defining-an-enum.html
[rules.rs]: https://github.com/tree-sitter/tree-sitter/blob/master/cli/generate/src/rules.rs
[prepare grammar]: https://github.com/tree-sitter/tree-sitter/tree/master/cli/generate/src/prepare_grammar
[symbols]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols