---
title: Implementation
permalink: implementation
---

# Implementation

Tree-sitter consists of two components: a C library (`libtree-sitter`), and a command-line tool (the `tree-sitter` CLI).

The library, `libtree-sitter`, is used in combination with the parsers
generated by the CLI, to produce syntax trees from source code and keep the
syntax trees up-to-date as the source code changes. `libtree-sitter` is designed to be embedded in applications. It is written in plain C. Its interface is specified in the header file [`tree_sitter/api.h`](https://github.com/tree-sitter/tree-sitter/blob/master/lib/include/tree_sitter/api.h).

The CLI is
used to generate a parser for a language by supplying a [context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar) describing the
language. The CLI is a build tool; it is no longer needed once a parser has been generated. It is written in Rust, and is available on [crates.io](https://crates.io), [npm](https://npmjs.com), and as a pre-built binary [on GitHub](https://github.com/tree-sitter/tree-sitter/releases/latest).

## The CLI

The `tree-sitter` CLI's most important feature is the `generate` subcommand. This subcommand reads a context-free grammar from a file called `grammar.js` and outputs a parser as a C file called `parser.c`. The source files in the [`cli/src`](https://github.com/tree-sitter/tree-sitter/tree/master/cli/src) directory all play a role in producing the code in `parser.c`. This section describes some key parts of this process.

### Parsing a Grammar

First, Tree-sitter must evaluate the JavaScript code in `grammar.js` and convert the grammar to a JSON format. It does this by shelling out to `node`. The format of the grammars is formally specified by the JSON schema in [grammar.schema.json](https://tree-sitter.github.io/tree-sitter/assets/schemas/grammar.schema.json). The parsing is implemented in [parse_grammar.rs](https://github.com/tree-sitter/tree-sitter/blob/master/cli/generate/src/parse_grammar.rs).

### Grammar Rules

A Tree-sitter grammar is composed of a set of *rules*: objects that describe how syntax nodes can be composed from other syntax nodes. There are several types of rules: symbols, strings, regexes, sequences, choices, repetitions, and a few others. Internally, these are all represented using an [enum](https://doc.rust-lang.org/book/ch06-01-defining-an-enum.html) called [`Rule`](https://github.com/tree-sitter/tree-sitter/blob/master/cli/generate/src/rules.rs).
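
To make the idea of rules as a recursive enum more concrete, here is a minimal sketch. It is not the real `Rule` definition from `rules.rs`: the variant names, the `referenced_symbols` helper, and `example_rule` are all hypothetical, and the "few others" rule types are left out. It only illustrates how symbols, strings, regexes, sequences, choices, and repetitions can nest inside one another.

```rust
/// Simplified, hypothetical stand-in for the generator's `Rule` enum.
/// The real enum has more variants and different payloads.
#[derive(Debug, Clone, PartialEq)]
enum Rule {
    /// Reference to another named rule in the grammar.
    Symbol(String),
    /// Literal text such as "(" or ",".
    Literal(String),
    /// Regular expression source such as "[a-z]+".
    Pattern(String),
    /// All members must match, in order.
    Seq(Vec<Rule>),
    /// Exactly one of the members must match.
    Choice(Vec<Rule>),
    /// The inner rule may match repeatedly.
    Repeat(Box<Rule>),
}

impl Rule {
    /// Appends the name of every rule referenced by `self` to `out`,
    /// visiting the tree depth-first and skipping names already present.
    fn referenced_symbols(&self, out: &mut Vec<String>) {
        match self {
            Rule::Symbol(name) => {
                if !out.contains(name) {
                    out.push(name.clone());
                }
            }
            Rule::Literal(_) | Rule::Pattern(_) => {}
            Rule::Seq(members) | Rule::Choice(members) => {
                for member in members {
                    member.referenced_symbols(out);
                }
            }
            Rule::Repeat(inner) => inner.referenced_symbols(out),
        }
    }
}

/// Hypothetical rule: either an identifier, or a parenthesized,
/// comma-separated list of `expression` nodes.
fn example_rule() -> Rule {
    Rule::Choice(vec![
        Rule::Pattern("[a-z]+".to_string()),
        Rule::Seq(vec![
            Rule::Literal("(".to_string()),
            Rule::Symbol("expression".to_string()),
            Rule::Repeat(Box::new(Rule::Seq(vec![
                Rule::Literal(",".to_string()),
                Rule::Symbol("expression".to_string()),
            ]))),
            Rule::Literal(")".to_string()),
        ]),
    ])
}
```

Calling `referenced_symbols` on `example_rule()` collects the single name `expression`, because both occurrences refer to the same rule. A tree-shaped enum like this is convenient for the later transformation passes, since each pass can be written as one recursive `match`.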

### Preparing a Grammar

Once a grammar has been parsed, it must be transformed in several ways before it can be used to generate a parser. Each transformation is implemented by a separate file in the [`prepare_grammar`](https://github.com/tree-sitter/tree-sitter/tree/master/cli/generate/src/prepare_grammar) directory, and the transformations are ultimately composed together in `prepare_grammar/mod.rs`.

At the end of these transformations, the initial grammar is split into two grammars: a *syntax grammar* and a *lexical grammar*. The syntax grammar describes how the language's [*non-terminal symbols*](https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols) are constructed from other grammar symbols, and the lexical grammar describes how the grammar's *terminal symbols* (strings and regexes) can be composed from individual characters.
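
The split into a syntax grammar and a lexical grammar can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the code in `prepare_grammar`: the `Rule` enum, the `Terminal` variant, and the `extract_tokens`, `split_grammar`, and `example_grammar` functions are hypothetical names, and the real transformations handle many more cases. The sketch walks each rule, moves every string and regex leaf into a list of tokens, and leaves an index into that list in its place.

```rust
/// Simplified, hypothetical rule type (not the real `Rule` enum).
#[derive(Debug, Clone, PartialEq)]
enum Rule {
    Symbol(String),   // reference to a named (non-terminal) rule
    Literal(String),  // literal text
    Pattern(String),  // regular expression source
    Terminal(usize),  // index into the lexical grammar, created by the split
    Seq(Vec<Rule>),
    Choice(Vec<Rule>),
    Repeat(Box<Rule>),
}

/// Returns a copy of `rule` in which every string or regex leaf is replaced
/// by `Terminal(i)`, where `i` is that token's position in `tokens`.
/// Identical tokens are stored only once.
fn extract_tokens(rule: &Rule, tokens: &mut Vec<Rule>) -> Rule {
    match rule {
        Rule::Literal(_) | Rule::Pattern(_) => {
            let existing = tokens.iter().position(|token| token == rule);
            let index = existing.unwrap_or_else(|| {
                tokens.push(rule.clone());
                tokens.len() - 1
            });
            Rule::Terminal(index)
        }
        Rule::Seq(members) => {
            Rule::Seq(members.iter().map(|m| extract_tokens(m, tokens)).collect())
        }
        Rule::Choice(members) => {
            Rule::Choice(members.iter().map(|m| extract_tokens(m, tokens)).collect())
        }
        Rule::Repeat(inner) => Rule::Repeat(Box::new(extract_tokens(inner, tokens))),
        Rule::Symbol(_) | Rule::Terminal(_) => rule.clone(),
    }
}

/// Splits named rules into a syntax grammar (rules over symbols and
/// terminals) and a lexical grammar (the list of extracted tokens).
fn split_grammar(rules: &[(String, Rule)]) -> (Vec<(String, Rule)>, Vec<Rule>) {
    let mut tokens = Vec::new();
    let syntax: Vec<(String, Rule)> = rules
        .iter()
        .map(|(name, rule)| (name.clone(), extract_tokens(rule, &mut tokens)))
        .collect();
    (syntax, tokens)
}

/// Hypothetical two-rule grammar: `sum` is `number "+" number`,
/// and `number` is a run of digits.
fn example_grammar() -> Vec<(String, Rule)> {
    vec![
        (
            "sum".to_string(),
            Rule::Seq(vec![
                Rule::Symbol("number".to_string()),
                Rule::Literal("+".to_string()),
                Rule::Symbol("number".to_string()),
            ]),
        ),
        ("number".to_string(), Rule::Pattern("\\d+".to_string())),
    ]
}
```

For `example_grammar()`, the lexical grammar ends up holding the `"+"` literal and the `\d+` pattern, while the `sum` rule in the syntax grammar refers to them only by index. Keeping the two grammars separate in this way lets the character-level matching be handled independently of the rules that combine symbols.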

### Building Parse Tables

## The Runtime

WIP