# tree-sitter

[Build status](https://travis-ci.org/maxbrunsfeld/tree-sitter)

Tree-sitter is an incremental parsing library in C and C++, intended to be used via [bindings](https://github.com/maxbrunsfeld/node-tree-sitter) to higher-level
languages. It allows documents to be efficiently re-parsed after localized
edits, making it suitable for use in performance-intensive text-editing programs.

Tree-sitter uses a sentential-form incremental [LR parsing](https://en.wikipedia.org/wiki/LR_parser)
algorithm, as described in the paper *[Efficient and Flexible Incremental Parsing](http://harmonia.cs.berkeley.edu/papers/twagner-parsing.ps.gz)*
by Tim Wagner. It handles ambiguity at compile-time via [precedence annotations](https://en.wikipedia.org/wiki/Operator-precedence_parser),
and at run-time via the [GLR algorithm](https://en.wikipedia.org/wiki/GLR_parser).
This allows it to generate a fast parser for any context-free grammar.

### Installation

```sh
script/configure.sh    # Generate a Makefile using gyp
make                   # Build static libraries for the compiler and runtime
```

### Writing a grammar

Tree-sitter's interface for creating grammars is a C++ library, `libcompiler`.
This allows grammars and rules to be defined, manipulated, and
extended as simple values in high-level languages like [JavaScript](https://github.com/maxbrunsfeld/node-tree-sitter-compiler),
and then converted into tree-sitter's native representation and compiled to C
parsers. These parsers can then be used from any language that has a binding to
tree-sitter's runtime library, `libruntime`.

Here's a simple example that uses `libcompiler` directly:

```cpp
// arithmetic_grammar.cc

#include <assert.h>
#include <stdio.h>
#include <string>
#include "tree_sitter/compiler.h"

using namespace tree_sitter;

int main() {
  auto arithmetic_grammar = Grammar({

    // The first rule listed in a grammar becomes the 'start rule'.
    { "expression", choice({
      sym("sum"),
      sym("product"),
      sym("number"),
      sym("variable"),

      // Error recovery is controlled by wrapping rule subtrees with `err`.
      seq({
        str("("),
        err(sym("expression")),
        str(")") }) }) },

    // Tokens like '+' and '*' are described directly within the grammar's rules,
    // as opposed to in a separate lexer description.
    { "sum", prec_left(1, seq({
      sym("expression"),
      str("+"),
      sym("expression") })) },

    // Ambiguities can be resolved at compile time by assigning precedence
    // values to rule subtrees.
    { "product", prec_left(2, seq({
      sym("expression"),
      str("*"),
      sym("expression") })) },

    // Tokens can be specified using ECMAScript regexps.
    { "number", pattern("\\d+") },
    { "variable", pattern("[a-zA-Z]+\\w*") },
    { "comment", pattern("//.*") },

  }).extra_tokens({

    // Things that can appear anywhere in the language are expressed as
    // 'extra tokens'.
    sym("comment"),
    pattern("\\s+")
  });

  // Generate C code for parsing this language.
  auto output = compile(arithmetic_grammar, "arithmetic");
  std::string c_code = output.first;
  const GrammarError *error = output.second;

  assert(!error);
  puts(c_code.c_str());

  return 0;
}
```

To create a parser for this language, compile the grammar program and run it, redirecting its output to a C file:

```sh
clang++ -stdlib=libc++ -std=c++11 \
  -I tree-sitter/include -L tree-sitter/out/Debug -l compiler \
  arithmetic_grammar.cc -o arithmetic_grammar

./arithmetic_grammar > arithmetic_parser.c
```

### Using the parser

The `tree_sitter/runtime` C library exposes a DOM-style interface for inspecting
documents.

Functions like `ts_node_child(node, index)` and `ts_node_next_sibling(node)`
expose every node in the concrete syntax tree. This is useful for operations
such as syntax highlighting, which work on a token-by-token basis. You can also
traverse the tree in a more abstract way by using functions like
`ts_node_named_child(node, index)` and `ts_node_next_named_sibling(node)`. These
functions skip nodes that were specified in the grammar as anonymous
tokens, like `(` and `+`. This is useful when analyzing the meaning of a document.

```c
#include <stdio.h>
#include "tree_sitter/runtime.h"

// Declare the language constructor that was generated from your grammar.
TSLanguage *ts_language_arithmetic();

int main() {
  TSDocument *document = ts_document_make();
  ts_document_set_language(document, ts_language_arithmetic());

  // Usually, you would use the more general `ts_document_set_input`, which
  // takes a struct with function pointers for seeking to positions in the text,
  // and reading chunks of text. This allows you to efficiently parse text
  // stored in your own data structure.
  ts_document_set_input_string(document, "a + b * 5");
  ts_document_parse(document);

  TSNode root_node = ts_document_root_node(document);
  printf(
    "Root name: %s, start: %lu, end: %lu\n",
    ts_node_name(root_node, document),
    ts_node_start_char(root_node),
    ts_node_end_char(root_node)
  );

  TSNode product_node = ts_node_named_child(ts_node_child(root_node, 0), 1);
  printf(
    "Child name: %s, start: %lu, end: %lu\n",
    ts_node_name(product_node, document),
    ts_node_start_char(product_node),
    ts_node_end_char(product_node)
  );

  ts_document_free(document);
  return 0;
}
```

To try out the parser, save the tree-sitter program from this section as `test_parser.c`, then compile it together with the generated `arithmetic_parser.c` and run it:

```sh
clang \
  -I tree-sitter/include -L tree-sitter/out/Debug -l runtime \
  arithmetic_parser.c test_parser.c -o test_parser

./test_parser
```

### References

- [Context Aware Scanning for Parsing Extensible Languages](http://www.umsec.umn.edu/publications/Context-Aware-Scanning-Parsing-Extensible)
- [LR(1) Parser Generation System](http://arxiv.org/pdf/1010.1234.pdf) (Error recovery scheme)
- [Efficient and Flexible Incremental Parsing](http://harmonia.cs.berkeley.edu/papers/twagner-parsing.ps.gz)
- [Incremental Analysis of Real Programming Languages](http://harmonia.cs.berkeley.edu/papers/twagner-glr.pdf)