An incremental parsing system for programming tools https://tree-sitter.github.io

c incremental parser parsing rust tree-sitter wasm

Find a file

Max Brunsfeld 4b04afac5e Control lexer's error-mode via explicit boolean argument Previously, the lexer would operate in error-mode (ignoring any garbage input until it found a valid token) if it was invoked in the 'error' state. Now that the error state is deduped with other lexical states, the lexer might be invoked in that state even when error-mode is not intended. This adds a third argument to `ts_lex` that explicitly sets the error-mode. This bug was unlikely to occur in any real grammars, but it caused the node-tree-sitter-compiler test suite to fail for some grammars with only one rule.		2015-12-30 09:43:12 -08:00
externals	Replace vendored copy of utf8proc w/ submodule reference	2015-12-27 13:26:56 -08:00
include/tree_sitter	Control lexer's error-mode via explicit boolean argument	2015-12-30 09:43:12 -08:00
script	In specs, seed random number generator with clock time by default	2015-12-10 21:01:27 -08:00
spec	Control lexer's error-mode via explicit boolean argument	2015-12-30 09:43:12 -08:00
src	Control lexer's error-mode via explicit boolean argument	2015-12-30 09:43:12 -08:00
.clang-format	Auto-format: no single-line functions	2015-07-31 16:32:24 -07:00
.gitattributes	Don't count generated parsers in github language stats	2015-11-10 16:56:53 -08:00
.gitignore	Add vim YCM config file	2014-06-23 12:47:12 -07:00
.gitmodules	Replace vendored copy of utf8proc w/ submodule reference	2015-12-27 13:26:56 -08:00
.travis.yml	Use docker infrastructure on travis-ci	2015-04-17 11:27:40 -07:00
project.gyp	Support UTF16 directly	2015-12-28 13:53:22 -08:00
README.md	Rename ubiquitous_tokens -> extra_tokens in compiler API	2015-12-17 15:50:50 -08:00
tests.gyp	Allow building specs w/ Release configuration	2015-10-27 09:58:27 -07:00
todo.md	Remove completed todo items	2015-10-06 16:40:37 -07:00

README.md

tree-sitter

Tree-sitter is an incremental parsing library in C and C++, intended to be used via bindings to higher-level languages. It allows documents to be efficiently re-parsed after localized edits, making it suitable for use in performance-intensive text-editing programs.

Tree-sitter uses a sentential-form incremental LR parsing algorithm, as described in the paper Efficient and Flexible Incremental Parsing by Tim Wagner. It handles ambiguity at compile-time via precedence annotations, and at run-time via the GLR algorithm. This allows it to generate a fast parser for any context-free grammar.

Installation

script/configure.sh  # Generate a Makefile using gyp
make                 # Build static libraries for the compiler and runtime

Writing a grammar

Tree-sitter's interface for creating grammars is a C++ library, libcompiler. This allows grammars and rules to be defined, manipulated and extended as simple values in high-level languages like javascript, and then converted into tree-sitter's native representation and compiled to C parsers. These parsers can then be used from any language that has a binding to tree-sitter's runtime library, libruntime.

Here's a simple example that uses libcompiler directly:

// arithmetic_grammar.cc

#include <assert.h>
#include <stdio.h>
#include "tree_sitter/compiler.h"

using namespace tree_sitter;

int main() {
  auto arithmetic_grammar = Grammar({

    // The first rule listed in a grammar becomes the 'start rule'.
    { "expression", choice({
      sym("sum"),
      sym("product"),
      sym("number"),
      sym("variable"),

      // Error recovery is controlled by wrapping rule subtrees with `err`.
      seq({
        str("("),
        err(sym("expression")),
        str(")") }) }) },

    // Tokens like '+' and '*' are described directly within the grammar's rules,
    // as opposed to in a seperate lexer description.
    { "sum", prec_left(1, seq({
      sym("expression"),
      str("+"),
      sym("expression") })) },

    // Ambiguities can be resolved at compile time by assigning precedence
    // values to rule subtrees.
    { "product", prec_left(2, seq({
      sym("expression"),
      str("*"),
      sym("expression") })) },

    // Tokens can be specified using ECMAScript regexps.
    { "number", pattern("\\d+") },
    { "variable", pattern("[a-zA-Z]+\\w*") },
    { "comment", pattern("//.*") },

  }).extra_tokens({

    // Things that can appear anywhere in the language are expressed as
    // 'extra tokens'.
    sym("comment"),
    pattern("\\s+")
  });

  // Generate C code for parsing this language.
  auto output = compile(arithmetic_grammar, "arithmetic");
  std::string c_code = output.first;
  const GrammarError *error = output.second;

  assert(!error);
  puts(c_code.c_str());

  return 0;
}

To create a parser for this language, compile and run this grammar like this:

clang++ -stdlib=libc++ -std=c++11                             \
  -I tree-sitter/include -L tree-sitter/out/Debug -l compiler \
  arithmetic_grammar.cc -o arithmetic_grammar

./arithmetic_grammar > arithmetic_parser.c

Using the parser

The tree_sitter/runtime C library exposes a DOM-style interface for inspecting documents.

Functions like ts_node_child(node, index) and ts_node_next_sibling(node) expose every node in the concrete syntax tree. This is useful for operations like syntax-highlighting, that operate on a token-by-token basis. You can also traverse the tree in a more abstract way by using functions like ts_node_named_child(node, index) and ts_node_next_named_sibling(node). These functions don't expose nodes that were specified in the grammar as anonymous tokens, like ( and +. This is useful when analyzing the meaning of a document.

#include <stdio.h>
#include "tree_sitter/runtime.h"

// Declare the language constructor that was generated from your grammar.
TSLanguage *ts_language_arithmetic();

int main() {
  TSDocument *document = ts_document_make();
  ts_document_set_language(document, ts_language_arithmetic());

  // Usually, you would use the more general `ts_document_set_input`, which
  // takes a struct with function pointers for seeking to positions in the text,
  // and reading chunks of text. This allows you to efficiently parse text
  // stored in your own data structure.
  ts_document_set_input_string(document, "a + b * 5");
  ts_document_parse(document);

  TSNode root_node = ts_document_root_node(document);
  printf(
    "Root name: %s, start: %lu, end: %lu\n",
    ts_node_name(root_node, document),
    ts_node_start_char(root_node),
    ts_node_end_char(root_node)
  );

  TSNode product_node = ts_node_named_child(ts_node_child(root_node, 0), 1);
  printf(
    "Child name: %s, start: %lu, end: %lu\n",
    ts_node_name(product_node, document),
    ts_node_start_char(product_node),
    ts_node_end_char(product_node)
  );

  ts_document_free(document);
  return 0;
}

To demo this parser's capabilities, compile this program like this:

clang                                                        \
  -I tree-sitter/include -L tree-sitter/out/Debug -l runtime \
  arithmetic_parser.c test_parser.c -o test_parser

./test_parser

README.md

tree-sitter

Installation

Writing a grammar

Using the parser

References