An incremental parsing system for programming tools https://tree-sitter.github.io


Phil Turnbull 6897530c47 Check for invalid state indexes	2017-06-07 17:23:44 -04:00

Some ParseActions have a state-id of -1 which can cause an out-of-bounds read when removing duplicate parse states. This was found by AddressSanitizer:

==90699==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6320000187f8 at pc 0x0001071220a9 bp 0x7fff595fd440 sp 0x7fff595fd438
READ of size 8 at 0x6320000187f8 thread T0
    #0 0x1071220a8 in tree_sitter::build_tables::ParseTableBuilder::remove_duplicate_parse_states()::'lambda0'(unsigned long)::operator()(unsigned long) const build_parse_table.cc:398
    #1 0x107121fa5 in void std::__1::__invoke_void_return_wrapper<void>::__call<tree_sitter::build_tables::ParseTableBuilder::remove_duplicate_parse_states()::'lambda0'(unsigned long)&, unsigned long>(tree_sitter::build_tables::ParseTableBuilder::remove_duplicate_parse_states()::'lambda0'(unsigned long)&&&, unsigned long&&) __functional_base:416
    ...

0x6320000187f8 is located 8 bytes to the left of 88264-byte region [0x632000018800,0x63200002e0c8)
allocated by thread T0 here:
    #0 0x107b1576b in wrap__Znwm (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x6076b)
    #1 0x10711da2c in std::__1::vector<unsigned long, std::__1::allocator<unsigned long> >::allocate(unsigned long) new:169
    #2 0x10711d8fb in std::__1::vector<unsigned long, std::__1::allocator<unsigned long> >::vector(unsigned long) vector:1074
    #3 0x107112f5c in std::__1::vector<unsigned long, std::__1::allocator<unsigned long> >::vector(unsigned long) vector:1068
    #4 0x1070af381 in tree_sitter::build_tables::ParseTableBuilder::remove_duplicate_parse_states() build_parse_table.cc:378
    #5 0x10709d827 in tree_sitter::build_tables::ParseTableBuilder::build() build_parse_table.cc:85
    ...

SUMMARY: AddressSanitizer: heap-buffer-overflow build_parse_table.cc:398 in tree_sitter::build_tables::ParseTableBuilder::remove_duplicate_parse_states()::'lambda0'(unsigned long)::operator()(unsigned long) const
Shadow bytes around the buggy address:
  0x1c64000030a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c64000030b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c64000030c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c64000030d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c64000030e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x1c64000030f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
  0x1c6400003100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c6400003110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c6400003120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c6400003130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c6400003140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
doc	Update grammar JSON schema	2017-03-17 16:41:30 -07:00
externals	Revert "Bump bandit"	2016-11-30 18:28:04 -06:00
include/tree_sitter	Add an option to immediately halt on syntax error	2017-05-01 13:50:49 -07:00
script	Use new version of python grammar in tests	2017-03-17 17:05:02 -07:00
src	Check for invalid state indexes	2017-06-07 17:23:44 -04:00
test	Fix parsing of valid code with halt_on_error flag set	2017-05-01 14:25:25 -07:00
.clang-format	Auto-format: no single-line functions	2015-07-31 16:32:24 -07:00
.clang_complete	Implement Rule as a union rather than an abstract base class	2017-03-17 13:29:31 -07:00
.gitignore	Clean up gitignore	2017-03-09 20:49:11 -08:00
.gitmodules	Add plain C API for compiling a JSON grammar	2016-01-10 13:44:22 -08:00
.travis.yml	Remove google perftools dependency	2016-11-30 17:23:44 -08:00
LICENSE	Add MIT license	2017-05-03 10:32:12 -07:00
project.gyp	Remove RTTI flag in gyp file	2017-03-17 13:31:35 -07:00
README.md	Update language function name in README	2017-01-31 11:41:46 -08:00
tests.gyp	Implement Rule as a union rather than an abstract base class	2017-03-17 13:29:31 -07:00

README.md

tree-sitter

Tree-sitter is a C library for incremental parsing, intended to be used via bindings to higher-level languages. It can be used to build a concrete syntax tree for a program and efficiently update the syntax tree as the program is edited. This makes it suitable for use in text-editing programs.

Tree-sitter uses a sentential-form incremental LR parsing algorithm, as described in the paper Efficient and Flexible Incremental Parsing by Tim Wagner. It handles ambiguity at compile-time via precedence annotations, and at run-time via the GLR algorithm. This allows it to generate a fast parser for any context-free grammar.

Installation

script/configure # Generate a Makefile
make             # Build static libraries for the compiler and runtime

Overview

Tree-sitter consists of two libraries. The first library, libcompiler, can be used to generate a parser for a language by supplying a context-free grammar describing the language. Once the parser has been generated, libcompiler is no longer needed.

The second library, libruntime, is used together with the parsers generated by libcompiler to produce syntax trees from text documents and to keep those trees up to date as the documents change.

Writing a grammar

Tree-sitter's grammars are specified as JSON strings. This format allows them to be easily created and manipulated in high-level languages like JavaScript. The structure of a grammar is formally specified by this JSON schema. You can generate a parser for a grammar using the ts_compile_grammar function provided by libcompiler.

Here's a simple example of using ts_compile_grammar to create a parser for basic arithmetic expressions. It uses a C++11 raw string literal so that the grammar can span multiple lines without escaping.

// arithmetic_grammar.cc

#include <stdio.h>
#include "tree_sitter/compiler.h"

int main() {
  TSCompileResult result = ts_compile_grammar(R"JSON(
    {
      "name": "arithmetic",

      // Things that can appear anywhere in the language, like comments
      // and whitespace, are expressed as 'extras'.
      "extras": [
        {"type": "PATTERN", "value": "\\s"},
        {"type": "SYMBOL", "name": "comment"}
      ],

      "rules": {

        // The first rule listed in the grammar becomes the 'start rule'.
        "expression": {
          "type": "CHOICE",
          "members": [
            {"type": "SYMBOL", "name": "sum"},
            {"type": "SYMBOL", "name": "product"},
            {"type": "SYMBOL", "name": "number"},
            {"type": "SYMBOL", "name": "variable"},
            {
              "type": "SEQ",
              "members": [
                {"type": "STRING", "value": "("},
                {"type": "SYMBOL", "name": "expression"},
                {"type": "STRING", "value": ")"}
              ]
            }
          ]
        },

        // Tokens like '+' and '*' are described directly within the
        // grammar's rules, as opposed to in a separate lexer description.
        "sum": {
          "type": "PREC_LEFT",
          "value": 1,
          "content": {
            "type": "SEQ",
            "members": [
              {"type": "SYMBOL", "name": "expression"},
              {"type": "STRING", "value": "+"},
              {"type": "SYMBOL", "name": "expression"}
            ]
          }
        },

        // Ambiguities can be resolved at compile time by assigning precedence
        // values to rule subtrees.
        "product": {
          "type": "PREC_LEFT",
          "value": 2,
          "content": {
            "type": "SEQ",
            "members": [
              {"type": "SYMBOL", "name": "expression"},
              {"type": "STRING", "value": "*"},
              {"type": "SYMBOL", "name": "expression"}
            ]
          }
        },

        // Tokens can be specified using ECMAScript regexps.
        "number": {"type": "PATTERN", "value": "\\d+"},
        "comment": {"type": "PATTERN", "value": "#.*"},
        "variable": {"type": "PATTERN", "value": "[a-zA-Z]\\w*"}
      }
    }
  )JSON");

  if (result.error_type != TSCompileErrorTypeNone) {
    fprintf(stderr, "Compilation failed: %s\n", result.error_message);
    return 1;
  }

  puts(result.code);

  return 0;
}

To create the parser, compile this file like this:

clang++ -std=c++11 \
  -I tree-sitter/include \
  -L tree-sitter/out/Release \
  -l compiler \
  arithmetic_grammar.cc \
  -o arithmetic_grammar

Then run the executable to print out the C code for the parser:

./arithmetic_grammar > arithmetic_parser.c

Using the parser

Providing the text to parse

Text input is provided to a tree-sitter parser via a TSInput struct, which contains function pointers for seeking to positions in the text and for reading chunks of text. The text can be encoded in either UTF-8 or UTF-16. This interface allows you to efficiently parse text that is stored in your own data structure.

Querying the syntax tree

The libruntime API provides a DOM-style interface for inspecting syntax trees. Functions like ts_node_child(node, index) and ts_node_next_sibling(node) expose every node in the concrete syntax tree. This is useful for operations like syntax-highlighting, which operate on a token-by-token basis. You can also traverse the tree in a more abstract way by using functions like ts_node_named_child(node, index) and ts_node_next_named_sibling(node). These functions don't expose nodes that were specified in the grammar as anonymous tokens, like ( and +. This is useful when analyzing the meaning of a document.

// test_parser.c

#include <assert.h>
#include <string.h>
#include <stdio.h>
#include "tree_sitter/runtime.h"

// Declare the language function that was generated from your grammar.
TSLanguage *tree_sitter_arithmetic();

int main() {
  TSDocument *document = ts_document_new();
  ts_document_set_language(document, tree_sitter_arithmetic());
  ts_document_set_input_string(document, "a + b * 5");
  ts_document_parse(document);

  TSNode root_node = ts_document_root_node(document);
  assert(!strcmp(ts_node_type(root_node, document), "expression"));
  assert(ts_node_named_child_count(root_node) == 1);

  TSNode sum_node = ts_node_named_child(root_node, 0);
  assert(!strcmp(ts_node_type(sum_node, document), "sum"));
  assert(ts_node_named_child_count(sum_node) == 2);

  TSNode product_node = ts_node_child(ts_node_named_child(sum_node, 1), 0);
  assert(!strcmp(ts_node_type(product_node, document), "product"));
  assert(ts_node_named_child_count(product_node) == 2);

  printf("Syntax tree: %s\n", ts_node_string(root_node, document));
  ts_document_free(document);
  return 0;
}

To demo this parser's capabilities, compile this program like this:

clang \
  -I tree-sitter/include \
  -L tree-sitter/out/Debug \
  -l runtime \
  arithmetic_parser.c test_parser.c \
  -o test_parser

./test_parser