tree-sitter/docs/section-2-using-parsers.md
2018-08-13 18:04:10 -07:00


Using Parsers

All of Tree-sitter's parsing functionality is exposed through C APIs. Applications written in higher-level languages can use Tree-sitter via binding libraries like node-tree-sitter or rust-tree-sitter, which have their own documentation.

This document describes the general concepts of how to use Tree-sitter, which should be relevant regardless of what language you're using. It also goes into some C-specific details that are useful if you're using the C API directly or are building a new binding to a different language.

The Object Model

There are four main types of objects involved when using Tree-sitter: languages, parsers, syntax trees, and syntax nodes. In C, these are called TSLanguage, TSParser, TSTree, and TSNode.

  • A TSLanguage is an opaque object that defines how to parse a particular programming language. The code for each TSLanguage is generated by Tree-sitter. Many languages are already available in separate git repositories within the Tree-sitter GitHub organization. See the next section for how to create new languages.
  • A TSParser is a stateful object that can be assigned a TSLanguage and used to produce a TSTree based on some source code.
  • A TSTree represents the syntax tree of an entire source code file. It contains TSNode instances that indicate the structure of the source code. It can also be edited and used to produce a new TSTree in the event that the source code changes.
  • A TSNode represents a single node in the syntax tree. It tracks its start and end positions in the source code, as well as its relation to other nodes like its parent, siblings and children.

An Example Program

Here's an example of a simple C program that uses the Tree-sitter JSON parser.

// Filename - test-json-parser.c

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>  // for free()
#include <string.h>
#include "tree_sitter/runtime.h"

// Declare the `tree_sitter_json` function, which is
// implemented by the `tree-sitter-json` library.
const TSLanguage *tree_sitter_json(void);

int main() {
  // Create a parser.
  TSParser *parser = ts_parser_new();

  // Set the parser's language (JSON in this case).
  ts_parser_set_language(parser, tree_sitter_json());

  // Build a syntax tree based on source code stored in a string.
  const char *source_code = "[1, null]";
  TSTree *tree = ts_parser_parse_string(
    parser,
    NULL,
    source_code,
    strlen(source_code)
  );

  // Get the root node of the syntax tree.
  TSNode root_node = ts_tree_root_node(tree);

  // Get some child nodes.
  TSNode array_node = ts_node_named_child(root_node, 0);
  TSNode number_node = ts_node_named_child(array_node, 0);

  // Check that the nodes have the expected types.
  assert(strcmp(ts_node_type(root_node), "value") == 0);
  assert(strcmp(ts_node_type(array_node), "array") == 0);
  assert(strcmp(ts_node_type(number_node), "number") == 0);

  // Check that the nodes have the expected child counts.
  assert(ts_node_child_count(root_node) == 1);
  // The array has five children: `[`, the number, `,`, the null, and `]`.
  assert(ts_node_child_count(array_node) == 5);
  assert(ts_node_named_child_count(array_node) == 2);
  assert(ts_node_child_count(number_node) == 0);

  // Print the syntax tree as an S-expression.
  char *string = ts_node_string(root_node);
  printf("Syntax tree: %s\n", string);

  // Free all of the heap-allocated memory.
  free(string);
  ts_tree_delete(tree);
  ts_parser_delete(parser);
  return 0;
}

This program uses the Tree-sitter C API, which is declared in the header file tree_sitter/runtime.h, so we need to add the tree-sitter/include directory to the include path. We also need to link libruntime.a into the binary. Finally, we compile the source code of the JSON language (tree-sitter-json/src/parser.c) directly into the binary.

clang                                   \
  -I tree-sitter/include                \
  test-json-parser.c                    \
  tree-sitter-json/src/parser.c         \
  tree-sitter/out/Release/libruntime.a  \
  -o test-json-parser

./test-json-parser

Providing the text to parse

Text input is provided to a Tree-sitter parser via a TSInput struct, which specifies a function pointer for reading chunks of text. The text can be encoded in either UTF-8 or UTF-16. This interface allows you to efficiently parse text that is stored in your own data structure, without first copying it into one contiguous string.
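To make the idea of a chunk-reading callback concrete, here is a minimal, self-contained sketch. It does not use the real TSInput declaration: `SketchInput`, `ChunkedText`, `chunked_read`, and `read_all` are hypothetical names invented for this illustration, and the callback signature is a simplified assumption. Check the header for the actual fields and parameters before adapting it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for text stored in pieces (for example, an
// editor's rope or gap buffer). Each chunk is a NUL-terminated string.
typedef struct {
  const char **chunks;
  size_t chunk_count;
} ChunkedText;

// Illustrative mirror of the idea described above: a payload plus a
// function pointer for reading chunks. This is NOT the real TSInput.
typedef struct {
  void *payload;
  const char *(*read)(void *payload, uint32_t byte_index, uint32_t *bytes_read);
} SketchInput;

// Return a pointer to the text starting at `byte_index`, running to the
// end of the chunk that contains it. Sets `*bytes_read` to 0 once
// `byte_index` is past the end of the document.
static const char *chunked_read(void *payload, uint32_t byte_index, uint32_t *bytes_read) {
  ChunkedText *text = payload;
  uint32_t chunk_start = 0;
  for (size_t i = 0; i < text->chunk_count; i++) {
    uint32_t length = (uint32_t)strlen(text->chunks[i]);
    if (byte_index < chunk_start + length) {
      uint32_t offset = byte_index - chunk_start;
      *bytes_read = length - offset;
      return text->chunks[i] + offset;
    }
    chunk_start += length;
  }
  *bytes_read = 0;
  return "";
}

// Drive the callback the way a consumer might: keep requesting text at
// the current offset until nothing is left (or `out` is full).
// Returns the number of bytes copied; `out` is always NUL-terminated.
static size_t read_all(SketchInput input, char *out, size_t capacity) {
  assert(capacity > 0);
  size_t total = 0;
  for (;;) {
    uint32_t n = 0;
    const char *piece = input.read(input.payload, (uint32_t)total, &n);
    if (n == 0 || total + n >= capacity) break;
    memcpy(out + total, piece, n);
    total += n;
  }
  out[total] = '\0';
  return total;
}
```

With a `ChunkedText` holding the three chunks `"[1, "`, `"nu"`, and `"ll]"`, `read_all` reassembles `[1, null]` even though the token `null` straddles a chunk boundary. The point of the design is that the consumer only ever asks "give me the text at this offset", so the storage layout stays private to the application.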

Querying the syntax tree

Tree-sitter provides a DOM-style interface for inspecting syntax trees. Functions like ts_node_child(node, index) and ts_node_next_sibling(node) expose every node in the concrete syntax tree. This is useful for operations like syntax highlighting, which operate on a token-by-token basis. You can also traverse the tree in a more abstract way by using functions like ts_node_named_child(node, index) and ts_node_next_named_sibling(node). These functions skip nodes that were specified in the grammar as anonymous tokens, like : and {. This is useful when analyzing the meaning of a document.
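The difference between the two traversal styles can be sketched with a toy model. The code below is not Tree-sitter's implementation and does not use TSNode; `ToyNode` and the `toy_*` functions are hypothetical names that only mimic the behavior described above, assuming each node carries a flag saying whether it is named.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Toy node type for illustration only; it is not Tree-sitter's TSNode.
typedef struct ToyNode {
  const char *type;
  bool is_named;                   // false for anonymous tokens like "[" or ","
  const struct ToyNode *children;
  uint32_t child_count;
} ToyNode;

// Concrete view: every child is visible, including anonymous tokens.
static const ToyNode *toy_child(const ToyNode *node, uint32_t index) {
  return index < node->child_count ? &node->children[index] : NULL;
}

// Abstract view: count only the named children.
static uint32_t toy_named_child_count(const ToyNode *node) {
  uint32_t count = 0;
  for (uint32_t i = 0; i < node->child_count; i++) {
    if (node->children[i].is_named) count++;
  }
  return count;
}

// Abstract view: return the index-th named child, skipping anonymous tokens.
static const ToyNode *toy_named_child(const ToyNode *node, uint32_t index) {
  for (uint32_t i = 0; i < node->child_count; i++) {
    if (!node->children[i].is_named) continue;
    if (index == 0) return &node->children[i];
    index--;
  }
  return NULL;
}

// A hand-built model of the array in the JSON text `[1, null]`.
static const ToyNode ARRAY_CHILDREN[] = {
  {"[",      false, NULL, 0},
  {"number", true,  NULL, 0},
  {",",      false, NULL, 0},
  {"null",   true,  NULL, 0},
  {"]",      false, NULL, 0},
};
static const ToyNode ARRAY_NODE = {"array", true, ARRAY_CHILDREN, 5};
```

In this model, `toy_child(&ARRAY_NODE, 2)` is the `,` token, while `toy_named_child(&ARRAY_NODE, 1)` is the `null` node: the concrete view sees five children and the named view sees two.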