From 87ad0fb9c282d2de9721c39fdc1dc6bd73f06888 Mon Sep 17 00:00:00 2001 From: Max Brunsfeld Date: Mon, 13 Aug 2018 18:04:10 -0700 Subject: [PATCH] Expand using parsers document --- docs/section-2-using-parsers.md | 47 ++++++++++++++++++++++++--------- 1 file changed, 34 insertions(+), 13 deletions(-) diff --git a/docs/section-2-using-parsers.md b/docs/section-2-using-parsers.md index 24351bc9..6f5b09ec 100644 --- a/docs/section-2-using-parsers.md +++ b/docs/section-2-using-parsers.md @@ -5,9 +5,19 @@ permalink: using-parsers # Using Parsers -A Tree-sitter parser consists of a single C source file which exports one function with the naming scheme `tree_sitter_${LANGUAGE_NAME}`. This function returns a pointer to a `TSLanguage` struct, which can be used in conjunction with a `TSParser` to produce a syntax trees. +All of Tree-sitter's parsing functionality is exposed through C APIs. Applications written in higher-level languages can use Tree-sitter via binding libraries like [node-tree-sitter](https://github.com/tree-sitter/node-tree-sitter) or [rust-tree-sitter](https://github.com/tree-sitter/rust-tree-sitter), which have their own documentation. -## The Raw C API +This document describes the general concepts of how to use Tree-sitter, which should be relevant regardless of what language you're using. It also goes into some C-specific details that are useful if you're using the C API directly or are building a new binding to a different language. + +## The Object Model + +There are four main types of objects involved when using Tree-sitter: languages, parsers, syntax trees, and syntax nodes. In C, these are called `TSLanguage`, `TSParser`, `TSTree`, and `TSNode`. +* A `TSLanguage` is an opaque object that defines how to parse a particular programming language. The code for each `TSLanguage` is generated by Tree-sitter.
Many languages are already available in separate git repositories within the [Tree-sitter GitHub organization](https://github.com/tree-sitter). See [the next section](/creating-parsers) for how to create new languages. +* A `TSParser` is a stateful object that can be assigned a `TSLanguage` and used to produce a `TSTree` based on some source code. +* A `TSTree` represents the syntax tree of an entire source code file. It contains `TSNode` instances that indicate the structure of the source code. It can also be edited and used to produce a new `TSTree` in the event that the source code changes. +* A `TSNode` represents a single node in the syntax tree. It tracks its start and end positions in the source code, as well as its relation to other nodes like its parent, siblings and children. + +## An Example Program Here's an example of a simple C program that uses the Tree-sitter [JSON parser](https://github.com/tree-sitter/tree-sitter-json). @@ -19,26 +29,37 @@ Here's an example of a simple C program that uses the Tree-sitter [JSON parser]( #include #include "tree_sitter/runtime.h" +// Declare the `tree_sitter_json` function, which is +// implemented by the `tree-sitter-json` library. TSLanguage *tree_sitter_json(); int main() { - // Create a parser with the JSON language. + // Create a parser. TSParser *parser = ts_parser_new(); + + // Set the parser's language (JSON in this case). ts_parser_set_language(parser, tree_sitter_json()); - // Parse some source code. + // Build a syntax tree based on source code stored in a string. const char *source_code = "[1, null]"; - TSTree *tree = ts_parser_parse_string(parser, NULL, source_code, strlen(source_code)); + TSTree *tree = ts_parser_parse_string( + parser, + NULL, + source_code, + strlen(source_code) + ); - // Find some syntax tree nodes. + // Get the root node of the syntax tree. TSNode root_node = ts_tree_root_node(tree); + + // Get some child nodes.
TSNode array_node = ts_node_named_child(root_node, 0); TSNode number_node = ts_node_named_child(array_node, 0); // Check that the nodes have the expected types. - assert(!strcmp(ts_node_type(root_node), "value")); - assert(!strcmp(ts_node_type(array_node), "array")); - assert(!strcmp(ts_node_type(number_node), "number")); + assert(strcmp(ts_node_type(root_node), "value") == 0); + assert(strcmp(ts_node_type(array_node), "array") == 0); + assert(strcmp(ts_node_type(number_node), "number") == 0); // Check that the nodes have the expected child counts. assert(ts_node_child_count(root_node) == 1); @@ -50,7 +71,7 @@ int main() { char *string = ts_node_string(root_node); printf("Syntax tree: %s\n", string); - // Free all of the heap allocations. + // Free all of the heap-allocated memory. free(string); ts_tree_delete(tree); ts_parser_delete(parser); @@ -58,7 +79,7 @@ int main() { } ``` -This program uses the Tree-sitter C API, which is declared in the header file `tree_sitter/runtime.h`, so we need to add the `tree_sitter/include` directory to the include path. We also need to link `libruntime.a` into the binary. +This program uses the Tree-sitter C API, which is declared in the header file `tree_sitter/runtime.h`, so we need to add the `tree_sitter/include` directory to the include path. We also need to link `libruntime.a` into the binary. We compile the source code of the JSON language directly into the binary as well. ```sh clang \ @@ -71,11 +92,11 @@ clang \ ./test-json-parser ``` -### Providing the text to parse +## Providing the text to parse Text input is provided to a tree-sitter parser via a `TSInput` struct, which specifies a function pointer for reading chunks of text. The text can be encoded in either UTF8 or UTF16. This interface allows you to efficiently parse text that is stored in your own data structure. -### Querying the syntax tree +## Querying the syntax tree Tree-sitter provides a DOM-style interface for inspecting syntax trees. 
Functions like `ts_node_child(node, index)` and `ts_node_next_sibling(node)` expose every node in the concrete syntax tree. This is useful for operations like syntax-highlighting, which operate on a token-by-token basis. You can also traverse the tree in a more abstract way by using functions like `ts_node_named_child(node, index)` and `ts_node_next_named_sibling(node)`. These functions don't expose nodes that were specified in the grammar as anonymous tokens, like `:` and `{`. This is useful when analyzing the meaning of a document.