From de0e8a39a28591d0f9df5678faafc3cb6aeb70e0 Mon Sep 17 00:00:00 2001
From: Max Brunsfeld
Date: Tue, 14 Aug 2018 12:13:10 -0700
Subject: [PATCH] Expand using parsers section of the docs

---
 docs/section-2-using-parsers.md | 157 ++++++++++++++++++++++++++++++--
 1 file changed, 151 insertions(+), 6 deletions(-)

diff --git a/docs/section-2-using-parsers.md b/docs/section-2-using-parsers.md
index 6f5b09ec..ee151426 100644
--- a/docs/section-2-using-parsers.md
+++ b/docs/section-2-using-parsers.md
@@ -9,7 +9,7 @@ All of Tree-sitter's parsing functionality is exposed through C APIs. Applicatio
 
 This document will describes the general concepts of how to use Tree-sitter, which should be relevant regardless of what language you're using. It also goes into some C-specific details that are useful if you're using the C API directly or are building a new binding to a different language.
 
-## The Object Model
+## The Objects
 
 There are four main types of objects involved when using Tree-sitter: languages, parsers, syntax trees, and syntax nodes. In C, these are called `TSParser`, `TSLanguage`, `TSTree`, and `TSNode`.
 * A `TSLanguage` is an opaque object that defines how to parse a particular programming language. The code for each `TSLanguage` is generated by Tree-sitter. Many languages are already available in separate git repositories within the the [Tree-sitter GitHub organization](https://github.com/tree-sitter). See [the next section](/creating-parsers) for how to create new languages.
@@ -92,11 +92,156 @@ clang \
 ./test-json-parser
 ```
 
-## Providing the text to parse
+## Providing the Source Code
 
-Text input is provided to a tree-sitter parser via a `TSInput` struct, which specifies a function pointer for reading chunks of text. The text can be encoded in either UTF8 or UTF16. This interface allows you to efficiently parse text that is stored in your own data structure.
+In the example above, we parsed source code stored in a simple string using the `ts_parser_parse_string` function:
 
-## Querying the syntax tree
+```c
+TSTree *ts_parser_parse_string(
+  TSParser *self,
+  const TSTree *old_tree,
+  const char *string,
+  uint32_t length
+);
+```
 
-Tree-sitter provides a DOM-style interface for inspecting syntax trees. Functions like `ts_node_child(node, index)` and `ts_node_next_sibling(node)` expose every node in the concrete syntax tree. This is useful for operations like syntax-highlighting, which operate on a token-by-token basis. You can also traverse the tree in a more abstract way by using functions like
-`ts_node_named_child(node, index)` and `ts_node_next_named_sibling(node)`. These functions don't expose nodes that were specified in the grammar as anonymous tokens, like `:` and `{`. This is useful when analyzing the meaning of a document.
+You may want to parse source code that's stored in a custom data structure, like a [piece table](https://en.wikipedia.org/wiki/Piece_table) or a [rope](https://en.wikipedia.org/wiki/Rope_(data_structure)). In this case, you can use the more general `ts_parser_parse` function:
+
+```c
+TSTree *ts_parser_parse(
+  TSParser *self,
+  const TSTree *old_tree,
+  TSInput input
+);
+```
+
+The `TSInput` structure lets you provide your own function for reading a chunk of text at a given byte offset and row/column position. The function can return text encoded in either UTF8 or UTF16. This interface allows you to efficiently parse text that is stored in your own data structure.
+
+```c
+typedef struct {
+  void *payload;
+  const char *(*read)(
+    void *payload,
+    uint32_t byte_offset,
+    TSPoint position,
+    uint32_t *bytes_read
+  );
+  TSInputEncoding encoding;
+} TSInput;
+```
+
+## Syntax Nodes
+
+Tree-sitter provides a [DOM](https://en.wikipedia.org/wiki/Document_Object_Model)-style interface for inspecting syntax trees. A syntax node's *type* is a string that indicates which grammar rule the node represents.
+
+```c
+const char *ts_node_type(TSNode);
+```
+
+Syntax nodes store their position in the source code both in terms of raw bytes and row/column coordinates:
+
+```c
+uint32_t ts_node_start_byte(TSNode);
+uint32_t ts_node_end_byte(TSNode);
+
+typedef struct {
+  uint32_t row;
+  uint32_t column;
+} TSPoint;
+
+TSPoint ts_node_start_point(TSNode);
+TSPoint ts_node_end_point(TSNode);
+```
+
+## Retrieving Nodes
+
+Every tree has a *root node*:
+
+```c
+TSNode ts_tree_root_node(const TSTree *);
+```
+
+Once you have a node, you can access the node's children:
+
+```c
+uint32_t ts_node_child_count(TSNode);
+TSNode ts_node_child(TSNode, uint32_t);
+```
+
+You can also access its siblings and parent:
+
+```c
+TSNode ts_node_next_sibling(TSNode);
+TSNode ts_node_prev_sibling(TSNode);
+TSNode ts_node_parent(TSNode);
+```
+
+These methods may all return a *null node* to indicate, for example, that a node does not *have* a next sibling. You can check if a node is null:
+
+```c
+bool ts_node_is_null(TSNode);
+```
+
+## Named vs Anonymous Nodes
+
+Tree-sitter produces [*concrete* syntax trees](https://en.wikipedia.org/wiki/Parse_tree) - trees that contain nodes for every individual token in the source code, including things like commas and parentheses. This is important for use-cases that deal with individual tokens, like [syntax highlighting](https://en.wikipedia.org/wiki/Syntax_highlighting). But some types of code analysis are easier to perform using an [*abstract* syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree) - a tree in which the less important details have been removed. Tree-sitter's trees support these use cases by making a distinction between *named* and *anonymous* nodes.
+
+Consider a grammar rule like this:
+
+```js
+if_statement: $ => seq(
+  'if',
+  '(',
+  $._expression,
+  ')',
+  $._statement,
+)
+```
+
+A syntax node representing an `if_statement` in this language would have 5 children: the condition expression, the body statement, as well as the `if`, `(`, and `)` tokens. The expression and the statement would be marked as *named* nodes, because they have been given explicit names in the grammar. But the `if`, `(`, and `)` nodes would *not* be named nodes, because they are represented in the grammar as simple strings.
+
+You can check whether any given node is named:
+
+```c
+bool ts_node_is_named(TSNode);
+```
+
+When traversing the tree, you can also choose to skip over anonymous nodes by using the `_named_` variants of all of the methods described above:
+
+```c
+TSNode ts_node_named_child(TSNode, uint32_t);
+uint32_t ts_node_named_child_count(TSNode);
+TSNode ts_node_next_named_sibling(TSNode);
+TSNode ts_node_prev_named_sibling(TSNode);
+```
+
+If you use this group of methods, the syntax tree functions much like an abstract syntax tree.
+
+## Editing
+
+In applications like text editors, you often need to re-parse a file after its source code has changed. Tree-sitter is designed to support this use case efficiently. There are two steps required. First, you must *edit* the syntax tree, which adjusts the ranges of its nodes so that they stay in sync with the code.
+
+```c
+typedef struct {
+  uint32_t start_byte;
+  uint32_t old_end_byte;
+  uint32_t new_end_byte;
+  TSPoint start_point;
+  TSPoint old_end_point;
+  TSPoint new_end_point;
+} TSInputEdit;
+
+void ts_tree_edit(TSTree *, const TSInputEdit *);
+```
+
+Then, you can call `ts_parser_parse` again, passing in the old tree. This will create a new tree that internally shares structure with the old tree.
+
+## Concurrency
+
+Tree-sitter supports multi-threaded use cases by making syntax trees very cheap to copy.
+
+```c
+TSTree *ts_tree_copy(const TSTree *);
+```
+
+Internally, copying a syntax tree just entails incrementing an atomic reference count. Conceptually, it provides you a new tree which you can freely query, edit, reparse, or delete on a new thread while continuing to use the original tree on a different thread. Note that individual `TSTree` instances are *not* thread safe; you must copy a tree if you want to use it on multiple threads simultaneously.
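
Reviewer sketch (not part of the patch): the Editing section lists the `TSInputEdit` fields but does not show how they might be filled in. The block below is a minimal illustration for the simplest case, a pure insertion. The `TSPoint` and `TSInputEdit` definitions are local copies of the structs shown in the patch so the sketch compiles on its own. `advance_point` and `make_insert_edit` are hypothetical helper names, not Tree-sitter functions, and the sketch assumes columns are counted in bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the structs shown in the patch, so that this sketch
 * compiles without the Tree-sitter headers. */
typedef struct {
  uint32_t row;
  uint32_t column;
} TSPoint;

typedef struct {
  uint32_t start_byte;
  uint32_t old_end_byte;
  uint32_t new_end_byte;
  TSPoint start_point;
  TSPoint old_end_point;
  TSPoint new_end_point;
} TSInputEdit;

/* Hypothetical helper (not a Tree-sitter function): move a point forward
 * over `length` bytes of text. Assumes columns are counted in bytes. */
static TSPoint advance_point(TSPoint point, const char *text, uint32_t length) {
  for (uint32_t i = 0; i < length; i++) {
    if (text[i] == '\n') {
      point.row++;
      point.column = 0;
    } else {
      point.column++;
    }
  }
  return point;
}

/* Hypothetical helper: describe inserting `inserted` at the given position.
 * Nothing is removed, so the old end equals the start. */
static TSInputEdit make_insert_edit(
  uint32_t start_byte,
  TSPoint start_point,
  const char *inserted
) {
  assert(inserted != NULL);
  uint32_t length = (uint32_t)strlen(inserted);
  TSInputEdit edit;
  edit.start_byte = start_byte;
  edit.old_end_byte = start_byte;
  edit.new_end_byte = start_byte + length;
  edit.start_point = start_point;
  edit.old_end_point = start_point;
  edit.new_end_point = advance_point(start_point, inserted, length);
  return edit;
}
```

In a real application, the resulting edit would be passed to the tree-editing function, and the edited tree would then be handed to `ts_parser_parse` as the old tree, as the Editing section describes.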