---
title: Using Parsers
permalink: using-parsers
---
# Using Parsers
All of Tree-sitter's parsing functionality is exposed through C APIs. Applications written in higher-level languages can use Tree-sitter via binding libraries like [node-tree-sitter](https://github.com/tree-sitter/node-tree-sitter) or [rust-tree-sitter](https://github.com/tree-sitter/tree-sitter/tree/master/lib/binding_rust), which have their own documentation.
This document describes the general concepts of how to use Tree-sitter, which should be relevant regardless of what language you're using. It also goes into some C-specific details that are useful if you're using the C API directly or are building a new binding to a different language.
All of the API functions shown here are declared and documented in the `tree_sitter/api.h` header file.
## Getting Started
### Building the Library
To build the library on a POSIX system, run this script, which will create a static library called `libtree-sitter.a` in the Tree-sitter folder:
```sh
script/build-lib
```
Alternatively, you can use the library in a larger project by adding one source file to the project. This source file needs two directories to be in the include path when compiled:
**source file:**
* `tree-sitter/lib/src/lib.c`
**include directories:**
* `tree-sitter/lib/src`
* `tree-sitter/lib/include`
### The Basic Objects
There are four main types of objects involved when using Tree-sitter: languages, parsers, syntax trees, and syntax nodes. In C, these are called `TSLanguage`, `TSParser`, `TSTree`, and `TSNode`.
* A `TSLanguage` is an opaque object that defines how to parse a particular programming language. The code for each `TSLanguage` is generated by Tree-sitter. Many languages are already available in separate git repositories within the [Tree-sitter GitHub organization](https://github.com/tree-sitter). See [the next page](./creating-parsers) for how to create new languages.
* A `TSParser` is a stateful object that can be assigned a `TSLanguage` and used to produce a `TSTree` based on some source code.
* A `TSTree` represents the syntax tree of an entire source code file. It contains `TSNode` instances that indicate the structure of the source code. It can also be edited and used to produce a new `TSTree` in the event that the source code changes.
* A `TSNode` represents a single node in the syntax tree. It tracks its start and end positions in the source code, as well as its relation to other nodes like its parent, siblings and children.
### An Example Program
Here's an example of a simple C program that uses the Tree-sitter [JSON parser](https://github.com/tree-sitter/tree-sitter-json).
```c
// Filename - test-json-parser.c

#include <assert.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <tree_sitter/api.h>

// Declare the `tree_sitter_json` function, which is
// implemented by the `tree-sitter-json` library.
const TSLanguage *tree_sitter_json(void);

int main() {
  // Create a parser.
  TSParser *parser = ts_parser_new();

  // Set the parser's language (JSON in this case).
  ts_parser_set_language(parser, tree_sitter_json());

  // Build a syntax tree based on source code stored in a string.
  const char *source_code = "[1, null]";
  TSTree *tree = ts_parser_parse_string(
    parser,
    NULL,
    source_code,
    strlen(source_code)
  );

  // Get the root node of the syntax tree.
  TSNode root_node = ts_tree_root_node(tree);

  // Get some child nodes.
  TSNode array_node = ts_node_named_child(root_node, 0);
  TSNode number_node = ts_node_named_child(array_node, 0);

  // Check that the nodes have the expected types.
  assert(strcmp(ts_node_type(root_node), "value") == 0);
  assert(strcmp(ts_node_type(array_node), "array") == 0);
  assert(strcmp(ts_node_type(number_node), "number") == 0);

  // Check that the nodes have the expected child counts.
  assert(ts_node_child_count(root_node) == 1);
  assert(ts_node_child_count(array_node) == 5);
  assert(ts_node_named_child_count(array_node) == 2);
  assert(ts_node_child_count(number_node) == 0);

  // Print the syntax tree as an S-expression.
  char *string = ts_node_string(root_node);
  printf("Syntax tree: %s\n", string);

  // Free all of the heap-allocated memory.
  // (`free` is declared in <stdlib.h>.)
  free(string);
  ts_tree_delete(tree);
  ts_parser_delete(parser);
  return 0;
}
```
This program uses the Tree-sitter C API, which is declared in the header file `tree_sitter/api.h`, so we need to add the `tree-sitter/lib/include` directory to the include path. We also need to link `libtree-sitter.a` into the binary. We compile the source code of the JSON language directly into the binary as well.
```sh
clang \
  -I tree-sitter/lib/include \
  test-json-parser.c \
  tree-sitter-json/src/parser.c \
  tree-sitter/libtree-sitter.a \
  -o test-json-parser
./test-json-parser
```
## Basic Parsing
### Providing the Code
In the example above, we parsed source code stored in a simple string using the `ts_parser_parse_string` function:
```c
TSTree *ts_parser_parse_string(
  TSParser *self,
  const TSTree *old_tree,
  const char *string,
  uint32_t length
);
```
You may want to parse source code that's stored in a custom data structure, like a [piece table](https://en.wikipedia.org/wiki/Piece_table) or a [rope](https://en.wikipedia.org/wiki/Rope_(data_structure)). In this case, you can use the more general `ts_parser_parse` function:
```c
TSTree *ts_parser_parse(
  TSParser *self,
  const TSTree *old_tree,
  TSInput input
);
```
The `TSInput` structure lets you provide your own function for reading a chunk of text at a given byte offset and row/column position. The function can return text encoded in either UTF-8 or UTF-16. This interface allows you to efficiently parse text that is stored in your own data structure.
```c
typedef struct {
  void *payload;
  const char *(*read)(
    void *payload,
    uint32_t byte_offset,
    TSPoint position,
    uint32_t *bytes_read
  );
  TSInputEncoding encoding;
} TSInput;
```
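As a rough illustration of such a `read` function, the sketch below serves text from a hypothetical `ChunkedText` structure that stores a document as an array of separately allocated chunks. `ChunkedText` and `chunked_text_read` are names invented for this example, and the `TSPoint` typedef is a stand-in that exists only so the snippet compiles on its own; in a real program you would include `tree_sitter/api.h` instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Stand-in for the `TSPoint` type from `tree_sitter/api.h`, repeated here
// only so that the sketch compiles on its own. Remove it when including
// the real header.
typedef struct {
  uint32_t row;
  uint32_t column;
} TSPoint;

// Hypothetical document storage: the text is split into separately
// allocated chunks, loosely like the leaves of a rope.
typedef struct {
  const char *const *chunks;     // chunk contents; need not be NUL-terminated
  const uint32_t *chunk_lengths; // length of each chunk in bytes
  uint32_t chunk_count;
} ChunkedText;

// Has the same signature as the `read` member of `TSInput`. Returns a
// pointer to the bytes starting at `byte_offset` and stores in `bytes_read`
// how many contiguous bytes are available there. Reports zero bytes once
// the offset is past the end of the text.
const char *chunked_text_read(
  void *payload,
  uint32_t byte_offset,
  TSPoint position,
  uint32_t *bytes_read
) {
  (void)position; // this storage scheme only needs the byte offset
  const ChunkedText *text = payload;
  assert(text != NULL && bytes_read != NULL);

  uint32_t chunk_start = 0;
  for (uint32_t i = 0; i < text->chunk_count; i++) {
    uint32_t chunk_end = chunk_start + text->chunk_lengths[i];
    if (byte_offset < chunk_end) {
      // The offset falls inside this chunk: hand back the rest of it.
      *bytes_read = chunk_end - byte_offset;
      return text->chunks[i] + (byte_offset - chunk_start);
    }
    chunk_start = chunk_end;
  }

  *bytes_read = 0;
  return "";
}
```

Assuming the real header is included, a value such as `(TSInput) {&text, chunked_text_read, TSInputEncodingUTF8}` could then be passed to `ts_parser_parse`. The linear scan over chunks keeps the sketch short; a real rope would typically locate the chunk through its tree structure.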
### Syntax Nodes
Tree-sitter provides a [DOM](https://en.wikipedia.org/wiki/Document_Object_Model)-style interface for inspecting syntax trees. A syntax node's *type* is a string that indicates which grammar rule the node represents.
```c
const char *ts_node_type(TSNode);
```
Syntax nodes store their position in the source code both in terms of raw bytes and row/column coordinates:
```c
uint32_t ts_node_start_byte(TSNode);
uint32_t ts_node_end_byte(TSNode);
typedef struct {
  uint32_t row;
  uint32_t column;
} TSPoint;
TSPoint ts_node_start_point(TSNode);
TSPoint ts_node_end_point(TSNode);
```
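Byte offsets are the simplest way to get back to the text that a node covers. The following is a minimal sketch, assuming the source code is held in one contiguous buffer; `copy_byte_range` is a hypothetical helper, not part of the Tree-sitter API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Copies the bytes in [start_byte, end_byte) of `source` into a new
// NUL-terminated string. The offsets are meant to come from
// `ts_node_start_byte` and `ts_node_end_byte`. Returns NULL if the range
// is inverted or the allocation fails; the caller frees the result.
char *copy_byte_range(const char *source, uint32_t start_byte, uint32_t end_byte) {
  assert(source != NULL);
  if (end_byte < start_byte) return NULL;

  size_t length = (size_t)end_byte - start_byte;
  char *text = malloc(length + 1);
  if (text == NULL) return NULL;

  memcpy(text, source + start_byte, length);
  text[length] = '\0';
  return text;
}
```

For the `[1, null]` string from the example program, passing the offsets 4 and 8 yields `null`.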
### Retrieving Nodes
Every tree has a *root node*:
```c
TSNode ts_tree_root_node(const TSTree *);
```
Once you have a node, you can access the node's children:
```c
uint32_t ts_node_child_count(TSNode);
TSNode ts_node_child(TSNode, uint32_t);
```
You can also access its siblings and parent:
```c
TSNode ts_node_next_sibling(TSNode);
TSNode ts_node_prev_sibling(TSNode);
TSNode ts_node_parent(TSNode);
```
These methods may all return a *null node* to indicate, for example, that a node does not *have* a next sibling. You can check if a node is null:
```c
bool ts_node_is_null(TSNode);
```
### Named vs Anonymous Nodes
Tree-sitter produces [*concrete* syntax trees](https://en.wikipedia.org/wiki/Parse_tree) - trees that contain nodes for every individual token in the source code, including things like commas and parentheses. This is important for use-cases that deal with individual tokens, like [syntax highlighting](https://en.wikipedia.org/wiki/Syntax_highlighting). But some types of code analysis are easier to perform using an [*abstract* syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree) - a tree in which the less important details have been removed. Tree-sitter's trees support these use cases by making a distinction between *named* and *anonymous* nodes.
Consider a grammar rule like this:
```js
if_statement: $ => seq(
  'if',
  '(',
  $._expression,
  ')',
  $._statement,
)
```
A syntax node representing an `if_statement` in this language would have 5 children: the condition expression, the body statement, as well as the `if`, `(`, and `)` tokens. The expression and the statement would be marked as *named* nodes, because they have been given explicit names in the grammar. But the `if`, `(`, and `)` nodes would *not* be named nodes, because they are represented in the grammar as simple strings.
You can check whether any given node is named:
```c
bool ts_node_is_named(TSNode);
```
When traversing the tree, you can also choose to skip over anonymous nodes by using the `_named_` variants of all of the methods described above:
```c
TSNode ts_node_named_child(TSNode, uint32_t);
uint32_t ts_node_named_child_count(TSNode);
TSNode ts_node_next_named_sibling(TSNode);
TSNode ts_node_prev_named_sibling(TSNode);
```
If you use this group of methods, the syntax tree functions much like an abstract syntax tree.
### Node Field Names
To make syntax nodes easier to analyze, many grammars assign unique *field names* to particular child nodes. The next page [explains](./creating-parsers#using-fields) how to do this in your own grammars. If a syntax node has fields, you can access its children using their field name:
```c
TSNode ts_node_child_by_field_name(
  TSNode self,
  const char *field_name,
  uint32_t field_name_length
);
```
Fields also have numeric ids that you can use, if you want to avoid repeated string comparisons. You can convert between strings and ids using the `TSLanguage`:
```c
uint32_t ts_language_field_count(const TSLanguage *);
const char *ts_language_field_name_for_id(const TSLanguage *, TSFieldId);
TSFieldId ts_language_field_id_for_name(const TSLanguage *, const char *, uint32_t);
```
The field ids can be used in place of the name:
```c
TSNode ts_node_child_by_field_id(TSNode, TSFieldId);
```
## Advanced Parsing
### Editing
In applications like text editors, you often need to re-parse a file after its source code has changed. Tree-sitter is designed to support this use case efficiently. There are two steps required. First, you must *edit* the syntax tree, which adjusts the ranges of its nodes so that they stay in sync with the code.
```c
typedef struct {
  uint32_t start_byte;
  uint32_t old_end_byte;
  uint32_t new_end_byte;
  TSPoint start_point;
  TSPoint old_end_point;
  TSPoint new_end_point;
} TSInputEdit;
void ts_tree_edit(TSTree *, const TSInputEdit *);
```
Then, you can call `ts_parser_parse` again, passing in the old tree. This will create a new tree that internally shares structure with the old tree.
When you edit a syntax tree, the positions of its nodes will change. If you have stored any `TSNode` instances outside of the `TSTree`, you must update their cached positions separately, using the same `TSInputEdit` value.
```c
void ts_node_edit(TSNode *, const TSInputEdit *);
```
This `ts_node_edit` function is *only* needed in the case where you have retrieved `TSNode` instances *before* editing the tree, and then *after* editing the tree, you want to continue to use those specific node instances. Often, you'll just want to re-fetch nodes from the edited tree, in which case `ts_node_edit` is not needed.
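To show how the fields of a `TSInputEdit` relate to a concrete change, here is a minimal sketch that fills one in for a replacement within a contiguous text buffer. It assumes that columns are counted in bytes and that rows are separated by `\n`. The helpers `advance_point` and `make_input_edit` are hypothetical, and the two typedefs are stand-ins repeating the definitions shown above so that the snippet compiles without `tree_sitter/api.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Stand-ins for the types declared in `tree_sitter/api.h`. Remove them
// when including the real header.
typedef struct {
  uint32_t row;
  uint32_t column;
} TSPoint;

typedef struct {
  uint32_t start_byte;
  uint32_t old_end_byte;
  uint32_t new_end_byte;
  TSPoint start_point;
  TSPoint old_end_point;
  TSPoint new_end_point;
} TSInputEdit;

// Advances `point` over `length` bytes of `text`, starting a new row after
// every '\n'. Columns are counted in bytes (an assumption of this sketch).
static TSPoint advance_point(TSPoint point, const char *text, uint32_t length) {
  for (uint32_t i = 0; i < length; i++) {
    if (text[i] == '\n') {
      point.row++;
      point.column = 0;
    } else {
      point.column++;
    }
  }
  return point;
}

// Describes replacing `removed_length` bytes at `start_byte` of `old_text`
// with the first `inserted_length` bytes of `inserted`.
TSInputEdit make_input_edit(
  const char *old_text,
  uint32_t start_byte,
  uint32_t removed_length,
  const char *inserted,
  uint32_t inserted_length
) {
  assert(old_text != NULL && inserted != NULL);
  TSPoint origin = {0, 0};
  TSInputEdit edit;
  edit.start_byte = start_byte;
  edit.old_end_byte = start_byte + removed_length;
  edit.new_end_byte = start_byte + inserted_length;
  // The start point is found by scanning the old text up to the edit.
  edit.start_point = advance_point(origin, old_text, start_byte);
  // The old end is measured in the old text, the new end in the new text.
  edit.old_end_point = advance_point(edit.start_point, old_text + start_byte, removed_length);
  edit.new_end_point = advance_point(edit.start_point, inserted, inserted_length);
  return edit;
}
```

Replacing `null` in `[1, null]` with the seven bytes `true,\n2` (where `\n` is a newline) gives a `start_byte` of 4, an `old_end_byte` of 8, a `new_end_byte` of 11, and a `new_end_point` of row 1, column 1. The resulting value is what you would pass to `ts_tree_edit` before re-parsing.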
### Multi-language Documents
Sometimes, different parts of a file may be written in different languages. For example, templating languages like [EJS](http://ejs.co) and [ERB](https://ruby-doc.org/stdlib-2.5.1/libdoc/erb/rdoc/ERB.html) allow you to generate HTML by writing a mixture of HTML and another language like JavaScript or Ruby.
Tree-sitter handles these types of documents by allowing you to create a syntax tree based on the text in certain *ranges* of a file.
```c
typedef struct {
  TSPoint start_point;
  TSPoint end_point;
  uint32_t start_byte;
  uint32_t end_byte;
} TSRange;

void ts_parser_set_included_ranges(
  TSParser *self,
  const TSRange *ranges,
  uint32_t range_count
);
```
For example, consider this ERB document:
```erb
<ul>
  <% people.each do |person| %>
    <li><%= person.name %></li>
  <% end %>
</ul>
```
Conceptually, it can be represented by three syntax trees with overlapping ranges: an ERB syntax tree, a Ruby syntax tree, and an HTML syntax tree. You could generate these syntax trees with the following code:
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tree_sitter/api.h>

// These functions are each implemented in their own repo.
const TSLanguage *tree_sitter_embedded_template();
const TSLanguage *tree_sitter_html();
const TSLanguage *tree_sitter_ruby();

int main(int argc, const char **argv) {
  // The document to parse is expected as the first argument.
  if (argc < 2) return 1;
  const char *text = argv[1];
  unsigned len = strlen(text);

  // Parse the entire text as ERB.
  TSParser *parser = ts_parser_new();
  ts_parser_set_language(parser, tree_sitter_embedded_template());
  TSTree *erb_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode erb_root_node = ts_tree_root_node(erb_tree);

  // In the ERB syntax tree, find the ranges of the `content` nodes,
  // which represent the underlying HTML, and the `code` nodes, which
  // represent the interpolated Ruby. The fixed-size arrays keep the
  // example short; they assume at most 10 ranges of each kind.
  TSRange html_ranges[10];
  TSRange ruby_ranges[10];
  unsigned html_range_count = 0;
  unsigned ruby_range_count = 0;
  unsigned child_count = ts_node_child_count(erb_root_node);

  for (unsigned i = 0; i < child_count; i++) {
    TSNode node = ts_node_child(erb_root_node, i);
    if (strcmp(ts_node_type(node), "content") == 0) {
      html_ranges[html_range_count++] = (TSRange) {
        ts_node_start_point(node),
        ts_node_end_point(node),
        ts_node_start_byte(node),
        ts_node_end_byte(node),
      };
    } else {
      TSNode code_node = ts_node_named_child(node, 0);
      ruby_ranges[ruby_range_count++] = (TSRange) {
        ts_node_start_point(code_node),
        ts_node_end_point(code_node),
        ts_node_start_byte(code_node),
        ts_node_end_byte(code_node),
      };
    }
  }

  // Use the HTML ranges to parse the HTML.
  ts_parser_set_language(parser, tree_sitter_html());
  ts_parser_set_included_ranges(parser, html_ranges, html_range_count);
  TSTree *html_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode html_root_node = ts_tree_root_node(html_tree);

  // Use the Ruby ranges to parse the Ruby.
  ts_parser_set_language(parser, tree_sitter_ruby());
  ts_parser_set_included_ranges(parser, ruby_ranges, ruby_range_count);
  TSTree *ruby_tree = ts_parser_parse_string(parser, NULL, text, len);
  TSNode ruby_root_node = ts_tree_root_node(ruby_tree);

  // Print all three trees.
  char *erb_sexp = ts_node_string(erb_root_node);
  char *html_sexp = ts_node_string(html_root_node);
  char *ruby_sexp = ts_node_string(ruby_root_node);
  printf("ERB: %s\n", erb_sexp);
  printf("HTML: %s\n", html_sexp);
  printf("Ruby: %s\n", ruby_sexp);

  // Free all of the heap-allocated memory.
  free(erb_sexp);
  free(html_sexp);
  free(ruby_sexp);
  ts_tree_delete(erb_tree);
  ts_tree_delete(html_tree);
  ts_tree_delete(ruby_tree);
  ts_parser_delete(parser);
  return 0;
}
```
This API allows for great flexibility in how languages can be composed. Tree-sitter is not responsible for mediating the interactions between languages. Instead, you are free to do that using arbitrary application-specific logic.
### Concurrency
Tree-sitter supports multi-threaded use cases by making syntax trees very cheap to copy.
```c
TSTree *ts_tree_copy(const TSTree *);
```
Internally, copying a syntax tree just entails incrementing an atomic reference count. Conceptually, it provides you a new tree which you can freely query, edit, reparse, or delete on a new thread while continuing to use the original tree on a different thread. Note that individual `TSTree` instances are *not* thread safe; you must copy a tree if you want to use it on multiple threads simultaneously.
## Other Tree Operations
### Walking Trees with Tree Cursors
You can access every node in a syntax tree using the `TSNode` APIs [described above](#retrieving-nodes), but if you need to access a large number of nodes, the most efficient way to do it is with a *tree cursor*. A cursor is a stateful object that allows you to walk a syntax tree with maximum efficiency.
You can initialize a cursor from any node:
```c
TSTreeCursor ts_tree_cursor_new(TSNode);
```
You can move the cursor around the tree:
```c
bool ts_tree_cursor_goto_first_child(TSTreeCursor *);
bool ts_tree_cursor_goto_next_sibling(TSTreeCursor *);
bool ts_tree_cursor_goto_parent(TSTreeCursor *);
```
These methods return `true` if the cursor successfully moved and `false` if there was no node to move to.
You can always retrieve the cursor's current node, as well as the [field name](#node-field-names) that is associated with the current node.
```c
TSNode ts_tree_cursor_current_node(const TSTreeCursor *);
const char *ts_tree_cursor_current_field_name(const TSTreeCursor *);
TSFieldId ts_tree_cursor_current_field_id(const TSTreeCursor *);
```
### Pattern Matching with Queries
Many code analysis tasks involve searching for patterns in syntax trees. Tree-sitter provides a small declarative language for expressing these patterns and searching for matches. The language is similar to the format of Tree-sitter's [unit test system](./creating-parsers#command-test).
#### Basics
Syntax trees are written as [S-expressions](https://en.wikipedia.org/wiki/S-expression). An S-expression for a node consists of a pair of parentheses containing the node's name and, optionally, a series of S-expressions representing the node's children.
For example, this pattern would match a `binary_expression` node whose children are both `number_literal` nodes:
```
(binary_expression (number_literal) (number_literal))
```
Children can also be omitted. For example, this would match any `binary_expression` where at least *one* of its children is a `string_literal` node:
```
(binary_expression (string_literal))
```
#### Fields
In general, it's a good idea to make patterns more specific by specifying field names associated with child nodes. For example, this pattern would match an `assignment_expression` node whose *left* child is a `member_expression` with a `call_expression` for its `object`.
```
(assignment_expression
  left: (member_expression
    object: (call_expression)))
```
#### Anonymous Nodes
The parenthesized syntax for writing nodes only applies to [named nodes](#named-vs-anonymous-nodes). To match specific anonymous nodes, you write their name between double quotes. For example, this pattern would match any `binary_expression` where the operator is `!=` and the right side is `null`:
```
(binary_expression
  operator: "!="
  right: (null))
```
#### Capturing Nodes
When matching patterns, you may want to process specific nodes within the pattern. Captures allow you to associate names with specific nodes in a pattern, so that you can later refer to those nodes by those names. Capture names are written *after* the nodes that they refer to, and start with an `@` character.
For example, this pattern would match any assignment of a `function` to an `identifier`, and it would associate the name `function-definition` with the identifier:
```
(assignment_expression
  left: (identifier) @function-definition
  right: (function))
```
And this pattern would match all method definitions, associating the name `the-method-name` with the method name and `the-class-name` with the containing class name:
```
(class_declaration
  name: (identifier) @the-class-name
  body: (class_body
    (method_definition
      name: (property_identifier) @the-method-name)))
```
#### Predicates
You can also specify other conditions that should restrict the nodes that match a given pattern. You do this by enclosing the pattern in an additional pair of parentheses, and specifying one or more *predicate* S-expressions after your main pattern. Predicate S-expressions must start with a predicate name, and contain either `@`-prefixed capture names or strings.
For example, this pattern would match identifier nodes whose names are written entirely in capital letters and underscores, such as `MAX_SIZE`:
```
((identifier) @constant
  (match? @constant "^[A-Z][A-Z_]+$"))
```
*Note* - Predicates are not handled directly by the Tree-sitter library. They are just exposed in a structured form so that higher-level code can perform the filtering.
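Since the filtering is left to the application, a `match?` predicate could be evaluated along these lines. This is only a sketch: `capture_matches` is a hypothetical helper, it assumes the captured node's text has already been copied into a NUL-terminated string, and it interprets the pattern as a POSIX extended regular expression, which may differ from the dialect used by a particular binding.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

// Returns true if `captured_text` matches `pattern`, interpreted as a
// POSIX extended regular expression. A pattern that fails to compile is
// treated as "no match" to keep the sketch short.
bool capture_matches(const char *captured_text, const char *pattern) {
  assert(captured_text != NULL && pattern != NULL);

  regex_t regex;
  if (regcomp(&regex, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
    return false;
  }

  // regexec returns 0 when the text matches the compiled expression.
  bool matched = regexec(&regex, captured_text, 0, NULL, 0) == 0;
  regfree(&regex);
  return matched;
}
```

Compiling the expression on every call is wasteful; code that evaluates the same predicate for many matches would more likely compile it once and reuse the `regex_t`.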
#### The Query API
Create a query by specifying a string containing one or more patterns:
```c
TSQuery *ts_query_new(
  const TSLanguage *language,
  const char *source,
  uint32_t source_len,
  uint32_t *error_offset,
  TSQueryError *error_type
);
```
If there is an error in the query, then the `error_offset` argument will be set to the byte offset of the error, and the `error_type` argument will be set to a value that indicates the type of error:
```c
typedef enum {
  TSQueryErrorNone = 0,
  TSQueryErrorSyntax,
  TSQueryErrorNodeType,
  TSQueryErrorField,
  TSQueryErrorCapture,
} TSQueryError;
```
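When reporting a failed `ts_query_new` call to a user, it can help to turn the error value into text. A minimal sketch follows; the `query_error_description` function and its wording are made up for this example, and the enum is repeated from above only so that the snippet compiles without `tree_sitter/api.h`.

```c
// Stand-in for the enum declared in `tree_sitter/api.h`. Remove it when
// including the real header.
typedef enum {
  TSQueryErrorNone = 0,
  TSQueryErrorSyntax,
  TSQueryErrorNodeType,
  TSQueryErrorField,
  TSQueryErrorCapture,
} TSQueryError;

// Maps a query error to a short, human-readable description. The strings
// are this example's own wording, not messages produced by the library.
const char *query_error_description(TSQueryError error) {
  switch (error) {
    case TSQueryErrorNone:     return "no error";
    case TSQueryErrorSyntax:   return "syntax error";
    case TSQueryErrorNodeType: return "unknown node type";
    case TSQueryErrorField:    return "unknown field name";
    case TSQueryErrorCapture:  return "unknown capture name";
  }
  // Reached only for values outside the enum shown above.
  return "unrecognized error";
}
```

Combined with the byte offset written to `error_offset`, this is enough to point at the offending part of the query source.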
The `TSQuery` value is immutable and can be safely shared between threads. To execute the query, create a `TSQueryCursor`, which carries the state needed for processing the queries. The query cursor should not be shared between threads, but can be reused for many query executions.
```c
TSQueryCursor *ts_query_cursor_new(void);
```
You can then execute the query on a given syntax node:
```c
void ts_query_cursor_exec(TSQueryCursor *, const TSQuery *, TSNode);
```
You can then iterate over the matches:
```c
typedef struct {
  uint32_t id;
  uint16_t pattern_index;
  uint16_t capture_count;
  const TSQueryCapture *captures;
} TSQueryMatch;
bool ts_query_cursor_next_match(TSQueryCursor *, TSQueryMatch *match);
```
This function will return `false` when there are no more matches.