2024-12-24 00:43:07 -05:00
|
|
|
# Basic Parsing
|
|
|
|
|
|
|
|
|
|
## Providing the Code
|
|
|
|
|
|
|
|
|
|
In the example on the previous page, we parsed source code stored in a simple string using the `ts_parser_parse_string` function:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
TSTree *ts_parser_parse_string(
|
|
|
|
|
TSParser *self,
|
|
|
|
|
const TSTree *old_tree,
|
|
|
|
|
const char *string,
|
|
|
|
|
uint32_t length
|
|
|
|
|
);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
You may want to parse source code that's stored in a custom data structure, like a [piece table][piece table] or a [rope][rope].
|
|
|
|
|
In this case, you can use the more general `ts_parser_parse` function:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
TSTree *ts_parser_parse(
|
|
|
|
|
TSParser *self,
|
|
|
|
|
const TSTree *old_tree,
|
|
|
|
|
TSInput input
|
|
|
|
|
);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
The `TSInput` structure lets you provide your own function for reading a chunk of text at a given byte offset and row/column
|
|
|
|
|
position. The function can return text encoded in either UTF-8 or UTF-16. This interface allows you to efficiently parse
|
|
|
|
|
text that is stored in your own data structure.
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
typedef struct {
|
|
|
|
|
void *payload;
|
|
|
|
|
const char *(*read)(
|
|
|
|
|
void *payload,
|
|
|
|
|
uint32_t byte_offset,
|
|
|
|
|
TSPoint position,
|
|
|
|
|
uint32_t *bytes_read
|
|
|
|
|
);
|
|
|
|
|
TSInputEncoding encoding;
|
2025-08-23 10:18:15 +03:00
|
|
|
TSDecodeFunction decode;
|
2024-12-24 00:43:07 -05:00
|
|
|
} TSInput;
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
If you want to decode text that is not encoded in UTF-8 or UTF-16, you can set the `decode` field of the input to your function
|
2025-08-23 10:18:15 +03:00
|
|
|
that will decode text. The signature of the `TSDecodeFunction` is as follows:
|
2024-12-24 00:43:07 -05:00
|
|
|
|
|
|
|
|
```c
|
2025-08-23 10:18:15 +03:00
|
|
|
typedef uint32_t (*TSDecodeFunction)(
|
2024-12-24 00:43:07 -05:00
|
|
|
const uint8_t *string,
|
|
|
|
|
uint32_t length,
|
|
|
|
|
int32_t *code_point
|
|
|
|
|
);
|
|
|
|
|
```
|
|
|
|
|
|
2025-01-10 02:29:32 -05:00
|
|
|
```admonish attention
|
|
|
|
|
The `TSInputEncoding` must be set to `TSInputEncodingCustom` for the `decode` function to be called.
|
|
|
|
|
```
|
2024-12-24 00:43:07 -05:00
|
|
|
|
|
|
|
|
The `string` argument is a pointer to the text to decode, which comes from the `read` function, and the `length` argument
|
|
|
|
|
is the length of the `string`. The `code_point` argument is a pointer to an integer that represents the decoded code point,
|
|
|
|
|
and should be written to in your `decode` callback. The function should return the number of bytes decoded.
|
|
|
|
|
|
|
|
|
|
## Syntax Nodes
|
|
|
|
|
|
|
|
|
|
Tree-sitter provides a [DOM][dom]-style interface for inspecting syntax trees.
|
|
|
|
|
A syntax node's _type_ is a string that indicates which grammar rule the node represents.
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
const char *ts_node_type(TSNode);
|
|
|
|
|
```
|
|
|
|
|
|
2024-12-28 16:12:45 -05:00
|
|
|
Syntax nodes store their position in the source code both in raw bytes and row/column
|
|
|
|
|
coordinates. In a point, rows and columns are zero-based. The `row` field represents
|
|
|
|
|
the number of newlines before a given position, while `column` represents the number
|
|
|
|
|
of bytes between the position and beginning of the line.
|
2024-12-24 00:43:07 -05:00
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
uint32_t ts_node_start_byte(TSNode);
|
|
|
|
|
uint32_t ts_node_end_byte(TSNode);
|
|
|
|
|
typedef struct {
|
|
|
|
|
uint32_t row;
|
|
|
|
|
uint32_t column;
|
|
|
|
|
} TSPoint;
|
|
|
|
|
TSPoint ts_node_start_point(TSNode);
|
|
|
|
|
TSPoint ts_node_end_point(TSNode);
|
|
|
|
|
```
|
|
|
|
|
|
2025-08-28 13:31:25 -04:00
|
|
|
```admonish note
|
|
|
|
|
A *newline* is considered to be a single line feed (`\n`) character.
|
|
|
|
|
```
|
|
|
|
|
|
2024-12-24 00:43:07 -05:00
|
|
|
## Retrieving Nodes
|
|
|
|
|
|
|
|
|
|
Every tree has a _root node_:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
TSNode ts_tree_root_node(const TSTree *);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Once you have a node, you can access the node's children:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
uint32_t ts_node_child_count(TSNode);
|
|
|
|
|
TSNode ts_node_child(TSNode, uint32_t);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
You can also access its siblings and parent:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
TSNode ts_node_next_sibling(TSNode);
|
|
|
|
|
TSNode ts_node_prev_sibling(TSNode);
|
|
|
|
|
TSNode ts_node_parent(TSNode);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
These methods may all return a _null node_ to indicate, for example, that a node does not _have_ a next sibling.
|
|
|
|
|
You can check if a node is null:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
bool ts_node_is_null(TSNode);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
## Named vs Anonymous Nodes
|
|
|
|
|
|
|
|
|
|
Tree-sitter produces [_concrete_ syntax trees][cst] — trees that contain nodes for
|
|
|
|
|
every individual token in the source code, including things like commas and parentheses. This is important for use-cases
|
|
|
|
|
that deal with individual tokens, like [syntax highlighting][syntax highlighting]. But some
|
|
|
|
|
types of code analysis are easier to perform using an [_abstract_ syntax tree][ast] — a tree in which the less important
|
|
|
|
|
details have been removed. Tree-sitter's trees support these use cases by making a distinction between
|
|
|
|
|
_named_ and _anonymous_ nodes.
|
|
|
|
|
|
|
|
|
|
Consider a grammar rule like this:
|
|
|
|
|
|
|
|
|
|
```js
|
|
|
|
|
if_statement: $ => seq("if", "(", $._expression, ")", $._statement);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
A syntax node representing an `if_statement` in this language would have 5 children: the condition expression, the body statement,
|
|
|
|
|
as well as the `if`, `(`, and `)` tokens. The expression and the statement would be marked as _named_ nodes, because they
|
|
|
|
|
have been given explicit names in the grammar. But the `if`, `(`, and `)` nodes would _not_ be named nodes, because they
|
|
|
|
|
are represented in the grammar as simple strings.
|
|
|
|
|
|
|
|
|
|
You can check whether any given node is named:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
bool ts_node_is_named(TSNode);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
When traversing the tree, you can also choose to skip over anonymous nodes by using the `_named_` variants of all of the
|
|
|
|
|
methods described above:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
TSNode ts_node_named_child(TSNode, uint32_t);
|
|
|
|
|
uint32_t ts_node_named_child_count(TSNode);
|
|
|
|
|
TSNode ts_node_next_named_sibling(TSNode);
|
|
|
|
|
TSNode ts_node_prev_named_sibling(TSNode);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
If you use this group of methods, the syntax tree functions much like an abstract syntax tree.
|
|
|
|
|
|
|
|
|
|
## Node Field Names
|
|
|
|
|
|
|
|
|
|
To make syntax nodes easier to analyze, many grammars assign unique _field names_ to particular child nodes.
|
|
|
|
|
In the [creating parsers][using fields] section, it's explained how to do this in your own grammars. If a syntax node has
|
|
|
|
|
fields, you can access its children using their field name:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
TSNode ts_node_child_by_field_name(
|
|
|
|
|
TSNode self,
|
|
|
|
|
const char *field_name,
|
|
|
|
|
uint32_t field_name_length
|
|
|
|
|
);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
Fields also have numeric ids that you can use, if you want to avoid repeated string comparisons. You can convert between
|
|
|
|
|
strings and ids using the `TSLanguage`:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
uint32_t ts_language_field_count(const TSLanguage *);
|
|
|
|
|
const char *ts_language_field_name_for_id(const TSLanguage *, TSFieldId);
|
|
|
|
|
TSFieldId ts_language_field_id_for_name(const TSLanguage *, const char *, uint32_t);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
The field ids can be used in place of the name:
|
|
|
|
|
|
|
|
|
|
```c
|
|
|
|
|
TSNode ts_node_child_by_field_id(TSNode, TSFieldId);
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
[ast]: https://en.wikipedia.org/wiki/Abstract_syntax_tree
|
|
|
|
|
[cst]: https://en.wikipedia.org/wiki/Parse_tree
|
|
|
|
|
[dom]: https://en.wikipedia.org/wiki/Document_Object_Model
|
|
|
|
|
[piece table]: <https://en.wikipedia.org/wiki/Piece_table>
|
|
|
|
|
[rope]: <https://en.wikipedia.org/wiki/Rope_(data_structure)>
|
|
|
|
|
[syntax highlighting]: https://en.wikipedia.org/wiki/Syntax_highlighting
|
|
|
|
|
[using fields]: ../creating-parsers/3-writing-the-grammar.md#using-fields
|