tree-sitter/docs/src/using-parsers/2-basic-parsing.md

# Basic Parsing

## Providing the Code

In the example on the previous page, we parsed source code stored in a simple string using the `ts_parser_parse_string` function:

```c
TSTree *ts_parser_parse_string(
  TSParser *self,
  const TSTree *old_tree,
  const char *string,
  uint32_t length
);
```

You may want to parse source code that's stored in a custom data structure, like a [piece table][piece table] or a [rope][rope].
In this case, you can use the more general `ts_parser_parse` function:

```c
TSTree *ts_parser_parse(
  TSParser *self,
  const TSTree *old_tree,
  TSInput input
);
```

The `TSInput` structure lets you provide your own function for reading a chunk of text at a given byte offset and row/column
position. The function can return text encoded in either UTF-8 or UTF-16. This interface allows you to efficiently parse
text that is stored in your own data structure.

```c
typedef struct {
  void *payload;
  const char *(*read)(
    void *payload,
    uint32_t byte_offset,
    TSPoint position,
    uint32_t *bytes_read
  );
  TSInputEncoding encoding;
  TSDecodeFunction decode;
} TSInput;
```

If you want to decode text that is not encoded in UTF-8 or UTF-16, you can set the `decode` field of the input to your function
that will decode text. The signature of the `TSDecodeFunction` is as follows:

```c
typedef uint32_t (*TSDecodeFunction)(
  const uint8_t *string,
  uint32_t length,
  int32_t *code_point
);
```

```admonish attention
The `TSInputEncoding` must be set to `TSInputEncodingCustom` for the `decode` function to be called.
```

The `string` argument is a pointer to the text to decode, which comes from the `read` function, and the `length` argument
is the length of the `string`. The `code_point` argument is a pointer to an integer that represents the decoded code point,
and should be written to in your `decode` callback. The function should return the number of bytes decoded.

## Syntax Nodes

Tree-sitter provides a [DOM][dom]-style interface for inspecting syntax trees.
A syntax node's _type_ is a string that indicates which grammar rule the node represents.

```c
const char *ts_node_type(TSNode);
```

Syntax nodes store their position in the source code both in raw bytes and row/column
coordinates. In a point, rows and columns are zero-based. The `row` field represents
the number of newlines before a given position, while `column` represents the number
of bytes between the position and beginning of the line.

```c
uint32_t ts_node_start_byte(TSNode);
uint32_t ts_node_end_byte(TSNode);
typedef struct {
  uint32_t row;
  uint32_t column;
} TSPoint;
TSPoint ts_node_start_point(TSNode);
TSPoint ts_node_end_point(TSNode);
```

```admonish note
A *newline* is considered to be a single line feed (`\n`) character.
```

## Retrieving Nodes

Every tree has a _root node_:

```c
TSNode ts_tree_root_node(const TSTree *);
```

Once you have a node, you can access the node's children:

```c
uint32_t ts_node_child_count(TSNode);
TSNode ts_node_child(TSNode, uint32_t);
```

You can also access its siblings and parent:

```c
TSNode ts_node_next_sibling(TSNode);
TSNode ts_node_prev_sibling(TSNode);
TSNode ts_node_parent(TSNode);
```

These methods may all return a _null node_ to indicate, for example, that a node does not _have_ a next sibling.
You can check if a node is null:

```c
bool ts_node_is_null(TSNode);
```

## Named vs Anonymous Nodes

Tree-sitter produces [_concrete_ syntax trees][cst] — trees that contain nodes for
every individual token in the source code, including things like commas and parentheses. This is important for use-cases
that deal with individual tokens, like [syntax highlighting][syntax highlighting]. But some
types of code analysis are easier to perform using an [_abstract_ syntax tree][ast] — a tree in which the less important
details have been removed. Tree-sitter's trees support these use cases by making a distinction between
_named_ and _anonymous_ nodes.

Consider a grammar rule like this:

```js
if_statement: $ => seq("if", "(", $._expression, ")", $._statement);
```

A syntax node representing an `if_statement` in this language would have 5 children: the condition expression, the body statement,
as well as the `if`, `(`, and `)` tokens. The expression and the statement would be marked as _named_ nodes, because they
have been given explicit names in the grammar. But the `if`, `(`, and `)` nodes would _not_ be named nodes, because they
are represented in the grammar as simple strings.

You can check whether any given node is named:

```c
bool ts_node_is_named(TSNode);
```

When traversing the tree, you can also choose to skip over anonymous nodes by using the `_named_` variants of all of the
methods described above:

```c
TSNode ts_node_named_child(TSNode, uint32_t);
uint32_t ts_node_named_child_count(TSNode);
TSNode ts_node_next_named_sibling(TSNode);
TSNode ts_node_prev_named_sibling(TSNode);
```

If you use this group of methods, the syntax tree functions much like an abstract syntax tree.

## Node Field Names

To make syntax nodes easier to analyze, many grammars assign unique _field names_ to particular child nodes.
In the [creating parsers][using fields] section, it's explained how to do this in your own grammars. If a syntax node has
fields, you can access its children using their field name:

```c
TSNode ts_node_child_by_field_name(
  TSNode self,
  const char *field_name,
  uint32_t field_name_length
);
```

Fields also have numeric ids that you can use, if you want to avoid repeated string comparisons. You can convert between
strings and ids using the `TSLanguage`:

```c
uint32_t ts_language_field_count(const TSLanguage *);
const char *ts_language_field_name_for_id(const TSLanguage *, TSFieldId);
TSFieldId ts_language_field_id_for_name(const TSLanguage *, const char *, uint32_t);
```

The field ids can be used in place of the name:

```c
TSNode ts_node_child_by_field_id(TSNode, TSFieldId);
```

[ast]: https://en.wikipedia.org/wiki/Abstract_syntax_tree
[cst]: https://en.wikipedia.org/wiki/Parse_tree
[dom]: https://en.wikipedia.org/wiki/Document_Object_Model
[piece table]: <https://en.wikipedia.org/wiki/Piece_table>
[rope]: <https://en.wikipedia.org/wiki/Rope_(data_structure)>
[syntax highlighting]: https://en.wikipedia.org/wiki/Syntax_highlighting
[using fields]: ../creating-parsers/3-writing-the-grammar.md#using-fields
docs: migrate to mdbook 2024-12-24 00:43:07 -05:00			`# Basic Parsing`

			`## Providing the Code`

			In the example on the previous page, we parsed source code stored in a simple string using the `ts_parser_parse_string` function:

			```c
			`TSTree *ts_parser_parse_string(`
			`TSParser *self,`
			`const TSTree *old_tree,`
			`const char *string,`
			`uint32_t length`
			`);`
			```

			`You may want to parse source code that's stored in a custom data structure, like a [piece table][piece table] or a [rope][rope].`
			In this case, you can use the more general `ts_parser_parse` function:

			```c
			`TSTree *ts_parser_parse(`
			`TSParser *self,`
			`const TSTree *old_tree,`
			`TSInput input`
			`);`
			```

			The `TSInput` structure lets you provide your own function for reading a chunk of text at a given byte offset and row/column
			`position. The function can return text encoded in either UTF-8 or UTF-16. This interface allows you to efficiently parse`
			`text that is stored in your own data structure.`

			```c
			`typedef struct {`
			`void *payload;`
			`const char (read)(`
			`void *payload,`
			`uint32_t byte_offset,`
			`TSPoint position,`
			`uint32_t *bytes_read`
			`);`
			`TSInputEncoding encoding;`
feat(c): rename DecodeFunction to TSDecodeFunction Keep a typedef for backwards compatibility until ABI 16. 2025-08-23 10:18:15 +03:00			`TSDecodeFunction decode;`
docs: migrate to mdbook 2024-12-24 00:43:07 -05:00			`} TSInput;`
			```

			If you want to decode text that is not encoded in UTF-8 or UTF-16, you can set the `decode` field of the input to your function
feat(c): rename DecodeFunction to TSDecodeFunction Keep a typedef for backwards compatibility until ABI 16. 2025-08-23 10:18:15 +03:00			that will decode text. The signature of the `TSDecodeFunction` is as follows:
docs: migrate to mdbook 2024-12-24 00:43:07 -05:00
			```c
feat(c): rename DecodeFunction to TSDecodeFunction Keep a typedef for backwards compatibility until ABI 16. 2025-08-23 10:18:15 +03:00			`typedef uint32_t (*TSDecodeFunction)(`
docs: migrate to mdbook 2024-12-24 00:43:07 -05:00			`const uint8_t *string,`
			`uint32_t length,`
			`int32_t *code_point`
			`);`
			```

docs: add mdbook-admonish for better tooltips 2025-01-10 02:29:32 -05:00			```admonish attention
			The `TSInputEncoding` must be set to `TSInputEncodingCustom` for the `decode` function to be called.
			```
docs: migrate to mdbook 2024-12-24 00:43:07 -05:00
			The `string` argument is a pointer to the text to decode, which comes from the `read` function, and the `length` argument
			is the length of the `string`. The `code_point` argument is a pointer to an integer that represents the decoded code point,
			and should be written to in your `decode` callback. The function should return the number of bytes decoded.

			`## Syntax Nodes`

			`Tree-sitter provides a [DOM][dom]-style interface for inspecting syntax trees.`
			`A syntax node's _type_ is a string that indicates which grammar rule the node represents.`

			```c
			`const char *ts_node_type(TSNode);`
			```

docs: add back detailed info about points 2024-12-28 16:12:45 -05:00			`Syntax nodes store their position in the source code both in raw bytes and row/column`
			coordinates. In a point, rows and columns are zero-based. The `row` field represents
			the number of newlines before a given position, while `column` represents the number
			`of bytes between the position and beginning of the line.`
docs: migrate to mdbook 2024-12-24 00:43:07 -05:00
			```c
			`uint32_t ts_node_start_byte(TSNode);`
			`uint32_t ts_node_end_byte(TSNode);`
			`typedef struct {`
			`uint32_t row;`
			`uint32_t column;`
			`} TSPoint;`
			`TSPoint ts_node_start_point(TSNode);`
			`TSPoint ts_node_end_point(TSNode);`
			```

docs: clarify definition of newline 2025-08-28 13:31:25 -04:00			```admonish note
			A newline is considered to be a single line feed (`\n`) character.
			```

docs: migrate to mdbook 2024-12-24 00:43:07 -05:00			`## Retrieving Nodes`

			`Every tree has a _root node_:`

			```c
			`TSNode ts_tree_root_node(const TSTree *);`
			```

			`Once you have a node, you can access the node's children:`

			```c
			`uint32_t ts_node_child_count(TSNode);`
			`TSNode ts_node_child(TSNode, uint32_t);`
			```

			`You can also access its siblings and parent:`

			```c
			`TSNode ts_node_next_sibling(TSNode);`
			`TSNode ts_node_prev_sibling(TSNode);`
			`TSNode ts_node_parent(TSNode);`
			```

			`These methods may all return a _null node_ to indicate, for example, that a node does not _have_ a next sibling.`
			`You can check if a node is null:`

			```c
			`bool ts_node_is_null(TSNode);`
			```

			`## Named vs Anonymous Nodes`

			`Tree-sitter produces [_concrete_ syntax trees][cst] — trees that contain nodes for`
			`every individual token in the source code, including things like commas and parentheses. This is important for use-cases`
			`that deal with individual tokens, like [syntax highlighting][syntax highlighting]. But some`
			`types of code analysis are easier to perform using an [_abstract_ syntax tree][ast] — a tree in which the less important`
			`details have been removed. Tree-sitter's trees support these use cases by making a distinction between`
			`_named_ and _anonymous_ nodes.`

			`Consider a grammar rule like this:`

			```js
			`if_statement: $ => seq("if", "(", $._expression, ")", $._statement);`
			```

			A syntax node representing an `if_statement` in this language would have 5 children: the condition expression, the body statement,
			as well as the `if`, `(`, and `)` tokens. The expression and the statement would be marked as _named_ nodes, because they
			have been given explicit names in the grammar. But the `if`, `(`, and `)` nodes would _not_ be named nodes, because they
			`are represented in the grammar as simple strings.`

			`You can check whether any given node is named:`

			```c
			`bool ts_node_is_named(TSNode);`
			```

			When traversing the tree, you can also choose to skip over anonymous nodes by using the `_named_` variants of all of the
			`methods described above:`

			```c
			`TSNode ts_node_named_child(TSNode, uint32_t);`
			`uint32_t ts_node_named_child_count(TSNode);`
			`TSNode ts_node_next_named_sibling(TSNode);`
			`TSNode ts_node_prev_named_sibling(TSNode);`
			```

			`If you use this group of methods, the syntax tree functions much like an abstract syntax tree.`

			`## Node Field Names`

			`To make syntax nodes easier to analyze, many grammars assign unique _field names_ to particular child nodes.`
			`In the [creating parsers][using fields] section, it's explained how to do this in your own grammars. If a syntax node has`
			`fields, you can access its children using their field name:`

			```c
			`TSNode ts_node_child_by_field_name(`
			`TSNode self,`
			`const char *field_name,`
			`uint32_t field_name_length`
			`);`
			```

			`Fields also have numeric ids that you can use, if you want to avoid repeated string comparisons. You can convert between`
			strings and ids using the `TSLanguage`:

			```c
			`uint32_t ts_language_field_count(const TSLanguage *);`
			`const char ts_language_field_name_for_id(const TSLanguage , TSFieldId);`
			`TSFieldId ts_language_field_id_for_name(const TSLanguage , const char , uint32_t);`
			```

			`The field ids can be used in place of the name:`

			```c
			`TSNode ts_node_child_by_field_id(TSNode, TSFieldId);`
			```

			`[ast]: https://en.wikipedia.org/wiki/Abstract_syntax_tree`
			`[cst]: https://en.wikipedia.org/wiki/Parse_tree`
			`[dom]: https://en.wikipedia.org/wiki/Document_Object_Model`
			`[piece table]: <https://en.wikipedia.org/wiki/Piece_table>`
			`[rope]: <https://en.wikipedia.org/wiki/Rope_(data_structure)>`
			`[syntax highlighting]: https://en.wikipedia.org/wiki/Syntax_highlighting`
			`[using fields]: ../creating-parsers/3-writing-the-grammar.md#using-fields`