The `pos` and `size` functions for Nodes now return TSLength structs,
which contain lengths in both characters and bytes. This is important
for knowing the number of unicode characters in a Node.
This improves test coverage of the lexer. Before, a SpyReader's read function
would return pointers into a single string that contained the entire text. This
could have masked bugs where out-of-bounds characters were being read.
Now the chunks returned by the reader are copied into a separate buffer.
Previously, the way repeat rules were expanded, the auxiliary
rule always needed to be reduced, even if the repeating content
was empty. This caused problems in parse states where some items
contained the repeat rule and some did not. To make those cases
work, the repeat rule had to explicitly be marked as optional.
With this change, that is no longer necessary.
This reverts commit 5cd07648fd.
The separators construct is useful as an optimization. It turns out that
constructing a node for every chunk of whitespace in a document causes a
significant performance regression.
Conflicts:
src/compiler/build_tables/build_lex_table.cc
src/compiler/grammar.cc
src/runtime/parser.c
When breaking down the stack in parser.c, the previous code
would not account for ubiquitous tokens. This was a problem
for a long time, but wasn't noticed until ubiquitous tokens
started being used to represent separator characters
Now, grammars can handle whitespace by making it another ubiquitous
token, like comments.
For now, this has the side effect of whitespace being included in the
tree that precedes it. This was already an issue for other ubiquitous
tokens though, so it needs to be fixed anyway.
- Node position is public. It represents the node's first character
index in the document.
- Tree offset is private. It represents the distance between the tree's
first character index and it's parent's first character index.
- Tree padding is private. It represents the amount of whitespace
(or other separator characters) immediately preceding the tree.
Previously, if an error happened right at the beginning of an error
production, the error node would be immediately shifted onto the stack
without calling the error handling function.
The lexer doesn't know the expected symbols, so it doesn't have enough
information to construct error nodes. Now, when it encounters an invalid
character, it returns NULL and the parser builds a correct error node.
Now, the root node of a document is always a document node.
It will often have only one child node which corresponds to the grammar's
start symbol, but not always. Currently, it may have more than one child
if there are ubiquitous tokens such as comments at the beginning of the
document. In the future, it will also be possible be possible to have multiple
for the document to have multiple children if the document is partially parsed.
There's no need for a `string` function since one already
exists for Nodes.
Now the root node is always stored on the document. This
means callers of `ts_document_root_node` don't need to release
its return value.