They were a vestige of when Tree-sitter did sentential form-based
incremental parsing (as opposed to simply state matching). This was
elegant but not compatible with GLR as far as I could tell.
* Deal with mergeability outside of the error comparison function
* Make the `better_version_exists` function pure (don't halt other versions
as a side effect).
* Tweak error comparison logic
Signed-off-by: Rick Winfrey <rewinfrey@github.com>
The previous approach to error recovery relied on special error-recovery
states in the parse table. For each token T, there was an error recovery
state in which the parser looked for *any* token that could follow T.
Unfortunately, sometimes the set of tokens that could follow T contained
conflicts. For example, in JS, the token '}' can be followed by the
open-ended 'template_chars' token, but also by ordinary tokens like
'identifier'. So with the old algorithm, when recovering from an
unexpected '}' token, the lexer had no way to distinguish identifiers
from template_chars.
This commit drops the error recovery states. Instead, when we encounter
an unexpected token T, we recover from the error by finding a previous
state S in the stack in which T would be valid, popping all of the nodes
after S, and wrapping them in an error.
This way, the lexer is always invoked in a normal parse state, in which
it is looking for a non-conflicting set of tokens. Eliminating the error
recovery states also shrinks the lex state machine significantly.
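A minimal sketch of that recovery loop, with invented names and a flat array
standing in for the real GLR stack (the actual logic lives in parser.c and is
more involved):

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Invented, flattened stand-in for the parser's GLR stack: each entry is a
 * parse state together with the symbol of the node that was pushed in it. */
typedef struct {
  uint16_t states[64];
  uint16_t node_symbols[64];
  uint32_t size;
} ParseStack;

/* Stand-in for a parse-table lookup: is `token` valid in `state`? */
static bool state_accepts_token(uint16_t state, uint16_t token) {
  return (state + token) % 3 == 0;
}

/* On an unexpected token, walk back down the stack to a state S in which the
 * token is valid, then pop every node above S; the popped nodes become the
 * children of a single error node. Returns false if no such state exists. */
static bool recover_from_error(ParseStack *stack, uint16_t lookahead) {
  for (uint32_t i = stack->size; i > 0; i--) {
    uint16_t state = stack->states[i - 1];
    if (state_accepts_token(state, lookahead)) {
      uint32_t popped = stack->size - i;
      printf("resuming in state %u, wrapping %u nodes in an error\n",
             state, popped);
      stack->size = i;
      return true;
    }
  }
  return false;
}

int main(void) {
  ParseStack stack = {{1, 4, 7, 9}, {20, 21, 22, 23}, 4};
  recover_from_error(&stack, 2);  /* 2 stands in for the unexpected token */
  return 0;
}
```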
Signed-off-by: Rick Winfrey <rewinfrey@github.com>
We previously maintained a set of individual productions that were
involved in conflicts, but that was subtly incorrect: we don't compare
productions themselves when comparing parse items; we only compare the
properties of parse items that could affect the final reduce actions.
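As an illustration of that comparison (the field names below are invented;
the real parse-item structure in the table generator differs), the idea is to
treat two items as equivalent when the reduce-relevant properties match,
regardless of which production they came from:

```
#include <stdbool.h>
#include <stdint.h>

/* Invented, simplified parse item: only the properties that can influence the
 * generated reduce action are listed, plus a production pointer that the
 * comparison deliberately ignores. */
typedef struct {
  uint16_t variable_index;  /* left-hand side of the production */
  uint32_t step_index;      /* position of the dot within the production */
  int32_t precedence;       /* precedence at the current step */
  int8_t associativity;     /* associativity at the current step */
  const void *production;   /* not consulted below */
} ParseItem;

/* Two items count as the same for conflict-detection purposes when the
 * reduce-relevant properties agree, even if their productions differ. */
static bool parse_items_equivalent(const ParseItem *a, const ParseItem *b) {
  return a->variable_index == b->variable_index &&
         a->step_index == b->step_index &&
         a->precedence == b->precedence &&
         a->associativity == b->associativity;
}

int main(void) {
  int production_1, production_2;
  ParseItem a = {3, 2, 1, 0, &production_1};
  ParseItem b = {3, 2, 1, 0, &production_2};  /* different production */
  return parse_items_equivalent(&a, &b) ? 0 : 1;
}
```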
SpyInput uses a fixed-size buffer and explicitly zeros memory, which is good
for catching logic errors but defeats valgrind's memory tracking. Use a
separate buffer of exactly the correct size for each request (a sketch of this
approach follows the trace below). This correctly catches the problem under
valgrind:
```
==8694== Invalid read of size 2
==8694== at 0x54EFFB: utf16_iterate (utf16.c:10)
==8694== by 0x551126: ts_lexer__get_lookahead (lexer.c:54)
==8694== by 0x5515CD: ts_lexer_start (lexer.c:154)
==8694== by 0x54699F: parser(long,...)(long long) (parser.c:297)
==8694== by 0x54788A: parser__get_lookahead (parser.c:439)
==8694== by 0x54B2D3: parser__advance (parser.c:1150)
==8694== by 0x54C2AA: parser_parse (parser.c:1348)
==8694== by 0x53F063: ts_document_parse_with_options (document.c:136)
==8694== by 0x53EF43: ts_document_parse (document.c:107)
==8694== by 0x4AED11: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}::operator()() const (document_test.cc:82)
==8694== by 0x4B56B6: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}>::_M_invoke(std::_Any_data const&) (functional:1871)
==8694== by 0x40F8C5: std::function<void ()>::operator()() const (functional:2267)
==8694== Address 0x5d08be0 is 0 bytes inside a block of size 1 alloc'd
==8694== at 0x4C2E80F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8694== by 0x507C3E: SpyInput::read(void*, unsigned int*) (spy_input.cc:66)
==8694== by 0x55103D: ts_lexer__get_chunk (lexer.c:29)
==8694== by 0x5515B6: ts_lexer_start (lexer.c:152)
==8694== by 0x54699F: parser(long,...)(long long) (parser.c:297)
==8694== by 0x54788A: parser__get_lookahead (parser.c:439)
==8694== by 0x54B2D3: parser__advance (parser.c:1150)
==8694== by 0x54C2AA: parser_parse (parser.c:1348)
==8694== by 0x53F063: ts_document_parse_with_options (document.c:136)
==8694== by 0x53EF43: ts_document_parse (document.c:107)
==8694== by 0x4AED11: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}::operator()() const (document_test.cc:82)
==8694== by 0x4B56B6: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}>::_M_invoke(std::_Any_data const&) (functional:1871)
```
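The change itself lives in the C++ test helper, but the idea translates to a
small C sketch: hand the lexer a freshly allocated buffer of exactly the number
of bytes being returned, so any read past the end lands outside the allocation
and valgrind can flag it. The names here (`SketchInput`, `sketch_read`) are
invented for illustration and are not the real `SpyInput` API.

```
#include <stdlib.h>
#include <string.h>

/* Invented stand-in for the test's input object. On every read it frees the
 * previous chunk and allocates a new one of exactly the number of bytes being
 * returned, so a read past the end of the chunk lands outside any allocation
 * and valgrind reports it. */
typedef struct {
  const char *content;
  size_t length;
  size_t position;
  size_t chunk_size;
  char *last_chunk;
} SketchInput;

static const char *sketch_read(SketchInput *input, unsigned *bytes_read) {
  size_t remaining = input->length - input->position;
  size_t count = remaining < input->chunk_size ? remaining : input->chunk_size;

  free(input->last_chunk);
  input->last_chunk = malloc(count);  /* exactly `count` bytes, no slack */
  if (count > 0)
    memcpy(input->last_chunk, input->content + input->position, count);

  input->position += count;
  *bytes_read = (unsigned)count;
  return input->last_chunk;
}

int main(void) {
  SketchInput input = {"hello world", 11, 0, 4, NULL};
  unsigned n = 0;
  do { sketch_read(&input, &n); } while (n > 0);
  free(input.last_chunk);
  return 0;
}
```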
Also simplify the test so we call `utf16_iterate` directly. Calling
`utf16_iterate` via `SpyInput` and `ts_document_parse` doesn't seem to reliably
trigger the problem under valgrind.
Valgrind also doesn't detect the problem if we use a string literal like:
`utf16_iterate("", 1, &code_point);`
utf16_iterate does not check that 'length' is a multiple of two, which leads to
an out-of-bounds read (a sketch of the missing check follows the trace):
```
==105293== Conditional jump or move depends on uninitialised value(s)
==105293==    at 0x54F014: utf16_iterate (utf16.c:7)
==105293==    by 0x539251: string_iterate(TSInputEncoding, unsigned char const*, unsigned long, int*) (encoding_helpers.cc:15)
==105293==    by 0x53939D: string_byte_for_character(TSInputEncoding, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) (encoding_helpers.cc:43)
==105293==    by 0x507BAD: SpyInput::read(void*, unsigned int*) (spy_input.cc:47)
==105293==    by 0x551049: ts_lexer__get_chunk (lexer.c:29)
==105293==    by 0x5515C2: ts_lexer_start (lexer.c:152)
==105293==    by 0x5469AB: parser(long,...)(long long) (parser.c:297)
==105293==    by 0x547896: parser__get_lookahead (parser.c:439)
==105293==    by 0x54B2DF: parser__advance (parser.c:1150)
==105293==    by 0x54C2B6: parser_parse (parser.c:1348)
==105293==    by 0x53F06F: ts_document_parse_with_options (document.c:136)
==105293==    by 0x53EF4F: ts_document_parse (document.c:107)
```
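A sketch of the missing check, assuming a signature along the lines of the call
quoted above (a byte pointer, a byte length, and an out parameter for the
decoded code point); the real decoder lives in utf16.c and also handles
surrogate pairs, which this sketch omits:

```
#include <stdint.h>
#include <string.h>

/* Invented stand-in for the decoder: returns the number of bytes consumed and
 * writes the decoded code point (or -1 on failure) to `code_point`. */
static uint32_t utf16_iterate_sketch(const uint8_t *string, uint32_t length,
                                     int32_t *code_point) {
  /* A UTF-16 code unit is two bytes, so fewer than two remaining bytes cannot
   * be decoded; without this check the read below would run past the buffer. */
  if (length < 2) {
    *code_point = -1;
    return 0;
  }

  uint16_t unit;
  memcpy(&unit, string, sizeof unit);  /* avoids an unaligned access */
  *code_point = unit;
  return 2;  /* bytes consumed */
}

int main(void) {
  int32_t code_point;
  const uint8_t buffer[1] = {0};
  /* With the guard, a one-byte buffer is rejected instead of over-read. */
  return utf16_iterate_sketch(buffer, 1, &code_point) == 0 ? 0 : 1;
}
```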