The previous approach to error recovery relied on special error-recovery
states in the parse table. For each token T, there was an error recovery
state in which the parser looked for *any* token that could follow T.
Unfortunately, sometimes the set of tokens that could follow T contained
conflicts. For example, in JS, the token '}' can be followed by the
open-ended 'template_chars' token, but also by ordinary tokens like
'identifier'. So with the old algorithm, when recovering from an
unexpected '}' token, the lexer had no way to distinguish identifiers
from template_chars.
This commit drops the error recovery states. Instead, when we encounter
an unexpected token T, we recover from the error by finding a previous
state S in the stack in which T would be valid, popping all of the nodes
after S, and wrapping them in an error.
This way, the lexer is always invoked in a normal parse state, in which
it is looking for a non-conflicting set of tokens. Eliminating the error
recovery states also shrinks the lex state machine significantly.
Signed-off-by: Rick Winfrey <rewinfrey@github.com>
We previously maintained a set of individual productions that were
involved in conflicts, but that was subtly incorrect because
we don't compare productions themselves when comparing parse items;
we only compare the parse items properties that could affect the
final reduce actions.
SpyInput uses a fixed-size buffer and explicitly zeros memory which is good for
catching logic errors but defeats valgrind's memory tracking. Use a separate
buffer of exactly the correct size for each request. This correctly catches the
problem under valgrind:
```
==8694== Invalid read of size 2
==8694== at 0x54EFFB: utf16_iterate (utf16.c:10)
==8694== by 0x551126: ts_lexer__get_lookahead (lexer.c:54)
==8694== by 0x5515CD: ts_lexer_start (lexer.c:154)
==8694== by 0x54699F: parser(long,...)(long long) (parser.c:297)
==8694== by 0x54788A: parser__get_lookahead (parser.c:439)
==8694== by 0x54B2D3: parser__advance (parser.c:1150)
==8694== by 0x54C2AA: parser_parse (parser.c:1348)
==8694== by 0x53F063: ts_document_parse_with_options (document.c:136)
==8694== by 0x53EF43: ts_document_parse (document.c:107)
==8694== by 0x4AED11: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}::operator()() const (document_test.cc:82)
==8694== by 0x4B56B6: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}>::_M_invoke(std::_Any_data const&) (functional:1871)
==8694== by 0x40F8C5: std::function<void ()>::operator()() const (functional:2267)
==8694== Address 0x5d08be0 is 0 bytes inside a block of size 1 alloc'd
==8694== at 0x4C2E80F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8694== by 0x507C3E: SpyInput::read(void*, unsigned int*) (spy_input.cc:66)
==8694== by 0x55103D: ts_lexer__get_chunk (lexer.c:29)
==8694== by 0x5515B6: ts_lexer_start (lexer.c:152)
==8694== by 0x54699F: parser(long,...)(long long) (parser.c:297)
==8694== by 0x54788A: parser__get_lookahead (parser.c:439)
==8694== by 0x54B2D3: parser__advance (parser.c:1150)
==8694== by 0x54C2AA: parser_parse (parser.c:1348)
==8694== by 0x53F063: ts_document_parse_with_options (document.c:136)
==8694== by 0x53EF43: ts_document_parse (document.c:107)
==8694== by 0x4AED11: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}::operator()() const (document_test.cc:82)
==8694== by 0x4B56B6: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}>::_M_invoke(std::_Any_data const&) (functional:1871)
```
Also simplify the test so we call `utf16_iterate` directly. Calling
`utf16_iterate` via `SpyInput` and `ts_document_parse` doesn't seem to reliably
trigger the problem using valgrind.
valgrind also doesn't detect the problem if we use a string literal like:
`utf16_iterate("", 1, &code_point);`
utf16_iterate does not check that 'length' is a multiple of two which leads to
an out-of-bound read:
==105293== Conditional jump or move depends on uninitialised value(s)
==105293== at 0x54F014: utf16_iterate (utf16.c:7)
==105293== by 0x539251: string_iterate(TSInputEncoding, unsigned char const*, unsigned long, int*) (encoding_helpers.cc:15)
==105293== by 0x53939D: string_byte_for_character(TSInputEncoding, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) (encoding_helpers.cc:43)
==105293== by 0x507BAD: SpyInput::read(void*, unsigned int*) (spy_input.cc:47)
==105293== by 0x551049: ts_lexer__get_chunk (lexer.c:29)
==105293== by 0x5515C2: ts_lexer_start (lexer.c:152)
==105293== by 0x5469AB: parser(long,...)(long long) (parser.c:297)
==105293== by 0x547896: parser__get_lookahead (parser.c:439)
==105293== by 0x54B2DF: parser__advance (parser.c:1150)
==105293== by 0x54C2B6: parser_parse (parser.c:1348)
==105293== by 0x53F06F: ts_document_parse_with_options (document.c:136)
==105293== by 0x53EF4F: ts_document_parse (document.c:107)
Previously, it was possible for references to external token states to
outlive the trees to which those states belonged.
Now, instead of storing references to external token states in the Stack
and in the Lexer, we store references to the external token trees
themselves, and we retain the trees to prevent use-after-free.
This should prevent any confusing failures in the unit tests:
test/runtime/document_test.cc:381:7: warning: Passed-by-value struct argument contains uninitialized data (e.g., field: 'changed_range_count')
ts_document_parse_with_options(document, options);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test/runtime/document_test.cc:408:7: warning: Passed-by-value struct argument contains uninitialized data (e.g., field: 'changed_range_count')
ts_document_parse_with_options(document, options);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~