* Deal with mergeability outside of error comparison function
* Make `better_version_exists` function pure (don't halt other versions
as a side effect).
* Tweak error comparison logic
Signed-off-by: Rick Winfrey <rewinfrey@github.com>
The previous approach to error recovery relied on special error-recovery
states in the parse table. For each token T, there was an error recovery
state in which the parser looked for *any* token that could follow T.
Unfortunately, sometimes the set of tokens that could follow T contained
conflicts. For example, in JS, the token '}' can be followed by the
open-ended 'template_chars' token, but also by ordinary tokens like
'identifier'. So with the old algorithm, when recovering from an
unexpected '}' token, the lexer had no way to distinguish identifiers
from template_chars.
This commit drops the error recovery states. Instead, when we encounter
an unexpected token T, we recover from the error by finding a previous
state S in the stack in which T would be valid, popping all of the nodes
after S, and wrapping them in an error.
This way, the lexer is always invoked in a normal parse state, in which
it is looking for a non-conflicting set of tokens. Eliminating the error
recovery states also shrinks the lex state machine significantly.
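The recovery step described above can be sketched as follows. This is a
minimal illustration, not the actual parser code: the `Stack` type, the toy
`valid` table, and the `recover` function are all hypothetical stand-ins, and
wrapping the popped nodes in an error is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified types: not tree-sitter's real data structures. */
enum { STATE_COUNT = 4, TOKEN_COUNT = 3 };

typedef struct {
  int states[16]; /* parse states, bottom of the stack first */
  size_t size;    /* number of entries currently on the stack */
} Stack;

/* valid[s][t] is true when token t is accepted in state s (toy parse table). */
static const bool valid[STATE_COUNT][TOKEN_COUNT] = {
  {true,  false, false},
  {false, true,  true },
  {true,  false, false},
  {true,  false, false},
};

/*
 * Search from the top of the stack downward for a state that accepts `token`.
 * On success, truncate the stack to that state and return the number of
 * entries popped (the caller would wrap those in an error node).
 * Returns -1 if no state accepts the token; the stack is left unchanged.
 */
static int recover(Stack *stack, int token) {
  assert(token >= 0 && token < TOKEN_COUNT);
  for (size_t i = stack->size; i > 0; i--) {
    int state = stack->states[i - 1];
    if (valid[state][token]) {
      int popped = (int)(stack->size - i);
      stack->size = i;
      return popped;
    }
  }
  return -1;
}
```

Because the stack is truncated to a state that already accepts T, the next
lex happens in an ordinary parse state rather than in a dedicated recovery
state.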
Signed-off-by: Rick Winfrey <rewinfrey@github.com>
We previously maintained a set of individual productions that were
involved in conflicts, but that was subtly incorrect because
we don't compare productions themselves when comparing parse items;
we only compare the parse item properties that could affect the
final reduce actions.
SpyInput uses a fixed-size buffer and explicitly zeroes memory, which is good
for catching logic errors but defeats valgrind's memory tracking. Use a
separate buffer of exactly the correct size for each request. This correctly
catches the problem under valgrind:
```
==8694== Invalid read of size 2
==8694== at 0x54EFFB: utf16_iterate (utf16.c:10)
==8694== by 0x551126: ts_lexer__get_lookahead (lexer.c:54)
==8694== by 0x5515CD: ts_lexer_start (lexer.c:154)
==8694== by 0x54699F: parser(long,...)(long long) (parser.c:297)
==8694== by 0x54788A: parser__get_lookahead (parser.c:439)
==8694== by 0x54B2D3: parser__advance (parser.c:1150)
==8694== by 0x54C2AA: parser_parse (parser.c:1348)
==8694== by 0x53F063: ts_document_parse_with_options (document.c:136)
==8694== by 0x53EF43: ts_document_parse (document.c:107)
==8694== by 0x4AED11: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}::operator()() const (document_test.cc:82)
==8694== by 0x4B56B6: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}>::_M_invoke(std::_Any_data const&) (functional:1871)
==8694== by 0x40F8C5: std::function<void ()>::operator()() const (functional:2267)
==8694== Address 0x5d08be0 is 0 bytes inside a block of size 1 alloc'd
==8694== at 0x4C2E80F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8694== by 0x507C3E: SpyInput::read(void*, unsigned int*) (spy_input.cc:66)
==8694== by 0x55103D: ts_lexer__get_chunk (lexer.c:29)
==8694== by 0x5515B6: ts_lexer_start (lexer.c:152)
==8694== by 0x54699F: parser(long,...)(long long) (parser.c:297)
==8694== by 0x54788A: parser__get_lookahead (parser.c:439)
==8694== by 0x54B2D3: parser__advance (parser.c:1150)
==8694== by 0x54C2AA: parser_parse (parser.c:1348)
==8694== by 0x53F063: ts_document_parse_with_options (document.c:136)
==8694== by 0x53EF43: ts_document_parse (document.c:107)
==8694== by 0x4AED11: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}::operator()() const (document_test.cc:82)
==8694== by 0x4B56B6: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}>::_M_invoke(std::_Any_data const&) (functional:1871)
```
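The exact-size buffer idea can be sketched roughly as below. This is an
illustration in C, not the real SpyInput (which is a C++ test helper); the
`SpyReader` struct and `spy_read` function are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the test helper; not the real SpyInput. */
typedef struct {
  const char *content; /* full input */
  size_t length;       /* total bytes in content */
  size_t position;     /* next byte to hand out */
  size_t chunk_size;   /* maximum bytes per read */
  char *buffer;        /* chunk returned by the previous read */
} SpyReader;

/*
 * Return the next chunk in a heap block of exactly `*bytes_read` bytes
 * (no padding, no terminator), so a memory checker can report any read
 * past the end of the chunk. The previous chunk is freed first.
 * Returns NULL at end of input.
 */
static const char *spy_read(SpyReader *self, size_t *bytes_read) {
  assert(self->position <= self->length);
  free(self->buffer);
  self->buffer = NULL;

  size_t remaining = self->length - self->position;
  *bytes_read = remaining < self->chunk_size ? remaining : self->chunk_size;
  if (*bytes_read == 0) return NULL;

  self->buffer = malloc(*bytes_read);
  assert(self->buffer != NULL);
  memcpy(self->buffer, self->content + self->position, *bytes_read);
  self->position += *bytes_read;
  return self->buffer;
}
```

With a shared, zero-filled, fixed-size buffer, a read one byte past the chunk
still lands in valid, initialised memory, so valgrind has nothing to report.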
Also simplify the test so we call `utf16_iterate` directly. Calling
`utf16_iterate` via `SpyInput` and `ts_document_parse` doesn't seem to reliably
trigger the problem using valgrind.
valgrind also doesn't detect the problem if we use a string literal like:
`utf16_iterate("", 1, &code_point);`
utf16_iterate does not check that 'length' is a multiple of two, which leads
to an out-of-bounds read:
```
==105293== Conditional jump or move depends on uninitialised value(s)
==105293== at 0x54F014: utf16_iterate (utf16.c:7)
==105293== by 0x539251: string_iterate(TSInputEncoding, unsigned char const*, unsigned long, int*) (encoding_helpers.cc:15)
==105293== by 0x53939D: string_byte_for_character(TSInputEncoding, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) (encoding_helpers.cc:43)
==105293== by 0x507BAD: SpyInput::read(void*, unsigned int*) (spy_input.cc:47)
==105293== by 0x551049: ts_lexer__get_chunk (lexer.c:29)
==105293== by 0x5515C2: ts_lexer_start (lexer.c:152)
==105293== by 0x5469AB: parser(long,...)(long long) (parser.c:297)
==105293== by 0x547896: parser__get_lookahead (parser.c:439)
==105293== by 0x54B2DF: parser__advance (parser.c:1150)
==105293== by 0x54C2B6: parser_parse (parser.c:1348)
==105293== by 0x53F06F: ts_document_parse_with_options (document.c:136)
==105293== by 0x53EF4F: ts_document_parse (document.c:107)
```
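One way to guard against a partial trailing code unit is sketched below. This
is not the actual fix: `utf16_iterate_checked` and `read_unit` are
hypothetical names, the signature is an assumption, and the input is assumed
to be in native byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read one native-endian 16-bit unit without alignment assumptions. */
static uint16_t read_unit(const uint8_t *p) {
  uint16_t unit;
  memcpy(&unit, p, sizeof unit);
  return unit;
}

/*
 * Hypothetical bounds-checked variant, not the actual tree-sitter function.
 * Decodes one code point from `string` and returns the bytes consumed,
 * or 0 when fewer than two bytes remain (e.g. an odd trailing byte).
 */
static size_t utf16_iterate_checked(const uint8_t *string, size_t length,
                                    int32_t *code_point) {
  assert(code_point != NULL);
  if (length < 2) return 0; /* never touch a partial code unit */

  uint16_t first = read_unit(string);
  if (first >= 0xD800 && first <= 0xDBFF && length >= 4) {
    uint16_t second = read_unit(string + 2);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      /* Valid surrogate pair: combine into one supplementary code point. */
      *code_point = 0x10000 + (((int32_t)first - 0xD800) << 10)
                            + ((int32_t)second - 0xDC00);
      return 4;
    }
  }
  /* BMP code point, or an unpaired surrogate passed through unchanged. */
  *code_point = first;
  return 2;
}
```

The key point is the `length < 2` and `length >= 4` checks, which are made
before any 16-bit read, so a one-byte buffer is never dereferenced as a full
code unit.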
This adds support for fuzzing tree-sitter grammars with libFuzzer. This
currently only works on Linux because of linking issues on macOS. Briefly, the
AddressSanitizer library is dynamically linked into the fuzzer binary and
cannot be found at runtime if built with a compiler that wasn't provided by
Xcode(?). The runtime library is statically linked on Linux so this isn't a
problem.
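For orientation, a libFuzzer target has roughly the shape below. This is a
generic sketch, not the harness added by this change: `parse_buffer` is a
hypothetical stand-in for parsing the input with the grammar under test. Only
the `LLVMFuzzerTestOneInput` entry point is part of libFuzzer itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for "parse this buffer with the grammar under test".
 * Here it only tracks brace nesting so the sketch stays self-contained. */
static int parse_buffer(const uint8_t *source, size_t length) {
  assert(source != NULL || length == 0);
  int depth = 0;
  for (size_t i = 0; i < length; i++) {
    if (source[i] == '{') depth++;
    if (source[i] == '}') depth--;
  }
  return depth;
}

/* Entry point that libFuzzer calls once per generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  (void)parse_buffer(data, size);
  return 0; /* libFuzzer expects 0 from a normal run */
}
```

Such a target is typically built with something like
`clang -fsanitize=fuzzer,address`, which supplies `main` and links the
sanitizer runtime.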