We previously maintained a set of individual productions that were
involved in conflicts, but that was subtly incorrect because
we don't compare productions themselves when comparing parse items;
we only compare the parse items properties that could affect the
final reduce actions.
SpyInput uses a fixed-size buffer and explicitly zeros memory which is good for
catching logic errors but defeats valgrind's memory tracking. Use a separate
buffer of exactly the correct size for each request. This correctly catches the
problem under valgrind:
```
==8694== Invalid read of size 2
==8694== at 0x54EFFB: utf16_iterate (utf16.c:10)
==8694== by 0x551126: ts_lexer__get_lookahead (lexer.c:54)
==8694== by 0x5515CD: ts_lexer_start (lexer.c:154)
==8694== by 0x54699F: parser(long,...)(long long) (parser.c:297)
==8694== by 0x54788A: parser__get_lookahead (parser.c:439)
==8694== by 0x54B2D3: parser__advance (parser.c:1150)
==8694== by 0x54C2AA: parser_parse (parser.c:1348)
==8694== by 0x53F063: ts_document_parse_with_options (document.c:136)
==8694== by 0x53EF43: ts_document_parse (document.c:107)
==8694== by 0x4AED11: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}::operator()() const (document_test.cc:82)
==8694== by 0x4B56B6: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}>::_M_invoke(std::_Any_data const&) (functional:1871)
==8694== by 0x40F8C5: std::function<void ()>::operator()() const (functional:2267)
==8694== Address 0x5d08be0 is 0 bytes inside a block of size 1 alloc'd
==8694== at 0x4C2E80F: operator new[](unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8694== by 0x507C3E: SpyInput::read(void*, unsigned int*) (spy_input.cc:66)
==8694== by 0x55103D: ts_lexer__get_chunk (lexer.c:29)
==8694== by 0x5515B6: ts_lexer_start (lexer.c:152)
==8694== by 0x54699F: parser(long,...)(long long) (parser.c:297)
==8694== by 0x54788A: parser__get_lookahead (parser.c:439)
==8694== by 0x54B2D3: parser__advance (parser.c:1150)
==8694== by 0x54C2AA: parser_parse (parser.c:1348)
==8694== by 0x53F063: ts_document_parse_with_options (document.c:136)
==8694== by 0x53EF43: ts_document_parse (document.c:107)
==8694== by 0x4AED11: {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}::operator()() const (document_test.cc:82)
==8694== by 0x4B56B6: std::_Function_handler<void (), {lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#4}::operator()() const::{lambda()#4}>::_M_invoke(std::_Any_data const&) (functional:1871)
```
Also simplify the test so we call `utf16_iterate` directly. Calling
`utf16_iterate` via `SpyInput` and `ts_document_parse` doesn't seem to reliably
trigger the problem using valgrind.
valgrind also doesn't detect the problem if we use a string literal like:
`utf16_iterate("", 1, &code_point);`
utf16_iterate does not check that 'length' is a multiple of two which leads to
an out-of-bound read:
==105293== Conditional jump or move depends on uninitialised value(s)
==105293== at 0x54F014: utf16_iterate (utf16.c:7)
==105293== by 0x539251: string_iterate(TSInputEncoding, unsigned char const*, unsigned long, int*) (encoding_helpers.cc:15)
==105293== by 0x53939D: string_byte_for_character(TSInputEncoding, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) (encoding_helpers.cc:43)
==105293== by 0x507BAD: SpyInput::read(void*, unsigned int*) (spy_input.cc:47)
==105293== by 0x551049: ts_lexer__get_chunk (lexer.c:29)
==105293== by 0x5515C2: ts_lexer_start (lexer.c:152)
==105293== by 0x5469AB: parser(long,...)(long long) (parser.c:297)
==105293== by 0x547896: parser__get_lookahead (parser.c:439)
==105293== by 0x54B2DF: parser__advance (parser.c:1150)
==105293== by 0x54C2B6: parser_parse (parser.c:1348)
==105293== by 0x53F06F: ts_document_parse_with_options (document.c:136)
==105293== by 0x53EF4F: ts_document_parse (document.c:107)
This adds support for fuzzing tree-sitter grammars with libFuzzer. This
currently only works on Linux because of linking issues on macOS. Breifly, the
AddressSanitizer library is dynamically linked into the fuzzer binary and
cannot be found at runtime if built with a compiler that wasn't provided by
Xcode(?). The runtime library is statically linked on Linux so this isn't a
problem.
This speeds up parser generation by increasing the likelihood that we'll recognize
parse item sets as equivalent in advance, rather than having to merge their states
after the fact.
While generating the parse table, keep track of which tokens can follow one another.
Then use this information to evaluate token conflicts more precisely. This will
result in a smaller parse table than the previous, overly-conservative approach.