Also simplify the test so we call `utf16_iterate` directly. Calling
`utf16_iterate` via `SpyInput` and `ts_document_parse` doesn't seem to reliably
trigger the problem under valgrind.
Valgrind also doesn't detect the problem if we use a string literal like:
`utf16_iterate("", 1, &code_point);`
`utf16_iterate` does not check that `length` is a multiple of two, which leads to
an out-of-bounds read:
==105293== Conditional jump or move depends on uninitialised value(s)
==105293== at 0x54F014: utf16_iterate (utf16.c:7)
==105293== by 0x539251: string_iterate(TSInputEncoding, unsigned char const*, unsigned long, int*) (encoding_helpers.cc:15)
==105293== by 0x53939D: string_byte_for_character(TSInputEncoding, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) (encoding_helpers.cc:43)
==105293== by 0x507BAD: SpyInput::read(void*, unsigned int*) (spy_input.cc:47)
==105293== by 0x551049: ts_lexer__get_chunk (lexer.c:29)
==105293== by 0x5515C2: ts_lexer_start (lexer.c:152)
==105293== by 0x5469AB: parser(long,...)(long long) (parser.c:297)
==105293== by 0x547896: parser__get_lookahead (parser.c:439)
==105293== by 0x54B2DF: parser__advance (parser.c:1150)
==105293== by 0x54C2B6: parser_parse (parser.c:1348)
==105293== by 0x53F06F: ts_document_parse_with_options (document.c:136)
==105293== by 0x53EF4F: ts_document_parse (document.c:107)
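
As a rough illustration of the missing check, the sketch below decodes one code point from a native-byte-order UTF-16 buffer and refuses to read past `length`. The name `utf16_iterate_checked`, its signature, and its handling of unpaired surrogates are assumptions made for this example; it is not the actual fix applied to `utf16_iterate`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bounds-checked decoder (not the real utf16_iterate).
 * Reads one code point from `bytes`, interpreted as UTF-16 in native
 * byte order. Returns the number of bytes consumed, or 0 (with
 * *code_point set to -1) if fewer than two bytes remain. */
size_t utf16_iterate_checked(const uint8_t *bytes, size_t length,
                             int32_t *code_point) {
  if (length < 2) {
    /* An odd trailing byte cannot form a code unit: do not read it. */
    *code_point = -1;
    return 0;
  }

  /* memcpy avoids unaligned 16-bit loads from a byte buffer. */
  uint16_t first;
  memcpy(&first, bytes, sizeof first);

  /* A high surrogate needs a full second code unit to form a pair. */
  if (first >= 0xD800 && first <= 0xDBFF && length >= 4) {
    uint16_t second;
    memcpy(&second, bytes + 2, sizeof second);
    if (second >= 0xDC00 && second <= 0xDFFF) {
      *code_point = 0x10000 + (((int32_t)first - 0xD800) << 10) +
                    ((int32_t)second - 0xDC00);
      return 4;
    }
  }

  /* BMP character, or an unpaired surrogate passed through as-is. */
  *code_point = first;
  return 2;
}
```

Calling this with a one-byte buffer returns 0 without touching memory beyond the buffer, which is the case the valgrind trace above complains about.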
For some reason, there was previously some extra logic that prevented
the external scanner from being invoked if the only valid external
token also had an internal definition.
It's surprising not to call the external scanner when an external
token is valid.
This adds support for fuzzing tree-sitter grammars with libFuzzer. This
currently only works on Linux because of linking issues on macOS. Briefly, the
AddressSanitizer library is dynamically linked into the fuzzer binary and
cannot be found at runtime if built with a compiler that wasn't provided by
Xcode(?). The runtime library is statically linked on Linux so this isn't a
problem.
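
For readers unfamiliar with libFuzzer, a fuzz target is a single entry point that receives arbitrary bytes. The sketch below shows only that general shape: `LLVMFuzzerTestOneInput` is the standard libFuzzer entry point, while `parse_bytes` is a hypothetical stand-in for feeding the input to a grammar's parser and is not part of this change.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for "parse this input with the grammar under
 * test". A real target would hand the bytes to the generated parser;
 * this stub just counts newlines so the example is self-contained. */
static size_t parse_bytes(const uint8_t *data, size_t size) {
  size_t newlines = 0;
  for (size_t i = 0; i < size; i++) {
    if (data[i] == '\n') newlines++;
  }
  return newlines;
}

/* libFuzzer calls this repeatedly with mutated inputs. Any crash or
 * sanitizer report raised while it runs is recorded as a finding. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  (void)parse_bytes(data, size);
  return 0; /* libFuzzer targets conventionally return 0 */
}
```

With clang, a target like this is typically built with `-fsanitize=fuzzer,address`, which links in both the fuzzing driver and the AddressSanitizer runtime mentioned above.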
This speeds up parser generation by increasing the likelihood that we'll recognize
parse item sets as equivalent in advance, rather than having to merge their states
after the fact.
While generating the parse table, keep track of which tokens can follow one another.
Then use this information to evaluate token conflicts more precisely. This
results in a smaller parse table than the previous, overly conservative approach.