parser__select_tree can return true when 'left != NULL' and 'right == NULL',
which later causes a NULL pointer dereference:
src/runtime/parser.c:842:14: warning: Access to field 'ref_count' results in a dereference of a null pointer (loaded from variable 'root')
assert(root->ref_count > 0);
^~~~~~~~~~~~~~~
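A minimal sketch of the kind of guard that avoids this, not the actual change. SketchTree, sketch_select_tree, sketch_tree_retain and the error_cost comparison are hypothetical stand-ins for the runtime's own types and selection logic; the only point is that a NULL 'right' is rejected before anything can select it and later dereference it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the runtime's tree type. */
typedef struct {
  unsigned ref_count;
  unsigned error_cost;
} SketchTree;

/* Returns true when `right` should replace `left`.
 * A NULL `right` is never selected, so callers cannot end up
 * retaining a NULL root afterwards. */
static bool sketch_select_tree(const SketchTree *left, const SketchTree *right) {
  if (right == NULL) return false; /* nothing to select, keep `left` */
  if (left == NULL) return true;   /* anything beats no tree at all */
  return right->error_cost < left->error_cost; /* assumed selection rule */
}

/* Mirrors the assertion from the warning, with the NULL case made explicit. */
static void sketch_tree_retain(SketchTree *root) {
  assert(root != NULL && root->ref_count > 0);
  root->ref_count++;
}
```

With this shape, a caller that only retains the tree when sketch_select_tree returns true can no longer reach the assertion with a NULL root.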
Because repair_reduction_count is unsigned, its default of '-1' becomes
0xffffffff, so the loop is entered even when repair_reductions is NULL:
src/runtime/parser.c:691:11: warning: Dereference of null pointer
if (repair_reductions[j].params.symbol == repair->symbol) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
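The wraparound can be illustrated with a small sketch. SketchAction and sketch_count_matches are hypothetical names, not the runtime's; the sketch assumes the array pointer can be NULL while the count still holds its '-1' default, and shows that checking the pointer first keeps the loop from running.

```c
#include <assert.h> /* used only by checks on the caller's side */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a parse action with a symbol field. */
typedef struct {
  uint16_t symbol;
} SketchAction;

/* Counts the actions whose symbol matches `symbol`.
 * `count` may still be (uint32_t)-1, i.e. 0xffffffff, when no actions
 * were found, so the NULL check must come before the loop. */
static uint32_t sketch_count_matches(const SketchAction *actions,
                                     uint32_t count, uint16_t symbol) {
  if (actions == NULL) return 0; /* do not trust `count` without an array */
  uint32_t matches = 0;
  for (uint32_t j = 0; j < count; j++) {
    if (actions[j].symbol == symbol) matches++;
  }
  return matches;
}
```

An alternative with the same effect would be to initialize the count to 0 instead of '-1'; which of the two fits better depends on how the default is used elsewhere.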
Some ParseActions have a state-id of -1 which can cause an out-of-bounds read
when removing duplicate parse states. This was found by AddressSanitizer:
==90699==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6320000187f8 at pc 0x0001071220a9 bp 0x7fff595fd440 sp 0x7fff595fd438
READ of size 8 at 0x6320000187f8 thread T0
#0 0x1071220a8 in tree_sitter::build_tables::ParseTableBuilder::remove_duplicate_parse_states()::'lambda0'(unsigned long*)::operator()(unsigned long*) const build_parse_table.cc:398
#1 0x107121fa5 in void std::__1::__invoke_void_return_wrapper<void>::__call<tree_sitter::build_tables::ParseTableBuilder::remove_duplicate_parse_states()::'lambda0'(unsigned long*)&, unsigned long*>(tree_sitter::build_tables::ParseTableBuilder::remove_duplicate_parse_states()::'lambda0'(unsigned long*)&&&, unsigned long*&&) __functional_base:416
...
0x6320000187f8 is located 8 bytes to the left of 88264-byte region [0x632000018800,0x63200002e0c8)
allocated by thread T0 here:
#0 0x107b1576b in wrap__Znwm (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x6076b)
#1 0x10711da2c in std::__1::vector<unsigned long, std::__1::allocator<unsigned long> >::allocate(unsigned long) new:169
#2 0x10711d8fb in std::__1::vector<unsigned long, std::__1::allocator<unsigned long> >::vector(unsigned long) vector:1074
#3 0x107112f5c in std::__1::vector<unsigned long, std::__1::allocator<unsigned long> >::vector(unsigned long) vector:1068
#4 0x1070af381 in tree_sitter::build_tables::ParseTableBuilder::remove_duplicate_parse_states() build_parse_table.cc:378
#5 0x10709d827 in tree_sitter::build_tables::ParseTableBuilder::build() build_parse_table.cc:85
...
SUMMARY: AddressSanitizer: heap-buffer-overflow build_parse_table.cc:398 in tree_sitter::build_tables::ParseTableBuilder::remove_duplicate_parse_states()::'lambda0'(unsigned long*)::operator()(unsigned long*) const
Shadow bytes around the buggy address:
0x1c64000030a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c64000030b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c64000030c0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c64000030d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x1c64000030e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x1c64000030f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa[fa]
0x1c6400003100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1c6400003110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1c6400003120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1c6400003130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1c6400003140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
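The read "8 bytes to the left" of the region is consistent with indexing the replacement table at -1. A hedged sketch of a bounds check follows, written in C for brevity even though the builder is C++. sketch_remap_state and its parameters are hypothetical, and it assumes state ids are stored as an unsigned size type, so that -1 shows up as the largest possible value.

```c
#include <assert.h>
#include <stddef.h>

/* Rewrites *state_index through the `new_ids` table, which has
 * `state_count` entries. An index of (size_t)-1, meaning "no state",
 * fails the range check and is left untouched instead of being used
 * to read before the start of the table. */
static void sketch_remap_state(size_t *state_index, const size_t *new_ids,
                               size_t state_count) {
  assert(state_index != NULL && new_ids != NULL);
  if (*state_index >= state_count) return; /* also covers (size_t)-1 */
  *state_index = new_ids[*state_index];
}
```

Because the comparison is unsigned, a single range check handles both the -1 sentinel and any other out-of-range id.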
The current, fairly conservative approach is to avoid merging parse states when
doing so would cause a pair of tokens to co-exist for the first time in any
parse state, where the two tokens can start with the same character and at
least one of them can contain a character that is one of the grammar's separators.
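The rule described above can be sketched roughly as follows. This is an illustration only: SketchToken, its fields and the three functions are hypothetical, and the real builder presumably works on character sets derived from the lexical grammar rather than on strings of starting characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of a token for the purpose of this check. */
typedef struct {
  const char *first_chars;    /* characters the token can start with */
  bool may_contain_separator; /* can contain one of the grammar's separators */
} SketchToken;

/* True if some character can start both tokens. */
static bool sketch_share_first_char(const SketchToken *a, const SketchToken *b) {
  for (const char *c = a->first_chars; *c != '\0'; c++) {
    if (strchr(b->first_chars, *c) != NULL) return true;
  }
  return false;
}

/* True if the pair is potentially incompatible under the rule above. */
static bool sketch_may_be_incompatible(const SketchToken *a, const SketchToken *b) {
  assert(a != NULL && b != NULL);
  return sketch_share_first_char(a, b) &&
         (a->may_contain_separator || b->may_contain_separator);
}

/* A merge is blocked only if it would make a potentially incompatible
 * pair co-exist for the first time; pairs that already co-exist in
 * some parse state do not block it. */
static bool sketch_blocks_merge(const SketchToken *a, const SketchToken *b,
                                bool already_coexist) {
  return !already_coexist && sketch_may_be_incompatible(a, b);
}
```

A merge of two states would then be skipped if sketch_blocks_merge returns true for any pair formed by a token from one state and a token from the other.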
* Remove remnants of templatized remove_duplicate_states function
* Rename recovery_tokens function to get_compatible_tokens and augment it to
also compute pairs of tokens which could potentially be incompatible