Removing this restriction created problems for the Rust grammar, and
possibly others. The proper fix would be to ensure that the 'word
token' matches *every* possible string that a 'keyword token'
matches, as opposed to just matching *some* of the same strings.
This would require us to gather a little more information
about how tokens conflict. For now, I'm just going to put back the
hard-coded logic that we had.
This prevented conflicts between some tokens from being recorded
properly. In the case of JavaScript, it prevented tree-sitter from
recognizing the conflict between the forward slash operator and the
regex token, allowing regexes to be merged into parse states containing
'/' incorrectly.
Refs tree-sitter/tree-sitter-javascript#71
This simplifies the logic for determining whether a token is reusable
and makes it more conservative. It should fix some incremental parsing
bugs that are being caught by the randomized tests on CI.
They were a vestige of when Tree-sitter did sentential form-based
incremental parsing (as opposed to simply state matching). This was
elegant but not compatible with GLR as far as I could tell.
Do not merge a token T into a parse state S if S contains
external tokens that can be *followed* by tokens that could
be shadowed by T.
At this point, the only automated test for this logic is via
the bash grammar, in which the `]` token should not be merged
into states in which `_concat` is valid, because `_concat`
can be followed by a `_special_characters` token, and `]`
would shadow `_special_characters`.
The previous approach to error recovery relied on special error-recovery
states in the parse table. For each token T, there was an error recovery
state in which the parser looked for *any* token that could follow T.
Unfortunately, sometimes the set of tokens that could follow T contained
conflicts. For example, in JS, the token '}' can be followed by the
open-ended 'template_chars' token, but also by ordinary tokens like
'identifier'. So with the old algorithm, when recovering from an
unexpected '}' token, the lexer had no way to distinguish identifiers
from template_chars.
This commit drops the error recovery states. Instead, when we encounter
an unexpected token T, we recover from the error by finding a previous
state S in the stack in which T would be valid, popping all of the nodes
after S, and wrapping them in an error.
This way, the lexer is always invoked in a normal parse state, in which
it is looking for a non-conflicting set of tokens. Eliminating the error
recovery states also shrinks the lex state machine significantly.
Signed-off-by: Rick Winfrey <rewinfrey@github.com>