I think that state matching is the only correct strategy for incremental
node reuse that is compatible with the new error recovery algorithm. It's
also simpler than the sentential-form algorithm. With the compressed parse
tables, state matching shouldn't be too conservative of a test.
Before, any syntax error would cause the lexer to create an error
leaf node. This could happen even with a valid input, if the parse
stack had split and one particular version of the parse stack
failed to parse.
Now, an error leaf node is only created when the lexer cannot understand
part of the input stream at all. When a normal syntax error occurs,
the lexer just returns a token that is outside of the expected token
set, and the parser handles the unexpected token.
Suppose a parse state S has multiple actions for a terminal lookahead symbol A.
Then during incremental parsing, while in state S, the parser should not
reuse a non-terminal lookahead B where FIRST(B) contains A, because reusing B
might prematurely discard one of the possible actions that a batch parser
would have attempted in state S, upon seeing A as a lookahead.
Rather than letting the reduced tree become the new lookahead symbol,
and re-adding it to the stack via a subsequent shift action, just
add it to the stack as part of the reduce action. This is more in
line with the way LR is described traditionally.
Previously, the way repeat rules were expanded, the auxiliary
rule always needed to be reduced, even if the repeating content
was empty. This caused problems in parse states where some items
contained the repeat rule and some did not. To make those cases
work, the repeat rule had to explicitly be marked as optional.
With this change, that is no longer necessary.
This reverts commit 5cd07648fd.
The separators construct is useful as an optimization. It turns out that
constructing a node for every chunk of whitespace in a document causes a
significant performance regression.
Conflicts:
src/compiler/build_tables/build_lex_table.cc
src/compiler/grammar.cc
src/runtime/parser.c
When breaking down the stack in parser.c, the previous code
would not account for ubiquitous tokens. This was a problem
for a long time, but wasn't noticed until ubiquitous tokens
started being used to represent separator characters
Now, grammars can handle whitespace by making it another ubiquitous
token, like comments.
For now, this has the side effect of whitespace being included in the
tree that precedes it. This was already an issue for other ubiquitous
tokens though, so it needs to be fixed anyway.