Perform keyword optimization using explicitly selected word token

rather than trying to infer the word token automatically.

Co-Authored-By: Ashi Krishnan <queerviolet@github.com>
Max Brunsfeld 2018-06-13 16:54:11 -07:00
parent 0e487011c0
commit e17cd42e47
12 changed files with 142 additions and 99 deletions


@ -217,6 +217,7 @@ In addition to the `name` and `rules` fields, grammars have a few other public f
* `inline` - an array of rule names that should be automatically *removed* from the grammar by replacing all of their usages with a copy of their definition. This is useful for rules that are used in multiple places but for which you *don't* want to create syntax tree nodes at runtime.
* `conflicts` - an array of arrays of rule names. Each inner array represents a set of rules that's involved in an *LR(1) conflict* that is *intended to exist* in the grammar. When these conflicts occur at runtime, Tree-sitter will use the GLR algorithm to explore all of the possible interpretations. If *multiple* parses end up succeeding, Tree-sitter will pick the subtree whose rule has the highest *dynamic precedence*.
* `externals` - an array of token names which can be returned by an *external scanner*. External scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules (e.g. Python's indentation tokens) that cannot be described by regular expressions.
* `word` - the name of a token that will match keywords for the purpose of [keyword optimization](#keyword-optimization).
## Adjusting existing grammars
@ -359,6 +360,29 @@ You may have noticed in the above examples that some of the grammar rule name li
TODO
## Keyword Optimization
Many languages have a set of *keywords*: tokens that look like identifiers but
are reserved for special purposes. For example, in Algol-like languages, `if` is
a keyword. It has the same shape as a variable name, and in some contexts (e.g.
JavaScript object literals like `{if: 'something'}`) it is treated as an
ordinary name, but in most contexts it has special meaning.

You can recognize keywords in a grammar because they appear as string literals,
or as regexes that match a small, finite set of strings.

The naïve parser generated from such a grammar can be very large and slow to
compile. Keyword optimization is the fix. Instead of building a parser that
looks for `choice('break', 'continue', 'async', ...etc)` wherever those keywords
might occur, setting `word: $ => $.identifier` instructs Tree-sitter to first
try to parse an `identifier` wherever it would have tried to parse one of those
keywords, and then check whether the parsed `identifier` actually matches a
keyword.
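
The two-step idea described above (match the word token first, then test the matched text against the keyword set) can be sketched in plain JavaScript. This is only an illustration of the technique, not Tree-sitter's generated lexer: the `KEYWORDS` set, the `WORD` pattern, and the `lexWord` helper are all hypothetical names chosen for this example.

```javascript
// Hypothetical keyword set and word pattern, chosen only for illustration.
const KEYWORDS = new Set(['break', 'continue', 'async']);
const WORD = /^[A-Za-z_][A-Za-z0-9_]*/;

// Step 1: match identifier-shaped text at `pos`.
// Step 2: check whether the matched text happens to be a keyword.
function lexWord(source, pos = 0) {
  const match = WORD.exec(source.slice(pos));
  if (match === null) return null; // no identifier-shaped text here
  const text = match[0];
  return {
    type: KEYWORDS.has(text) ? text : 'identifier',
    text,
    end: pos + text.length,
  };
}
```

Because the whole word is matched before the keyword check, `lexWord('breakfast = 1')` yields an `identifier` rather than the keyword `break` followed by stray text.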

You don't have to specify which words are keywords. Tree-sitter identifies them
automatically, as the set of terminals that your word token could match.

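
As a rough sketch of how this looks in a `grammar.js` file, the fragment below sets `word` on a small grammar. The grammar name, the rule names, and the identifier regex are assumptions made up for this example; only the `word: $ => $.identifier` field comes from the text above.

```javascript
// grammar.js (hypothetical example grammar, for illustration only)
module.exports = grammar({
  name: 'example_language', // assumed name

  // Explicitly select the token used for keyword optimization.
  word: $ => $.identifier,

  rules: {
    source_file: $ => repeat($._statement),

    _statement: $ => choice(
      $.break_statement,
      $.continue_statement,
      $.expression_statement
    ),

    // 'break' and 'continue' are string terminals that `identifier`
    // could also match, so they would be treated as keywords.
    break_statement: $ => seq('break', ';'),
    continue_statement: $ => seq('continue', ';'),

    expression_statement: $ => seq($.identifier, ';'),

    identifier: $ => /[a-z_]+/
  }
});
```

Note that the keywords are never listed separately: they are just the string terminals in the rules that the `identifier` pattern could also match.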
[cst]: https://en.wikipedia.org/wiki/Parse_tree
[non-terminal]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols
[language-spec]: https://en.wikipedia.org/wiki/Programming_language_specification