Merge pull request #176 from tree-sitter/explicit-word-token
Perform keyword optimization using explicitly selected word token
Commit 245052442a
22 changed files with 305 additions and 247 deletions
@@ -32,6 +32,7 @@
<div id="current-page-table-of-contents">
{% capture whitespace %}
{% assign min_header = 2 %}
{% assign max_header = 3 %}
{% assign nodes = content | split: "<h" %}
{% assign first_header = true %}
{% for node in nodes %}

@@ -41,7 +42,7 @@
{% assign header_level = node | replace: '"', '' | slice: 0, 1 | times: 1 %}

{% if header_level < min_header or header_level > maxHeader %}
{% if header_level < min_header or header_level > max_header %}
{% continue %}
{% endif %}

@@ -127,7 +128,7 @@
}
});

$('h1, h2, h3, h4, h5, h6').filter('[id]').each(function() {
$('h1, h2, h3').filter('[id]').each(function() {
$(this).html('<a href="#'+$(this).attr('id')+'">' + $(this).text() + '</a>');
});
</script>

@@ -211,12 +211,13 @@ The following is a complete list of built-in functions you can use to define Tree-sitter grammars:
* **Tokens : `token(rule)`** - This function marks the given rule as producing only a single token. Tree-sitter's default is to treat each String or RegExp literal in the grammar as a separate token. Each token is matched separately by the lexer and returned as its own leaf node in the tree. The `token` function allows you to express a complex rule using the functions described above (rather than as a single regular expression) but still have Tree-sitter treat it as a single token.
* **Aliases : `alias(rule, name)`** - This function causes the given rule to *appear* with an alternative name in the syntax tree. It is useful in cases where a language construct needs to be parsed differently in different contexts (and thus needs to be defined using multiple symbols), but should always *appear* as the same type of node. Both functions are illustrated in the sketch after this list.
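
As a rough illustration (the grammar name and the rule names `arrow`, `member_access`, `property_identifier`, and `identifier` are invented here, not taken from any particular grammar), `token` and `alias` might be used like this:

```js
grammar({
  name: 'token_alias_example',

  rules: {
    // `token` makes the whole sequence lex as a single `=>` token,
    // rather than as separate `=` and `>` tokens.
    arrow: $ => token(seq('=', '>')),

    // In this context the matched identifier appears in the syntax tree
    // as a `property_identifier` node instead of an `identifier` node.
    member_access: $ => seq(
      $.identifier,
      '.',
      alias($.identifier, $.property_identifier)
    ),

    identifier: $ => /[a-z]+/
  }
});
```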

In addition to the `name` and `rules` fields, grammars have a few other public fields that influence the behavior of the parser.
In addition to the `name` and `rules` fields, grammars have a few other optional public fields that influence the behavior of the parser.

* `extras` - an array of tokens that may appear *anywhere* in the language. This is often used for whitespace and comments. The default for `extras` in `tree-sitter-cli` is to accept whitespace. To control whitespace explicitly, specify `extras: $ => []` in the grammar.
* `inline` - an array of rule names that should be automatically *removed* from the grammar by replacing all of their usages with a copy of their definition. This is useful for rules that are used in multiple places but for which you *don't* want to create syntax tree nodes at runtime.
* `conflicts` - an array of arrays of rule names. Each inner array represents a set of rules that's involved in an *LR(1) conflict* that is *intended to exist* in the grammar. When these conflicts occur at runtime, Tree-sitter will use the GLR algorithm to explore all of the possible interpretations. If *multiple* parses end up succeeding, Tree-sitter will pick the subtree whose corresponding rule has the highest *dynamic precedence*.
* `externals` - an array of token names which can be returned by an *external scanner*. External scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules (e.g. Python's indentation tokens) that cannot be described by regular expressions.
* `word` - the name of a token that will match keywords for the purpose of the [keyword extraction](#keyword-extraction) optimization. A sketch using all of these fields follows this list.
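
To make the shape of these fields concrete, here is a minimal sketch of a grammar that uses all of them. The language name and every rule name below (`comment`, `_expression`, `type_expression`, `value_expression`, `indent`, `dedent`, `identifier`) are invented for illustration only, and the rules are deliberately trivial:

```js
grammar({
  name: 'example_language',

  // Whitespace and comments may appear anywhere between tokens.
  extras: $ => [/\s/, $.comment],

  // `_expression` is expanded in place and never appears in the tree.
  inline: $ => [$._expression],

  // An intended LR(1) conflict, resolved at runtime with the GLR algorithm.
  conflicts: $ => [
    [$.type_expression, $.value_expression]
  ],

  // Tokens produced by a hand-written external scanner.
  externals: $ => [$.indent, $.dedent],

  // The token used for the keyword-extraction optimization.
  word: $ => $.identifier,

  rules: {
    source_file: $ => repeat($._expression),
    _expression: $ => choice($.type_expression, $.value_expression),
    type_expression: $ => $.identifier,
    value_expression: $ => $.identifier,
    comment: $ => token(seq('#', /.*/)),
    identifier: $ => /[a-z]+/
  }
});
```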

## Adjusting existing grammars

@@ -355,11 +356,81 @@ For an expression like `a * b * c`, it's not clear whether we mean `a * (b * c)`

You may have noticed in the above examples that some of the grammar rule names like `_expression` and `_type` begin with an underscore. Starting a rule's name with an underscore causes the rule to be *hidden* in the syntax tree. This is useful for rules like `_expression` in the grammars above, which always just wrap a single child node. If these nodes were not hidden, they would add substantial depth and noise to the syntax tree without making it any easier to understand.
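
As a minimal sketch (a fragment of a `rules` object, with hypothetical rule names rather than the grammars referenced above), a hidden wrapper rule looks like this:

```js
rules: {
  // Because its name starts with an underscore, `_expression` is hidden:
  // the tree contains an `identifier`, `number`, or `string` node directly,
  // with no intermediate `_expression` node wrapping it.
  _expression: $ => choice(
    $.identifier,
    $.number,
    $.string
  ),

  identifier: $ => /[a-z]+/,
  number: $ => /\d+/,
  string: $ => /"[^"]*"/
}
```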

## Dealing with LR conflicts
### Dealing with LR conflicts

TODO
...

## Lexical Analysis

Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and [lexing][lexing] - the process of grouping individual characters into the language's fundamental *tokens*. There are a few important things to know about how Tree-sitter's lexing works.

### Conflict Resolution

Grammars often contain multiple tokens that can match the same characters. For example, a grammar might contain the tokens `"if"` and `/[a-z]+/`. Tree-sitter differentiates between these conflicting tokens in a few ways:

1. **Context-aware lexing** - Tree-sitter performs lexing on-demand, during the parsing process. At any given position in a source document, the lexer only tries to recognize tokens that are *valid* at that position in the document.

2. **Longest-match** - If multiple valid tokens match the characters at a given position in a document, Tree-sitter will select the token that matches the [longest sequence of characters][longest-match].

3. **Lexical Precedence** - When the precedence functions described [above](#using-the-grammar-dsl) are used within the `token` function, the given precedence values serve as instructions to the lexer. If there are two valid tokens that match the same sequence of characters, Tree-sitter will select the one with the higher precedence (see the sketch after this list).
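
As a sketch of lexical precedence (a fragment of a `rules` object; the `unit` and `identifier` rules are hypothetical), two tokens that can match exactly the same string can be disambiguated by giving one of them a higher precedence inside `token`:

```js
rules: {
  // Both rules can match the string "em". Because `unit` is wrapped in
  // `token(prec(1, ...))`, the lexer prefers it wherever both tokens are
  // valid and match the same characters.
  unit: $ => token(prec(1, /px|em|rem/)),

  identifier: $ => /[a-z]+/
}
```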

### Keywords

If your language has keywords that would also be matched by another rule (typically `identifier`), you can tell Tree-sitter about this with your grammar's `word` property.

```js
grammar({
  word: $ => $.identifier,

  rules: {
    class_declaration: $ => seq(
      'class',
      $.identifier,
      $.class_body
    ),

    break_statement: $ => seq('break', ';'),

    continue_statement: $ => seq('continue', ';'),

    identifier: $ => /[a-z]+/
  }
})
```

In this case, we're specifying `identifier` as our `word`. Tree-sitter will automatically find the set of terminals that are matched by `$.identifier` and consider them keywords. Instead of generating a parser which scans for each keyword individually, Tree-sitter will generate a parser that tries to match the word rule (in this case, `identifier`) and then checks whether the matched word is the necessary keyword.

This makes the set of parse states smaller, so the parser compiles faster.

It *also changes behavior*. Consider this grammar:

```js
grammar({
  rules: {
    import: $ => seq(
      'import',
      $.identifier,
      'as',
      $.identifier
    ),

    identifier: $ => /[a-z]+/
  }
})
```

Without the `word` property, the grammar matches this input:

```
import foo asbar
```

This is probably not what you want. If we add `word: $ => $.identifier`, this input will no longer parse: when the parser expects `'as'`, it will lex a whole word (the identifier `'asbar'`), compare it to `'as'`, and correctly generate an error.

[lexing]: https://en.wikipedia.org/wiki/Lexical_analysis
[longest-match]: https://en.wikipedia.org/wiki/Maximal_munch
[cst]: https://en.wikipedia.org/wiki/Parse_tree
[dfa]: https://en.wikipedia.org/wiki/Deterministic_finite_automaton
[non-terminal]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols
[language-spec]: https://en.wikipedia.org/wiki/Programming_language_specification
[glr-parsing]: https://en.wikipedia.org/wiki/GLR_parser
|
||||
|
|
|
|||
|
|
@ -19,6 +19,7 @@ typedef enum {
|
|||
TSCompileErrorTypeEpsilonRule,
|
||||
TSCompileErrorTypeInvalidTokenContents,
|
||||
TSCompileErrorTypeInvalidRuleName,
|
||||
TSCompileErrorTypeInvalidWordRule,
|
||||
} TSCompileErrorType;
|
||||
|
||||
typedef struct {
|
||||
|
|
|
|||
|
|
@ -49,6 +49,19 @@ using rules::Symbol;
|
|||
using rules::Metadata;
|
||||
using rules::Seq;
|
||||
|
||||
enum ConflictStatus {
|
||||
DoesNotMatch = 0,
|
||||
MatchesShorterStringWithinSeparators = 1 << 0,
|
||||
MatchesSameString = 1 << 1,
|
||||
MatchesLongerString = 1 << 2,
|
||||
MatchesLongerStringWithValidNextChar = 1 << 3,
|
||||
CannotDistinguish = (
|
||||
MatchesShorterStringWithinSeparators |
|
||||
MatchesSameString |
|
||||
MatchesLongerStringWithValidNextChar
|
||||
),
|
||||
};
|
||||
|
||||
static const std::unordered_set<ParseStateId> EMPTY;
|
||||
|
||||
bool CoincidentTokenIndex::contains(Symbol a, Symbol b) const {
|
||||
|
|
@ -65,14 +78,12 @@ const std::unordered_set<ParseStateId> &CoincidentTokenIndex::states_with(Symbol
|
|||
}
|
||||
}
|
||||
|
||||
template <bool include_all>
|
||||
class CharacterAggregator {
|
||||
class StartingCharacterAggregator {
|
||||
public:
|
||||
void apply(const Rule &rule) {
|
||||
rule.match(
|
||||
[this](const Seq &sequence) {
|
||||
apply(*sequence.left);
|
||||
if (include_all) apply(*sequence.right);
|
||||
},
|
||||
|
||||
[this](const rules::Choice &rule) {
|
||||
|
|
@ -91,9 +102,6 @@ class CharacterAggregator {
|
|||
CharacterSet result;
|
||||
};
|
||||
|
||||
using StartingCharacterAggregator = CharacterAggregator<false>;
|
||||
using AllCharacterAggregator = CharacterAggregator<true>;
|
||||
|
||||
class LexTableBuilderImpl : public LexTableBuilder {
|
||||
LexTable main_lex_table;
|
||||
LexTable keyword_lex_table;
|
||||
|
|
@ -109,7 +117,7 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
vector<ConflictStatus> conflict_matrix;
|
||||
bool conflict_detection_mode;
|
||||
LookaheadSet keyword_symbols;
|
||||
Symbol keyword_capture_token;
|
||||
Symbol word_rule;
|
||||
char encoding_buffer[8];
|
||||
|
||||
public:
|
||||
|
|
@ -125,7 +133,7 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
parse_table(parse_table),
|
||||
conflict_matrix(lexical_grammar.variables.size() * lexical_grammar.variables.size(), DoesNotMatch),
|
||||
conflict_detection_mode(false),
|
||||
keyword_capture_token(rules::NONE()) {
|
||||
word_rule(syntax_grammar.word_rule) {
|
||||
|
||||
// Compute the possible separator rules and the set of separator characters that can occur
|
||||
// immediately after any token.
|
||||
|
|
@ -141,7 +149,6 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
// characters that can follow each token. Also identify all of the tokens that can be
|
||||
// considered 'keywords'.
|
||||
LOG_START("characterizing tokens");
|
||||
LookaheadSet potential_keyword_symbols;
|
||||
for (unsigned i = 0, n = grammar.variables.size(); i < n; i++) {
|
||||
Symbol token = Symbol::terminal(i);
|
||||
|
||||
|
|
@ -158,31 +165,6 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
});
|
||||
}
|
||||
following_characters_by_token[i] = following_character_aggregator.result;
|
||||
|
||||
AllCharacterAggregator all_character_aggregator;
|
||||
all_character_aggregator.apply(grammar.variables[i].rule);
|
||||
|
||||
if (
|
||||
!starting_character_aggregator.result.includes_all &&
|
||||
!all_character_aggregator.result.includes_all
|
||||
) {
|
||||
bool starts_alpha = true, all_alnum = true;
|
||||
for (auto character : starting_character_aggregator.result.included_chars) {
|
||||
if (!iswalpha(character) && character != '_') {
|
||||
starts_alpha = false;
|
||||
}
|
||||
}
|
||||
for (auto character : all_character_aggregator.result.included_chars) {
|
||||
if (!iswalnum(character) && character != '_') {
|
||||
all_alnum = false;
|
||||
}
|
||||
}
|
||||
if (starts_alpha && all_alnum) {
|
||||
LOG("potential keyword: %s", token_name(token).c_str());
|
||||
potential_keyword_symbols.insert(token);
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
LOG_END();
|
||||
|
||||
|
|
@ -205,98 +187,83 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
}
|
||||
LOG_END();
|
||||
|
||||
LOG_START("finding keyword capture token");
|
||||
for (Symbol::Index i = 0, n = grammar.variables.size(); i < n; i++) {
|
||||
Symbol candidate = Symbol::terminal(i);
|
||||
if (word_rule != rules::NONE()) {
|
||||
identify_keywords();
|
||||
}
|
||||
}
|
||||
|
||||
LookaheadSet homonyms;
|
||||
potential_keyword_symbols.for_each([&](Symbol other_token) {
|
||||
if (get_conflict_status(other_token, candidate) & MatchesShorterStringWithinSeparators) {
|
||||
homonyms.clear();
|
||||
return false;
|
||||
}
|
||||
if (get_conflict_status(candidate, other_token) == MatchesSameString) {
|
||||
homonyms.insert(other_token);
|
||||
}
|
||||
return true;
|
||||
});
|
||||
if (homonyms.empty()) continue;
|
||||
|
||||
LOG_START(
|
||||
"keyword capture token candidate: %s, homonym count: %lu",
|
||||
token_name(candidate).c_str(),
|
||||
homonyms.size()
|
||||
);
|
||||
|
||||
homonyms.for_each([&](Symbol homonym1) {
|
||||
homonyms.for_each([&](Symbol homonym2) {
|
||||
if (get_conflict_status(homonym1, homonym2) & MatchesSameString) {
|
||||
LOG(
|
||||
"conflict between homonyms %s %s",
|
||||
token_name(homonym1).c_str(),
|
||||
token_name(homonym2).c_str()
|
||||
);
|
||||
homonyms.remove(homonym1);
|
||||
}
|
||||
return false;
|
||||
});
|
||||
return true;
|
||||
});
|
||||
|
||||
for (Symbol::Index j = 0; j < n; j++) {
|
||||
Symbol other_token = Symbol::terminal(j);
|
||||
if (other_token == candidate || homonyms.contains(other_token)) continue;
|
||||
bool candidate_shadows_other = get_conflict_status(other_token, candidate);
|
||||
bool other_shadows_candidate = get_conflict_status(candidate, other_token);
|
||||
|
||||
if (candidate_shadows_other || other_shadows_candidate) {
|
||||
homonyms.for_each([&](Symbol homonym) {
|
||||
bool other_shadows_homonym = get_conflict_status(homonym, other_token);
|
||||
|
||||
bool candidate_was_already_present = true;
|
||||
for (ParseStateId state_id : coincident_token_index.states_with(homonym, other_token)) {
|
||||
if (!parse_table->states[state_id].has_terminal_entry(candidate)) {
|
||||
candidate_was_already_present = false;
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (candidate_was_already_present) return true;
|
||||
|
||||
if (candidate_shadows_other) {
|
||||
homonyms.remove(homonym);
|
||||
LOG(
|
||||
"remove %s because candidate would shadow %s",
|
||||
token_name(homonym).c_str(),
|
||||
token_name(other_token).c_str()
|
||||
);
|
||||
} else if (other_shadows_candidate && !other_shadows_homonym) {
|
||||
homonyms.remove(homonym);
|
||||
LOG(
|
||||
"remove %s because %s would shadow candidate",
|
||||
token_name(homonym).c_str(),
|
||||
token_name(other_token).c_str()
|
||||
);
|
||||
}
|
||||
return true;
|
||||
});
|
||||
}
|
||||
void identify_keywords() {
|
||||
LookaheadSet homonyms;
|
||||
for (Symbol::Index j = 0, n = grammar.variables.size(); j < n; j++) {
|
||||
Symbol other_token = Symbol::terminal(j);
|
||||
if (get_conflict_status(word_rule, other_token) == MatchesSameString) {
|
||||
homonyms.insert(other_token);
|
||||
}
|
||||
|
||||
if (homonyms.size() > keyword_symbols.size()) {
|
||||
LOG_START("found capture token. homonyms:");
|
||||
homonyms.for_each([&](Symbol homonym) {
|
||||
LOG("%s", token_name(homonym).c_str());
|
||||
return true;
|
||||
});
|
||||
LOG_END();
|
||||
keyword_symbols = homonyms;
|
||||
keyword_capture_token = candidate;
|
||||
}
|
||||
|
||||
LOG_END();
|
||||
}
|
||||
|
||||
LOG_END();
|
||||
homonyms.for_each([&](Symbol homonym1) {
|
||||
homonyms.for_each([&](Symbol homonym2) {
|
||||
if (get_conflict_status(homonym1, homonym2) & MatchesSameString) {
|
||||
LOG(
|
||||
"conflict between homonyms %s %s",
|
||||
token_name(homonym1).c_str(),
|
||||
token_name(homonym2).c_str()
|
||||
);
|
||||
homonyms.remove(homonym1);
|
||||
}
|
||||
return false;
|
||||
});
|
||||
return true;
|
||||
});
|
||||
|
||||
for (Symbol::Index j = 0, n = grammar.variables.size(); j < n; j++) {
|
||||
Symbol other_token = Symbol::terminal(j);
|
||||
if (other_token == word_rule || homonyms.contains(other_token)) continue;
|
||||
bool word_rule_shadows_other = get_conflict_status(other_token, word_rule);
|
||||
bool other_shadows_word_rule = get_conflict_status(word_rule, other_token);
|
||||
|
||||
if (word_rule_shadows_other || other_shadows_word_rule) {
|
||||
homonyms.for_each([&](Symbol homonym) {
|
||||
bool other_shadows_homonym = get_conflict_status(homonym, other_token);
|
||||
|
||||
bool word_rule_was_already_present = true;
|
||||
for (ParseStateId state_id : coincident_token_index.states_with(homonym, other_token)) {
|
||||
if (!parse_table->states[state_id].has_terminal_entry(word_rule)) {
|
||||
word_rule_was_already_present = false;
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (word_rule_was_already_present) return true;
|
||||
|
||||
if (word_rule_shadows_other) {
|
||||
homonyms.remove(homonym);
|
||||
LOG(
|
||||
"remove %s because word_rule would shadow %s",
|
||||
token_name(homonym).c_str(),
|
||||
token_name(other_token).c_str()
|
||||
);
|
||||
} else if (other_shadows_word_rule && !other_shadows_homonym) {
|
||||
homonyms.remove(homonym);
|
||||
LOG(
|
||||
"remove %s because %s would shadow word_rule",
|
||||
token_name(homonym).c_str(),
|
||||
token_name(other_token).c_str()
|
||||
);
|
||||
}
|
||||
return true;
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
if (!homonyms.empty()) {
|
||||
LOG_START("found keywords:");
|
||||
homonyms.for_each([&](Symbol homonym) {
|
||||
LOG("%s", token_name(homonym).c_str());
|
||||
return true;
|
||||
});
|
||||
LOG_END();
|
||||
keyword_symbols = homonyms;
|
||||
}
|
||||
}
|
||||
|
||||
BuildResult build() {
|
||||
|
|
@ -307,8 +274,8 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
for (ParseState &parse_state : parse_table->states) {
|
||||
LookaheadSet token_set;
|
||||
for (auto &entry : parse_state.terminal_entries) {
|
||||
if (keyword_capture_token.is_terminal() && keyword_symbols.contains(entry.first)) {
|
||||
token_set.insert(keyword_capture_token);
|
||||
if (word_rule.is_terminal() && keyword_symbols.contains(entry.first)) {
|
||||
token_set.insert(word_rule);
|
||||
} else {
|
||||
token_set.insert(entry.first);
|
||||
}
|
||||
|
|
@ -337,7 +304,19 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
|
||||
mark_fragile_tokens();
|
||||
remove_duplicate_lex_states(main_lex_table);
|
||||
return {main_lex_table, keyword_lex_table, keyword_capture_token};
|
||||
return {main_lex_table, keyword_lex_table, word_rule};
|
||||
}
|
||||
|
||||
bool does_token_shadow_other(Symbol token, Symbol shadowed_token) const {
|
||||
if (token == word_rule && keyword_symbols.contains(shadowed_token)) return false;
|
||||
return get_conflict_status(shadowed_token, token) & (
|
||||
MatchesShorterStringWithinSeparators |
|
||||
MatchesLongerStringWithValidNextChar
|
||||
);
|
||||
}
|
||||
|
||||
bool does_token_match_same_string_as_other(Symbol token, Symbol shadowed_token) const {
|
||||
return get_conflict_status(shadowed_token, token) & MatchesSameString;
|
||||
}
|
||||
|
||||
ConflictStatus get_conflict_status(Symbol shadowed_token, Symbol other_token) const {
|
||||
|
|
@ -410,12 +389,14 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
advance_symbol,
|
||||
MatchesLongerStringWithValidNextChar
|
||||
)) {
|
||||
LOG(
|
||||
"%s shadows %s followed by '%s'",
|
||||
token_name(advance_symbol).c_str(),
|
||||
token_name(accept_action.symbol).c_str(),
|
||||
log_char(*conflicting_following_chars.included_chars.begin())
|
||||
);
|
||||
if (!conflicting_following_chars.included_chars.empty()) {
|
||||
LOG(
|
||||
"%s shadows %s followed by '%s'",
|
||||
token_name(advance_symbol).c_str(),
|
||||
token_name(accept_action.symbol).c_str(),
|
||||
log_char(*conflicting_following_chars.included_chars.begin())
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@ -665,8 +646,12 @@ LexTableBuilder::BuildResult LexTableBuilder::build() {
|
|||
return static_cast<LexTableBuilderImpl *>(this)->build();
|
||||
}
|
||||
|
||||
ConflictStatus LexTableBuilder::get_conflict_status(Symbol a, Symbol b) const {
|
||||
return static_cast<const LexTableBuilderImpl *>(this)->get_conflict_status(a, b);
|
||||
bool LexTableBuilder::does_token_shadow_other(Symbol a, Symbol b) const {
|
||||
return static_cast<const LexTableBuilderImpl *>(this)->does_token_shadow_other(a, b);
|
||||
}
|
||||
|
||||
bool LexTableBuilder::does_token_match_same_string_as_other(Symbol a, Symbol b) const {
|
||||
return static_cast<const LexTableBuilderImpl *>(this)->does_token_match_same_string_as_other(a, b);
|
||||
}
|
||||
|
||||
} // namespace build_tables
|
||||
|
|
|
|||
|
|
@ -30,19 +30,6 @@ namespace build_tables {
|
|||
|
||||
class LookaheadSet;
|
||||
|
||||
enum ConflictStatus {
|
||||
DoesNotMatch = 0,
|
||||
MatchesShorterStringWithinSeparators = 1 << 0,
|
||||
MatchesSameString = 1 << 1,
|
||||
MatchesLongerString = 1 << 2,
|
||||
MatchesLongerStringWithValidNextChar = 1 << 3,
|
||||
CannotDistinguish = (
|
||||
MatchesShorterStringWithinSeparators |
|
||||
MatchesSameString |
|
||||
MatchesLongerStringWithValidNextChar
|
||||
),
|
||||
};
|
||||
|
||||
struct CoincidentTokenIndex {
|
||||
std::unordered_map<
|
||||
std::pair<rules::Symbol::Index, rules::Symbol::Index>,
|
||||
|
|
@ -69,7 +56,8 @@ class LexTableBuilder {
|
|||
|
||||
BuildResult build();
|
||||
|
||||
ConflictStatus get_conflict_status(rules::Symbol, rules::Symbol) const;
|
||||
bool does_token_shadow_other(rules::Symbol, rules::Symbol) const;
|
||||
bool does_token_match_same_string_as_other(rules::Symbol, rules::Symbol) const;
|
||||
|
||||
protected:
|
||||
LexTableBuilder() = default;
|
||||
|
|
|
|||
|
|
@ -134,11 +134,6 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
|
|||
}
|
||||
|
||||
void build_error_parse_state(ParseStateId state_id) {
|
||||
unsigned CannotMerge = (
|
||||
MatchesShorterStringWithinSeparators |
|
||||
MatchesLongerStringWithValidNextChar
|
||||
);
|
||||
|
||||
parse_table.states[state_id].terminal_entries.clear();
|
||||
|
||||
// First, identify the conflict-free tokens.
|
||||
|
|
@ -149,7 +144,7 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
|
|||
for (unsigned j = 0; j < lexical_grammar.variables.size(); j++) {
|
||||
Symbol other_token = Symbol::terminal(j);
|
||||
if (!coincident_token_index.contains(token, other_token) &&
|
||||
(lex_table_builder->get_conflict_status(other_token, token) & CannotMerge)) {
|
||||
lex_table_builder->does_token_shadow_other(token, other_token)) {
|
||||
conflicts_with_other_tokens = true;
|
||||
break;
|
||||
}
|
||||
|
|
@ -171,7 +166,7 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
|
|||
bool conflicts_with_other_tokens = false;
|
||||
conflict_free_tokens.for_each([&](Symbol other_token) {
|
||||
if (!coincident_token_index.contains(token, other_token) &&
|
||||
(lex_table_builder->get_conflict_status(other_token, token) & CannotMerge)) {
|
||||
lex_table_builder->does_token_shadow_other(token, other_token)) {
|
||||
LOG(
|
||||
"exclude %s: conflicts with %s",
|
||||
symbol_name(token).c_str(),
|
||||
|
|
@ -517,7 +512,8 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
|
|||
// Do not add a token if it conflicts with an existing token.
|
||||
if (!new_token.is_built_in()) {
|
||||
for (const auto &entry : state.terminal_entries) {
|
||||
if (lex_table_builder->get_conflict_status(entry.first, new_token) & CannotDistinguish) {
|
||||
if (lex_table_builder->does_token_shadow_other(new_token, entry.first) ||
|
||||
lex_table_builder->does_token_match_same_string_as_other(new_token, entry.first)) {
|
||||
LOG_IF(
|
||||
logged_conflict_tokens.insert({entry.first, new_token}).second,
|
||||
"cannot merge parse states due to token conflict: %s and %s",
|
||||
|
|
|
|||
|
|
@ -32,6 +32,7 @@ struct InputGrammar {
|
|||
std::vector<std::unordered_set<rules::NamedSymbol>> expected_conflicts;
|
||||
std::vector<rules::Rule> external_tokens;
|
||||
std::unordered_set<rules::NamedSymbol> variables_to_inline;
|
||||
rules::NamedSymbol word_rule;
|
||||
};
|
||||
|
||||
} // namespace tree_sitter
|
||||
|
|
|
|||
|
|
@ -1,4 +1,5 @@
|
|||
#include "compiler/log.h"
|
||||
#include <cassert>
|
||||
|
||||
static const char *SPACES = " ";
|
||||
|
||||
|
|
@ -21,6 +22,7 @@ void _indent_logs() {
|
|||
}
|
||||
|
||||
void _outdent_logs() {
|
||||
assert(_indent_level > 0);
|
||||
_indent_level--;
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -229,7 +229,9 @@ ParseGrammarResult parse_grammar(const string &input) {
|
|||
string error_message;
|
||||
string name;
|
||||
InputGrammar grammar;
|
||||
json_value name_json, rules_json, extras_json, conflicts_json, external_tokens_json, inline_rules_json;
|
||||
json_value
|
||||
name_json, rules_json, extras_json, conflicts_json, external_tokens_json,
|
||||
inline_rules_json, word_rule_json;
|
||||
|
||||
json_settings settings = { 0, json_enable_comments, 0, 0, 0, 0 };
|
||||
char parse_error[json_error_max];
|
||||
|
|
@ -359,6 +361,16 @@ ParseGrammarResult parse_grammar(const string &input) {
|
|||
}
|
||||
}
|
||||
|
||||
word_rule_json = grammar_json->operator[]("word");
|
||||
if (word_rule_json.type != json_none) {
|
||||
if (word_rule_json.type != json_string) {
|
||||
error_message = "Invalid word property";
|
||||
goto error;
|
||||
}
|
||||
|
||||
grammar.word_rule = NamedSymbol { word_rule_json.u.string.ptr };
|
||||
}
|
||||
|
||||
json_value_free(grammar_json);
|
||||
return { name, grammar, "" };
|
||||
|
||||
|
|
|
|||
|
|
@ -106,6 +106,7 @@ InitialSyntaxGrammar expand_repeats(const InitialSyntaxGrammar &grammar) {
|
|||
expander.aux_rules.end()
|
||||
);
|
||||
|
||||
result.word_rule = grammar.word_rule;
|
||||
return result;
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -329,6 +329,18 @@ tuple<InitialSyntaxGrammar, LexicalGrammar, CompileError> extract_tokens(
|
|||
}
|
||||
}
|
||||
|
||||
syntax_grammar.word_rule = symbol_replacer.replace_symbol(grammar.word_rule);
|
||||
if (syntax_grammar.word_rule.is_non_terminal()) {
|
||||
return make_tuple(
|
||||
syntax_grammar,
|
||||
lexical_grammar,
|
||||
CompileError(
|
||||
TSCompileErrorTypeInvalidWordRule,
|
||||
"Word rules must be tokens"
|
||||
)
|
||||
);
|
||||
}
|
||||
|
||||
return make_tuple(syntax_grammar, lexical_grammar, CompileError::none());
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -161,6 +161,8 @@ pair<SyntaxGrammar, CompileError> flatten_grammar(const InitialSyntaxGrammar &gr
|
|||
i++;
|
||||
}
|
||||
|
||||
result.word_rule = grammar.word_rule;
|
||||
|
||||
return {result, CompileError::none()};
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -17,6 +17,7 @@ struct InitialSyntaxGrammar {
|
|||
std::set<std::set<rules::Symbol>> expected_conflicts;
|
||||
std::vector<ExternalToken> external_tokens;
|
||||
std::set<rules::Symbol> variables_to_inline;
|
||||
rules::Symbol word_rule;
|
||||
};
|
||||
|
||||
} // namespace prepare_grammar
|
||||
|
|
|
|||
|
|
@ -166,6 +166,8 @@ pair<InternedGrammar, CompileError> intern_symbols(const InputGrammar &grammar)
|
|||
}
|
||||
}
|
||||
|
||||
result.word_rule = interner.intern_symbol(grammar.word_rule);
|
||||
|
||||
return {result, CompileError::none()};
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -15,8 +15,8 @@ struct InternedGrammar {
|
|||
std::vector<rules::Rule> extra_tokens;
|
||||
std::set<std::set<rules::Symbol>> expected_conflicts;
|
||||
std::vector<Variable> external_tokens;
|
||||
std::set<rules::Symbol> blank_external_tokens;
|
||||
std::set<rules::Symbol> variables_to_inline;
|
||||
rules::Symbol word_rule;
|
||||
};
|
||||
|
||||
} // namespace prepare_grammar
|
||||
|
|
|
|||
|
|
@ -60,6 +60,7 @@ struct SyntaxGrammar {
|
|||
std::set<std::set<rules::Symbol>> expected_conflicts;
|
||||
std::vector<ExternalToken> external_tokens;
|
||||
std::set<rules::Symbol> variables_to_inline;
|
||||
rules::Symbol word_rule;
|
||||
};
|
||||
|
||||
} // namespace tree_sitter
|
||||
|
|
|
|||
|
|
@ -110,7 +110,7 @@ static inline void array__grow(VoidArray *self, size_t element_size) {
|
|||
|
||||
static inline void array__splice(VoidArray *self, size_t element_size,
|
||||
uint32_t index, uint32_t old_count,
|
||||
uint32_t new_count, void *elements) {
|
||||
uint32_t new_count, const void *elements) {
|
||||
uint32_t new_size = self->size + new_count - old_count;
|
||||
uint32_t old_end = index + old_count;
|
||||
uint32_t new_end = index + new_count;
|
||||
|
|
|
|||
|
|
@ -28,11 +28,11 @@ static inline TSNode ts_node__null() {
|
|||
|
||||
// TSNode - accessors
|
||||
|
||||
uint32_t ts_node_start_byte(const TSNode self) {
|
||||
uint32_t ts_node_start_byte(TSNode self) {
|
||||
return self.context[0];
|
||||
}
|
||||
|
||||
TSPoint ts_node_start_point(const TSNode self) {
|
||||
TSPoint ts_node_start_point(TSNode self) {
|
||||
return (TSPoint) {self.context[1], self.context[2]};
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -59,7 +59,7 @@ bool ts_external_scanner_state_eq(const ExternalScannerState *a, const ExternalS
|
|||
// SubtreeArray
|
||||
|
||||
bool ts_subtree_array_copy(SubtreeArray self, SubtreeArray *dest) {
|
||||
const Subtree **contents = NULL;
|
||||
Subtree **contents = NULL;
|
||||
if (self.capacity > 0) {
|
||||
contents = ts_calloc(self.capacity, sizeof(Subtree *));
|
||||
memcpy(contents, self.contents, self.size * sizeof(Subtree *));
|
||||
|
|
|
|||
|
|
@ -25,7 +25,8 @@ describe("ParseItemSetBuilder", []() {
|
|||
LexicalGrammar lexical_grammar{lexical_variables, {}};
|
||||
|
||||
it("adds items at the beginnings of referenced rules", [&]() {
|
||||
SyntaxGrammar grammar{{
|
||||
SyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
SyntaxVariable{"rule0", VariableTypeNamed, {
|
||||
Production({
|
||||
{Symbol::non_terminal(1), 0, AssociativityNone, Alias{}},
|
||||
|
|
@ -47,7 +48,7 @@ describe("ParseItemSetBuilder", []() {
|
|||
{Symbol::terminal(15), 0, AssociativityNone, Alias{}},
|
||||
}, 0)
|
||||
}},
|
||||
}, {}, {}, {}, {}};
|
||||
};
|
||||
|
||||
auto production = [&](int variable_index, int production_index) -> const Production & {
|
||||
return grammar.variables[variable_index].productions[production_index];
|
||||
|
|
@ -84,7 +85,8 @@ describe("ParseItemSetBuilder", []() {
|
|||
});
|
||||
|
||||
it("handles rules with empty productions", [&]() {
|
||||
SyntaxGrammar grammar{{
|
||||
SyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
SyntaxVariable{"rule0", VariableTypeNamed, {
|
||||
Production({
|
||||
{Symbol::non_terminal(1), 0, AssociativityNone, Alias{}},
|
||||
|
|
@ -98,7 +100,7 @@ describe("ParseItemSetBuilder", []() {
|
|||
}, 0),
|
||||
Production{{}, 0}
|
||||
}},
|
||||
}, {}, {}, {}, {}};
|
||||
};
|
||||
|
||||
auto production = [&](int variable_index, int production_index) -> const Production & {
|
||||
return grammar.variables[variable_index].productions[production_index];
|
||||
|
|
|
|||
|
|
@ -11,11 +11,9 @@ START_TEST
|
|||
|
||||
describe("expand_repeats", []() {
|
||||
it("replaces repeat rules with pairs of recursive rules", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(0)}},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(0)}},
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
@ -30,14 +28,12 @@ describe("expand_repeats", []() {
|
|||
});
|
||||
|
||||
it("replaces repeats inside of sequences", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Rule::seq({
|
||||
Symbol::terminal(10),
|
||||
Repeat{Symbol::terminal(11)},
|
||||
})},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
Variable{"rule0", VariableTypeNamed, Rule::seq({
|
||||
Symbol::terminal(10),
|
||||
Repeat{Symbol::terminal(11)},
|
||||
})},
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
@ -55,14 +51,12 @@ describe("expand_repeats", []() {
|
|||
});
|
||||
|
||||
it("replaces repeats inside of choices", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Rule::choice({
|
||||
Symbol::terminal(10),
|
||||
Repeat{Symbol::terminal(11)}
|
||||
})},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
Variable{"rule0", VariableTypeNamed, Rule::choice({
|
||||
Symbol::terminal(10),
|
||||
Repeat{Symbol::terminal(11)}
|
||||
})},
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
@ -80,18 +74,16 @@ describe("expand_repeats", []() {
|
|||
});
|
||||
|
||||
it("does not create redundant auxiliary rules", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Rule::choice({
|
||||
Rule::seq({ Symbol::terminal(1), Repeat{Symbol::terminal(4)} }),
|
||||
Rule::seq({ Symbol::terminal(2), Repeat{Symbol::terminal(4)} }),
|
||||
})},
|
||||
Variable{"rule1", VariableTypeNamed, Rule::seq({
|
||||
Symbol::terminal(3),
|
||||
Repeat{Symbol::terminal(4)}
|
||||
})},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
Variable{"rule0", VariableTypeNamed, Rule::choice({
|
||||
Rule::seq({ Symbol::terminal(1), Repeat{Symbol::terminal(4)} }),
|
||||
Rule::seq({ Symbol::terminal(2), Repeat{Symbol::terminal(4)} }),
|
||||
})},
|
||||
Variable{"rule1", VariableTypeNamed, Rule::seq({
|
||||
Symbol::terminal(3),
|
||||
Repeat{Symbol::terminal(4)}
|
||||
})},
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
@ -113,14 +105,14 @@ describe("expand_repeats", []() {
|
|||
});
|
||||
|
||||
it("can replace multiple repeats in the same rule", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Rule::seq({
|
||||
Repeat{Symbol::terminal(10)},
|
||||
Repeat{Symbol::terminal(11)},
|
||||
})},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
}
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
@ -142,12 +134,10 @@ describe("expand_repeats", []() {
|
|||
});
|
||||
|
||||
it("can replace repeats in multiple rules", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(10)}},
|
||||
Variable{"rule1", VariableTypeNamed, Repeat{Symbol::terminal(11)}},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(10)}},
|
||||
Variable{"rule1", VariableTypeNamed, Repeat{Symbol::terminal(11)}},
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
|
|||
|
|
@ -11,13 +11,11 @@ using prepare_grammar::intern_symbols;
|
|||
|
||||
describe("intern_symbols", []() {
|
||||
it("replaces named symbols with numerically-indexed symbols", [&]() {
|
||||
InputGrammar grammar{
|
||||
{
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"_z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"_z"}},
|
||||
{"_z", VariableTypeNamed, String{"stuff"}}
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InputGrammar grammar;
|
||||
grammar.variables = {
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"_z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"_z"}},
|
||||
{"_z", VariableTypeNamed, String{"stuff"}}
|
||||
};
|
||||
|
||||
auto result = intern_symbols(grammar);
|
||||
|
|
@ -32,11 +30,9 @@ describe("intern_symbols", []() {
|
|||
|
||||
describe("when there are symbols that reference undefined rules", [&]() {
|
||||
it("returns an error", []() {
|
||||
InputGrammar grammar{
|
||||
{
|
||||
{"x", VariableTypeNamed, NamedSymbol{"y"}},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InputGrammar grammar;
|
||||
grammar.variables = {
|
||||
{"x", VariableTypeNamed, NamedSymbol{"y"}},
|
||||
};
|
||||
|
||||
auto result = intern_symbols(grammar);
|
||||
|
|
@ -46,16 +42,14 @@ describe("intern_symbols", []() {
|
|||
});
|
||||
|
||||
it("translates the grammar's optional 'extra_tokens' to numerical symbols", [&]() {
|
||||
InputGrammar grammar{
|
||||
{
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"z"}},
|
||||
{"z", VariableTypeNamed, String{"stuff"}}
|
||||
},
|
||||
{
|
||||
NamedSymbol{"z"}
|
||||
},
|
||||
{}, {}, {}
|
||||
InputGrammar grammar;
|
||||
grammar.variables = {
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"z"}},
|
||||
{"z", VariableTypeNamed, String{"stuff"}}
|
||||
};
|
||||
grammar.extra_tokens = {
|
||||
NamedSymbol{"z"}
|
||||
};
|
||||
|
||||
auto result = intern_symbols(grammar);
|
||||
|
|
@ -66,19 +60,15 @@ describe("intern_symbols", []() {
|
|||
});
|
||||
|
||||
it("records any rule names that match external token names", [&]() {
|
||||
InputGrammar grammar{
|
||||
{
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"z"}},
|
||||
{"z", VariableTypeNamed, String{"stuff"}},
|
||||
},
|
||||
{},
|
||||
{},
|
||||
{
|
||||
NamedSymbol{"w"},
|
||||
NamedSymbol{"z"},
|
||||
},
|
||||
{}
|
||||
InputGrammar grammar;
|
||||
grammar.variables = {
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"z"}},
|
||||
{"z", VariableTypeNamed, String{"stuff"}},
|
||||
};
|
||||
grammar.external_tokens = {
|
||||
NamedSymbol{"w"},
|
||||
NamedSymbol{"z"},
|
||||
};
|
||||
|
||||
auto result = intern_symbols(grammar);
|
||||
|
|
|
|||