diff --git a/docs/_layouts/default.html b/docs/_layouts/default.html
index a764485b..a11327b3 100644
--- a/docs/_layouts/default.html
+++ b/docs/_layouts/default.html
@@ -32,6 +32,7 @@
{% capture whitespace %}
{% assign min_header = 2 %}
+ {% assign max_header = 3 %}
{% assign nodes = content | split: "
maxHeader %}
+ {% if header_level < min_header or header_level > max_header %}
{% continue %}
{% endif %}
@@ -127,7 +128,7 @@
}
});
- $('h1, h2, h3, h4, h5, h6').filter('[id]').each(function() {
+ $('h1, h2, h3').filter('[id]').each(function() {
$(this).html('' + $(this).text() + '');
});
diff --git a/docs/section-3-creating-parsers.md b/docs/section-3-creating-parsers.md
index 268b034a..90411a55 100644
--- a/docs/section-3-creating-parsers.md
+++ b/docs/section-3-creating-parsers.md
@@ -211,12 +211,13 @@ The following is a complete list of built-in functions you can use to define Tre
* **Tokens : `token(rule)`** - This function marks the given rule as producing only a single token. Tree-sitter's default is to treat each String or RegExp literal in the grammar as a separate token. Each token is matched separately by the lexer and returned as its own leaf node in the tree. The `token` function allows you to express a complex rule using the functions described above (rather than as a single regular expression) but still have Tree-sitter treat it as a single token.
* **Aliases : `alias(rule, name)`** - This function causes the given rule to *appear* with an alternative name in the syntax tree. It is useful in cases where a language construct needs to be parsed differently in different contexts (and thus needs to be defined using multiple symbols), but should always *appear* as the same type of node.
-In addition to the `name` and `rules` fields, grammars have a few other public fields that influence the behavior of the parser.
+In addition to the `name` and `rules` fields, grammars have a few other optional public fields that influence the behavior of the parser.
* `extras` - an array of tokens that may appear *anywhere* in the language. This is often used for whitespace and comments. The default for `extras` in `tree-sitter-cli` is to accept whitespace. To control whitespace explicitly, specify `extras: $ => []` in the grammar.
* `inline` - an array of rule names that should be automatically *removed* from the grammar by replacing all of their usages with a copy of their definition. This is useful for rules that are used in multiple places but for which you *don't* want to create syntax tree nodes at runtime.
* `conflicts` - an array of arrays of rule names. Each inner array represents a set of rules that's involved in an *LR(1) conflict* that is *intended to exist* in the grammar. When these conflicts occur at runtime, Tree-sitter will use the GLR algorithm to explore all of the possible interpretations. If *multiple* parses end up succeeding, Tree-sitter will pick the subtree rule with the highest *dynamic precedence*.
* `externals` - an array of token names which can be returned by an *external scanner*. External scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules (e.g. Python's indentation tokens) that cannot be described by regular expressions.
+* `word` - the name of a token that will match keywords for the purpose of the [keyword extraction](#keyword-extraction) optimization.
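
For example, a hypothetical grammar might combine several of these optional fields. This is only a sketch; the rule names (`comment`, `_expression`, `type_name`, `variable_name`) are illustrative, not part of any real grammar:

```js
grammar({
  name: 'my_language',

  // Allow whitespace and comments to appear anywhere.
  extras: $ => [/\s/, $.comment],

  // Inline this helper rule instead of creating nodes for it at runtime.
  inline: $ => [$._expression],

  // Tolerate an intended LR(1) conflict between these two rules.
  conflicts: $ => [[$.type_name, $.variable_name]],

  rules: {
    // ...
  }
})
```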
## Adjusting existing grammars
@@ -355,11 +356,81 @@ For an expression like `a * b * c`, it's not clear whether we mean `a * (b * c)`
You may have noticed in the above examples that some of the grammar rule names, like `_expression` and `_type`, began with an underscore. Starting a rule's name with an underscore causes the rule to be *hidden* in the syntax tree. This is useful for rules like `_expression` in the grammars above, which always just wrap a single child node. If these nodes were not hidden, they would add substantial depth and noise to the syntax tree without making it any easier to understand.
-## Dealing with LR conflicts
+### Dealing with LR conflicts
-TODO
+...
+## Lexical Analysis
+
+Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and [lexing][lexing], the process of grouping individual characters into the language's fundamental *tokens*. There are a few important things to know about how Tree-sitter's lexing works.
+
+### Conflict Resolution
+
+Grammars often contain multiple tokens that can match the same characters. For example, a grammar might contain the tokens `"if"` and `/[a-z]+/`. Tree-sitter differentiates between these conflicting tokens in a few ways:
+
+1. **Context-aware lexing** - Tree-sitter performs lexing on-demand, during the parsing process. At any given position in a source document, the lexer only tries to recognize tokens that are *valid* at that position.
+
+2. **Longest-match** - If multiple valid tokens match the characters at a given position in a document, Tree-sitter will select the token that matches the [longest sequence of characters][longest-match].
+
+3. **Lexical Precedence** - When the precedence functions described [above](#using-the-grammar-dsl) are used within the `token` function, the given precedence values serve as instructions to the lexer. If there are two valid tokens that match the same sequence of characters, Tree-sitter will select the one with the higher precedence.
+
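To sketch the third rule: if two tokens can match exactly the same string, wrapping one in `token(prec(...))` tells the lexer which to prefer. The rule names below are hypothetical:

```js
rules: {
  // Both tokens can match the string 'null'; the explicit lexical
  // precedence makes the lexer choose `null_literal` over `identifier`.
  null_literal: $ => token(prec(1, 'null')),
  identifier: $ => /[a-z]+/
}
```
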
+### Keywords
+
+If your language has keywords that are also matched by another rule (typically `identifier`), you can tell Tree-sitter about them with your grammar's `word` property.
+
+```js
+grammar({
+ word: $ => $.identifier,
+
+ rules: {
+ class_declaration: $ => seq(
+ 'class',
+ $.identifier,
+ $.class_body
+ ),
+
+ break_statement: $ => seq('break', ';'),
+
+ continue_statement: $ => seq('continue', ';'),
+
+ identifier: $ => /[a-z]+/
+ }
+})
+```
+
+In this case, we're specifying `identifier` as our `word`. Tree-sitter will automatically find the set of terminals which are matched by `$.identifier`, and consider them keywords. Instead of generating a parser which scans for each keyword individually, Tree-sitter will generate a parser that tries to match the word rule (in this case, `identifier`), and checks to see if the matched word is the necessary keyword.
+
+This makes the set of parse states smaller, so the parser compiles faster.
+
+It *also changes behavior*. Consider this grammar:
+
+```js
+grammar({
+ rules: {
+ import: $ => seq(
+ 'import',
+ $.identifier,
+ 'as',
+ $.identifier
+ ),
+
+ identifier: $ => /[a-z]+/
+ }
+})
+```
+
+Without the `word` property, the grammar matches this input:
+
+```
+import foo asbar
+```
+
+This is probably not what you want. If we add `word: $ => $.identifier`, this input will no longer parse. When trying to parse `'as'`, the parser will instead match an entire word (the identifier `'asbar'`), compare it to `'as'`, and correctly generate an error.
+
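With the `word` property added, the grammar above becomes:

```js
grammar({
  word: $ => $.identifier,

  rules: {
    import: $ => seq(
      'import',
      $.identifier,
      'as',
      $.identifier
    ),

    identifier: $ => /[a-z]+/
  }
})
```
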
+[lexing]: https://en.wikipedia.org/wiki/Lexical_analysis
+[longest-match]: https://en.wikipedia.org/wiki/Maximal_munch
[cst]: https://en.wikipedia.org/wiki/Parse_tree
+[dfa]: https://en.wikipedia.org/wiki/Deterministic_finite_automaton
[non-terminal]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols
[language-spec]: https://en.wikipedia.org/wiki/Programming_language_specification
[glr-parsing]: https://en.wikipedia.org/wiki/GLR_parser
diff --git a/include/tree_sitter/compiler.h b/include/tree_sitter/compiler.h
index ca2a28f7..3db2f7ca 100644
--- a/include/tree_sitter/compiler.h
+++ b/include/tree_sitter/compiler.h
@@ -19,6 +19,7 @@ typedef enum {
TSCompileErrorTypeEpsilonRule,
TSCompileErrorTypeInvalidTokenContents,
TSCompileErrorTypeInvalidRuleName,
+ TSCompileErrorTypeInvalidWordRule,
} TSCompileErrorType;
typedef struct {
diff --git a/src/compiler/build_tables/lex_table_builder.cc b/src/compiler/build_tables/lex_table_builder.cc
index 178cfb75..d0f363d1 100644
--- a/src/compiler/build_tables/lex_table_builder.cc
+++ b/src/compiler/build_tables/lex_table_builder.cc
@@ -49,6 +49,19 @@ using rules::Symbol;
using rules::Metadata;
using rules::Seq;
+enum ConflictStatus {
+ DoesNotMatch = 0,
+ MatchesShorterStringWithinSeparators = 1 << 0,
+ MatchesSameString = 1 << 1,
+ MatchesLongerString = 1 << 2,
+ MatchesLongerStringWithValidNextChar = 1 << 3,
+ CannotDistinguish = (
+ MatchesShorterStringWithinSeparators |
+ MatchesSameString |
+ MatchesLongerStringWithValidNextChar
+ ),
+};
+
static const std::unordered_set<ParseStateId> EMPTY;
bool CoincidentTokenIndex::contains(Symbol a, Symbol b) const {
@@ -65,14 +78,12 @@ const std::unordered_set &CoincidentTokenIndex::states_with(Symbol
}
}
-template <bool include_all>
-class CharacterAggregator {
+class StartingCharacterAggregator {
public:
void apply(const Rule &rule) {
rule.match(
[this](const Seq &sequence) {
apply(*sequence.left);
- if (include_all) apply(*sequence.right);
},
[this](const rules::Choice &rule) {
@@ -91,9 +102,6 @@ class CharacterAggregator {
CharacterSet result;
};
-using StartingCharacterAggregator = CharacterAggregator<false>;
-using AllCharacterAggregator = CharacterAggregator<true>;
-
class LexTableBuilderImpl : public LexTableBuilder {
LexTable main_lex_table;
LexTable keyword_lex_table;
@@ -109,7 +117,7 @@ class LexTableBuilderImpl : public LexTableBuilder {
vector<ConflictStatus> conflict_matrix;
bool conflict_detection_mode;
LookaheadSet keyword_symbols;
- Symbol keyword_capture_token;
+ Symbol word_rule;
char encoding_buffer[8];
public:
@@ -125,7 +133,7 @@ class LexTableBuilderImpl : public LexTableBuilder {
parse_table(parse_table),
conflict_matrix(lexical_grammar.variables.size() * lexical_grammar.variables.size(), DoesNotMatch),
conflict_detection_mode(false),
- keyword_capture_token(rules::NONE()) {
+ word_rule(syntax_grammar.word_rule) {
// Compute the possible separator rules and the set of separator characters that can occur
// immediately after any token.
@@ -141,7 +149,6 @@ class LexTableBuilderImpl : public LexTableBuilder {
// characters that can follow each token. Also identify all of the tokens that can be
// considered 'keywords'.
LOG_START("characterizing tokens");
- LookaheadSet potential_keyword_symbols;
for (unsigned i = 0, n = grammar.variables.size(); i < n; i++) {
Symbol token = Symbol::terminal(i);
@@ -158,31 +165,6 @@ class LexTableBuilderImpl : public LexTableBuilder {
});
}
following_characters_by_token[i] = following_character_aggregator.result;
-
- AllCharacterAggregator all_character_aggregator;
- all_character_aggregator.apply(grammar.variables[i].rule);
-
- if (
- !starting_character_aggregator.result.includes_all &&
- !all_character_aggregator.result.includes_all
- ) {
- bool starts_alpha = true, all_alnum = true;
- for (auto character : starting_character_aggregator.result.included_chars) {
- if (!iswalpha(character) && character != '_') {
- starts_alpha = false;
- }
- }
- for (auto character : all_character_aggregator.result.included_chars) {
- if (!iswalnum(character) && character != '_') {
- all_alnum = false;
- }
- }
- if (starts_alpha && all_alnum) {
- LOG("potential keyword: %s", token_name(token).c_str());
- potential_keyword_symbols.insert(token);
- }
- }
-
}
LOG_END();
@@ -205,98 +187,83 @@ class LexTableBuilderImpl : public LexTableBuilder {
}
LOG_END();
- LOG_START("finding keyword capture token");
- for (Symbol::Index i = 0, n = grammar.variables.size(); i < n; i++) {
- Symbol candidate = Symbol::terminal(i);
+ if (word_rule != rules::NONE()) {
+ identify_keywords();
+ }
+ }
- LookaheadSet homonyms;
- potential_keyword_symbols.for_each([&](Symbol other_token) {
- if (get_conflict_status(other_token, candidate) & MatchesShorterStringWithinSeparators) {
- homonyms.clear();
- return false;
- }
- if (get_conflict_status(candidate, other_token) == MatchesSameString) {
- homonyms.insert(other_token);
- }
- return true;
- });
- if (homonyms.empty()) continue;
-
- LOG_START(
- "keyword capture token candidate: %s, homonym count: %lu",
- token_name(candidate).c_str(),
- homonyms.size()
- );
-
- homonyms.for_each([&](Symbol homonym1) {
- homonyms.for_each([&](Symbol homonym2) {
- if (get_conflict_status(homonym1, homonym2) & MatchesSameString) {
- LOG(
- "conflict between homonyms %s %s",
- token_name(homonym1).c_str(),
- token_name(homonym2).c_str()
- );
- homonyms.remove(homonym1);
- }
- return false;
- });
- return true;
- });
-
- for (Symbol::Index j = 0; j < n; j++) {
- Symbol other_token = Symbol::terminal(j);
- if (other_token == candidate || homonyms.contains(other_token)) continue;
- bool candidate_shadows_other = get_conflict_status(other_token, candidate);
- bool other_shadows_candidate = get_conflict_status(candidate, other_token);
-
- if (candidate_shadows_other || other_shadows_candidate) {
- homonyms.for_each([&](Symbol homonym) {
- bool other_shadows_homonym = get_conflict_status(homonym, other_token);
-
- bool candidate_was_already_present = true;
- for (ParseStateId state_id : coincident_token_index.states_with(homonym, other_token)) {
- if (!parse_table->states[state_id].has_terminal_entry(candidate)) {
- candidate_was_already_present = false;
- break;
- }
- }
- if (candidate_was_already_present) return true;
-
- if (candidate_shadows_other) {
- homonyms.remove(homonym);
- LOG(
- "remove %s because candidate would shadow %s",
- token_name(homonym).c_str(),
- token_name(other_token).c_str()
- );
- } else if (other_shadows_candidate && !other_shadows_homonym) {
- homonyms.remove(homonym);
- LOG(
- "remove %s because %s would shadow candidate",
- token_name(homonym).c_str(),
- token_name(other_token).c_str()
- );
- }
- return true;
- });
- }
+ void identify_keywords() {
+ LookaheadSet homonyms;
+ for (Symbol::Index j = 0, n = grammar.variables.size(); j < n; j++) {
+ Symbol other_token = Symbol::terminal(j);
+ if (get_conflict_status(word_rule, other_token) == MatchesSameString) {
+ homonyms.insert(other_token);
}
-
- if (homonyms.size() > keyword_symbols.size()) {
- LOG_START("found capture token. homonyms:");
- homonyms.for_each([&](Symbol homonym) {
- LOG("%s", token_name(homonym).c_str());
- return true;
- });
- LOG_END();
- keyword_symbols = homonyms;
- keyword_capture_token = candidate;
- }
-
- LOG_END();
}
- LOG_END();
+ homonyms.for_each([&](Symbol homonym1) {
+ homonyms.for_each([&](Symbol homonym2) {
+ if (get_conflict_status(homonym1, homonym2) & MatchesSameString) {
+ LOG(
+ "conflict between homonyms %s %s",
+ token_name(homonym1).c_str(),
+ token_name(homonym2).c_str()
+ );
+ homonyms.remove(homonym1);
+ }
+ return false;
+ });
+ return true;
+ });
+
+ for (Symbol::Index j = 0, n = grammar.variables.size(); j < n; j++) {
+ Symbol other_token = Symbol::terminal(j);
+ if (other_token == word_rule || homonyms.contains(other_token)) continue;
+ bool word_rule_shadows_other = get_conflict_status(other_token, word_rule);
+ bool other_shadows_word_rule = get_conflict_status(word_rule, other_token);
+
+ if (word_rule_shadows_other || other_shadows_word_rule) {
+ homonyms.for_each([&](Symbol homonym) {
+ bool other_shadows_homonym = get_conflict_status(homonym, other_token);
+
+ bool word_rule_was_already_present = true;
+ for (ParseStateId state_id : coincident_token_index.states_with(homonym, other_token)) {
+ if (!parse_table->states[state_id].has_terminal_entry(word_rule)) {
+ word_rule_was_already_present = false;
+ break;
+ }
+ }
+ if (word_rule_was_already_present) return true;
+
+ if (word_rule_shadows_other) {
+ homonyms.remove(homonym);
+ LOG(
+ "remove %s because word_rule would shadow %s",
+ token_name(homonym).c_str(),
+ token_name(other_token).c_str()
+ );
+ } else if (other_shadows_word_rule && !other_shadows_homonym) {
+ homonyms.remove(homonym);
+ LOG(
+ "remove %s because %s would shadow word_rule",
+ token_name(homonym).c_str(),
+ token_name(other_token).c_str()
+ );
+ }
+ return true;
+ });
+ }
+ }
+
+ if (!homonyms.empty()) {
+ LOG_START("found keywords:");
+ homonyms.for_each([&](Symbol homonym) {
+ LOG("%s", token_name(homonym).c_str());
+ return true;
+ });
+ LOG_END();
+ keyword_symbols = homonyms;
+ }
}
BuildResult build() {
@@ -307,8 +274,8 @@ class LexTableBuilderImpl : public LexTableBuilder {
for (ParseState &parse_state : parse_table->states) {
LookaheadSet token_set;
for (auto &entry : parse_state.terminal_entries) {
- if (keyword_capture_token.is_terminal() && keyword_symbols.contains(entry.first)) {
- token_set.insert(keyword_capture_token);
+ if (word_rule.is_terminal() && keyword_symbols.contains(entry.first)) {
+ token_set.insert(word_rule);
} else {
token_set.insert(entry.first);
}
@@ -337,7 +304,19 @@ class LexTableBuilderImpl : public LexTableBuilder {
mark_fragile_tokens();
remove_duplicate_lex_states(main_lex_table);
- return {main_lex_table, keyword_lex_table, keyword_capture_token};
+ return {main_lex_table, keyword_lex_table, word_rule};
+ }
+
+ bool does_token_shadow_other(Symbol token, Symbol shadowed_token) const {
+ if (token == word_rule && keyword_symbols.contains(shadowed_token)) return false;
+ return get_conflict_status(shadowed_token, token) & (
+ MatchesShorterStringWithinSeparators |
+ MatchesLongerStringWithValidNextChar
+ );
+ }
+
+ bool does_token_match_same_string_as_other(Symbol token, Symbol shadowed_token) const {
+ return get_conflict_status(shadowed_token, token) & MatchesSameString;
}
ConflictStatus get_conflict_status(Symbol shadowed_token, Symbol other_token) const {
@@ -410,12 +389,14 @@ class LexTableBuilderImpl : public LexTableBuilder {
advance_symbol,
MatchesLongerStringWithValidNextChar
)) {
- LOG(
- "%s shadows %s followed by '%s'",
- token_name(advance_symbol).c_str(),
- token_name(accept_action.symbol).c_str(),
- log_char(*conflicting_following_chars.included_chars.begin())
- );
+ if (!conflicting_following_chars.included_chars.empty()) {
+ LOG(
+ "%s shadows %s followed by '%s'",
+ token_name(advance_symbol).c_str(),
+ token_name(accept_action.symbol).c_str(),
+ log_char(*conflicting_following_chars.included_chars.begin())
+ );
+ }
}
}
}
@@ -665,8 +646,12 @@ LexTableBuilder::BuildResult LexTableBuilder::build() {
return static_cast<LexTableBuilderImpl *>(this)->build();
}
-ConflictStatus LexTableBuilder::get_conflict_status(Symbol a, Symbol b) const {
- return static_cast<const LexTableBuilderImpl *>(this)->get_conflict_status(a, b);
+bool LexTableBuilder::does_token_shadow_other(Symbol a, Symbol b) const {
+ return static_cast<const LexTableBuilderImpl *>(this)->does_token_shadow_other(a, b);
+}
+
+bool LexTableBuilder::does_token_match_same_string_as_other(Symbol a, Symbol b) const {
+ return static_cast<const LexTableBuilderImpl *>(this)->does_token_match_same_string_as_other(a, b);
}
} // namespace build_tables
diff --git a/src/compiler/build_tables/lex_table_builder.h b/src/compiler/build_tables/lex_table_builder.h
index 4ec4f22b..d69b996b 100644
--- a/src/compiler/build_tables/lex_table_builder.h
+++ b/src/compiler/build_tables/lex_table_builder.h
@@ -30,19 +30,6 @@ namespace build_tables {
class LookaheadSet;
-enum ConflictStatus {
- DoesNotMatch = 0,
- MatchesShorterStringWithinSeparators = 1 << 0,
- MatchesSameString = 1 << 1,
- MatchesLongerString = 1 << 2,
- MatchesLongerStringWithValidNextChar = 1 << 3,
- CannotDistinguish = (
- MatchesShorterStringWithinSeparators |
- MatchesSameString |
- MatchesLongerStringWithValidNextChar
- ),
-};
-
struct CoincidentTokenIndex {
std::unordered_map<
std::pair,
@@ -69,7 +56,8 @@ class LexTableBuilder {
BuildResult build();
- ConflictStatus get_conflict_status(rules::Symbol, rules::Symbol) const;
+ bool does_token_shadow_other(rules::Symbol, rules::Symbol) const;
+ bool does_token_match_same_string_as_other(rules::Symbol, rules::Symbol) const;
protected:
LexTableBuilder() = default;
diff --git a/src/compiler/build_tables/parse_table_builder.cc b/src/compiler/build_tables/parse_table_builder.cc
index 0e6b4247..26dae5b7 100644
--- a/src/compiler/build_tables/parse_table_builder.cc
+++ b/src/compiler/build_tables/parse_table_builder.cc
@@ -134,11 +134,6 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
}
void build_error_parse_state(ParseStateId state_id) {
- unsigned CannotMerge = (
- MatchesShorterStringWithinSeparators |
- MatchesLongerStringWithValidNextChar
- );
-
parse_table.states[state_id].terminal_entries.clear();
// First, identify the conflict-free tokens.
@@ -149,7 +144,7 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
for (unsigned j = 0; j < lexical_grammar.variables.size(); j++) {
Symbol other_token = Symbol::terminal(j);
if (!coincident_token_index.contains(token, other_token) &&
- (lex_table_builder->get_conflict_status(other_token, token) & CannotMerge)) {
+ lex_table_builder->does_token_shadow_other(token, other_token)) {
conflicts_with_other_tokens = true;
break;
}
@@ -171,7 +166,7 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
bool conflicts_with_other_tokens = false;
conflict_free_tokens.for_each([&](Symbol other_token) {
if (!coincident_token_index.contains(token, other_token) &&
- (lex_table_builder->get_conflict_status(other_token, token) & CannotMerge)) {
+ lex_table_builder->does_token_shadow_other(token, other_token)) {
LOG(
"exclude %s: conflicts with %s",
symbol_name(token).c_str(),
@@ -517,7 +512,8 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
// Do not add a token if it conflicts with an existing token.
if (!new_token.is_built_in()) {
for (const auto &entry : state.terminal_entries) {
- if (lex_table_builder->get_conflict_status(entry.first, new_token) & CannotDistinguish) {
+ if (lex_table_builder->does_token_shadow_other(new_token, entry.first) ||
+ lex_table_builder->does_token_match_same_string_as_other(new_token, entry.first)) {
LOG_IF(
logged_conflict_tokens.insert({entry.first, new_token}).second,
"cannot merge parse states due to token conflict: %s and %s",
diff --git a/src/compiler/grammar.h b/src/compiler/grammar.h
index 6c63340c..5e2212fb 100644
--- a/src/compiler/grammar.h
+++ b/src/compiler/grammar.h
@@ -32,6 +32,7 @@ struct InputGrammar {
std::vector> expected_conflicts;
std::vector external_tokens;
std::unordered_set variables_to_inline;
+ rules::NamedSymbol word_rule;
};
} // namespace tree_sitter
diff --git a/src/compiler/log.cc b/src/compiler/log.cc
index c0f3a03c..4b1e3dbf 100644
--- a/src/compiler/log.cc
+++ b/src/compiler/log.cc
@@ -1,4 +1,5 @@
#include "compiler/log.h"
+#include <cassert>
static const char *SPACES = " ";
@@ -21,6 +22,7 @@ void _indent_logs() {
}
void _outdent_logs() {
+ assert(_indent_level > 0);
_indent_level--;
}
diff --git a/src/compiler/parse_grammar.cc b/src/compiler/parse_grammar.cc
index e233cfe0..f589d15a 100644
--- a/src/compiler/parse_grammar.cc
+++ b/src/compiler/parse_grammar.cc
@@ -229,7 +229,9 @@ ParseGrammarResult parse_grammar(const string &input) {
string error_message;
string name;
InputGrammar grammar;
- json_value name_json, rules_json, extras_json, conflicts_json, external_tokens_json, inline_rules_json;
+ json_value
+ name_json, rules_json, extras_json, conflicts_json, external_tokens_json,
+ inline_rules_json, word_rule_json;
json_settings settings = { 0, json_enable_comments, 0, 0, 0, 0 };
char parse_error[json_error_max];
@@ -359,6 +361,16 @@ ParseGrammarResult parse_grammar(const string &input) {
}
}
+ word_rule_json = grammar_json->operator[]("word");
+ if (word_rule_json.type != json_none) {
+ if (word_rule_json.type != json_string) {
+ error_message = "Invalid word property";
+ goto error;
+ }
+
+ grammar.word_rule = NamedSymbol { word_rule_json.u.string.ptr };
+ }
+
json_value_free(grammar_json);
return { name, grammar, "" };
diff --git a/src/compiler/prepare_grammar/expand_repeats.cc b/src/compiler/prepare_grammar/expand_repeats.cc
index 46230867..42878376 100644
--- a/src/compiler/prepare_grammar/expand_repeats.cc
+++ b/src/compiler/prepare_grammar/expand_repeats.cc
@@ -106,6 +106,7 @@ InitialSyntaxGrammar expand_repeats(const InitialSyntaxGrammar &grammar) {
expander.aux_rules.end()
);
+ result.word_rule = grammar.word_rule;
return result;
}
diff --git a/src/compiler/prepare_grammar/extract_tokens.cc b/src/compiler/prepare_grammar/extract_tokens.cc
index 93b06be2..c82b3505 100644
--- a/src/compiler/prepare_grammar/extract_tokens.cc
+++ b/src/compiler/prepare_grammar/extract_tokens.cc
@@ -329,6 +329,18 @@ tuple extract_tokens(
}
}
+ syntax_grammar.word_rule = symbol_replacer.replace_symbol(grammar.word_rule);
+ if (syntax_grammar.word_rule.is_non_terminal()) {
+ return make_tuple(
+ syntax_grammar,
+ lexical_grammar,
+ CompileError(
+ TSCompileErrorTypeInvalidWordRule,
+ "Word rules must be tokens"
+ )
+ );
+ }
+
return make_tuple(syntax_grammar, lexical_grammar, CompileError::none());
}
diff --git a/src/compiler/prepare_grammar/flatten_grammar.cc b/src/compiler/prepare_grammar/flatten_grammar.cc
index e135ee67..ebfc3ae4 100644
--- a/src/compiler/prepare_grammar/flatten_grammar.cc
+++ b/src/compiler/prepare_grammar/flatten_grammar.cc
@@ -161,6 +161,8 @@ pair flatten_grammar(const InitialSyntaxGrammar &gr
i++;
}
+ result.word_rule = grammar.word_rule;
+
return {result, CompileError::none()};
}
diff --git a/src/compiler/prepare_grammar/initial_syntax_grammar.h b/src/compiler/prepare_grammar/initial_syntax_grammar.h
index 881c6396..4f21e3cd 100644
--- a/src/compiler/prepare_grammar/initial_syntax_grammar.h
+++ b/src/compiler/prepare_grammar/initial_syntax_grammar.h
@@ -17,6 +17,7 @@ struct InitialSyntaxGrammar {
std::set> expected_conflicts;
std::vector external_tokens;
std::set variables_to_inline;
+ rules::Symbol word_rule;
};
} // namespace prepare_grammar
diff --git a/src/compiler/prepare_grammar/intern_symbols.cc b/src/compiler/prepare_grammar/intern_symbols.cc
index 4e610960..dc128779 100644
--- a/src/compiler/prepare_grammar/intern_symbols.cc
+++ b/src/compiler/prepare_grammar/intern_symbols.cc
@@ -166,6 +166,8 @@ pair intern_symbols(const InputGrammar &grammar)
}
}
+ result.word_rule = interner.intern_symbol(grammar.word_rule);
+
return {result, CompileError::none()};
}
diff --git a/src/compiler/prepare_grammar/interned_grammar.h b/src/compiler/prepare_grammar/interned_grammar.h
index 83117ced..fc322522 100644
--- a/src/compiler/prepare_grammar/interned_grammar.h
+++ b/src/compiler/prepare_grammar/interned_grammar.h
@@ -15,8 +15,8 @@ struct InternedGrammar {
std::vector extra_tokens;
std::set> expected_conflicts;
std::vector external_tokens;
- std::set blank_external_tokens;
std::set variables_to_inline;
+ rules::Symbol word_rule;
};
} // namespace prepare_grammar
diff --git a/src/compiler/syntax_grammar.h b/src/compiler/syntax_grammar.h
index 2d55686b..7d2b1be1 100644
--- a/src/compiler/syntax_grammar.h
+++ b/src/compiler/syntax_grammar.h
@@ -60,6 +60,7 @@ struct SyntaxGrammar {
std::set> expected_conflicts;
std::vector external_tokens;
std::set variables_to_inline;
+ rules::Symbol word_rule;
};
} // namespace tree_sitter
diff --git a/src/runtime/array.h b/src/runtime/array.h
index 45b3adaa..b32487a2 100644
--- a/src/runtime/array.h
+++ b/src/runtime/array.h
@@ -110,7 +110,7 @@ static inline void array__grow(VoidArray *self, size_t element_size) {
static inline void array__splice(VoidArray *self, size_t element_size,
uint32_t index, uint32_t old_count,
- uint32_t new_count, void *elements) {
+ uint32_t new_count, const void *elements) {
uint32_t new_size = self->size + new_count - old_count;
uint32_t old_end = index + old_count;
uint32_t new_end = index + new_count;
diff --git a/src/runtime/node.c b/src/runtime/node.c
index 607cf9de..0855ec66 100644
--- a/src/runtime/node.c
+++ b/src/runtime/node.c
@@ -28,11 +28,11 @@ static inline TSNode ts_node__null() {
// TSNode - accessors
-uint32_t ts_node_start_byte(const TSNode self) {
+uint32_t ts_node_start_byte(TSNode self) {
return self.context[0];
}
-TSPoint ts_node_start_point(const TSNode self) {
+TSPoint ts_node_start_point(TSNode self) {
return (TSPoint) {self.context[1], self.context[2]};
}
diff --git a/src/runtime/subtree.c b/src/runtime/subtree.c
index 9b2d954f..83a1ad4e 100644
--- a/src/runtime/subtree.c
+++ b/src/runtime/subtree.c
@@ -59,7 +59,7 @@ bool ts_external_scanner_state_eq(const ExternalScannerState *a, const ExternalS
// SubtreeArray
bool ts_subtree_array_copy(SubtreeArray self, SubtreeArray *dest) {
- const Subtree **contents = NULL;
+ Subtree **contents = NULL;
if (self.capacity > 0) {
contents = ts_calloc(self.capacity, sizeof(Subtree *));
memcpy(contents, self.contents, self.size * sizeof(Subtree *));
diff --git a/test/compiler/build_tables/parse_item_set_builder_test.cc b/test/compiler/build_tables/parse_item_set_builder_test.cc
index 6cf5bb0e..6c41c3ca 100644
--- a/test/compiler/build_tables/parse_item_set_builder_test.cc
+++ b/test/compiler/build_tables/parse_item_set_builder_test.cc
@@ -25,7 +25,8 @@ describe("ParseItemSetBuilder", []() {
LexicalGrammar lexical_grammar{lexical_variables, {}};
it("adds items at the beginnings of referenced rules", [&]() {
- SyntaxGrammar grammar{{
+ SyntaxGrammar grammar;
+ grammar.variables = {
SyntaxVariable{"rule0", VariableTypeNamed, {
Production({
{Symbol::non_terminal(1), 0, AssociativityNone, Alias{}},
@@ -47,7 +48,7 @@ describe("ParseItemSetBuilder", []() {
{Symbol::terminal(15), 0, AssociativityNone, Alias{}},
}, 0)
}},
- }, {}, {}, {}, {}};
+ };
auto production = [&](int variable_index, int production_index) -> const Production & {
return grammar.variables[variable_index].productions[production_index];
@@ -84,7 +85,8 @@ describe("ParseItemSetBuilder", []() {
});
it("handles rules with empty productions", [&]() {
- SyntaxGrammar grammar{{
+ SyntaxGrammar grammar;
+ grammar.variables = {
SyntaxVariable{"rule0", VariableTypeNamed, {
Production({
{Symbol::non_terminal(1), 0, AssociativityNone, Alias{}},
@@ -98,7 +100,7 @@ describe("ParseItemSetBuilder", []() {
}, 0),
Production{{}, 0}
}},
- }, {}, {}, {}, {}};
+ };
auto production = [&](int variable_index, int production_index) -> const Production & {
return grammar.variables[variable_index].productions[production_index];
diff --git a/test/compiler/prepare_grammar/expand_repeats_test.cc b/test/compiler/prepare_grammar/expand_repeats_test.cc
index 250bd59b..f7aaa8fe 100644
--- a/test/compiler/prepare_grammar/expand_repeats_test.cc
+++ b/test/compiler/prepare_grammar/expand_repeats_test.cc
@@ -11,11 +11,9 @@ START_TEST
describe("expand_repeats", []() {
it("replaces repeat rules with pairs of recursive rules", [&]() {
- InitialSyntaxGrammar grammar{
- {
- Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(0)}},
- },
- {}, {}, {}, {}
+ InitialSyntaxGrammar grammar;
+ grammar.variables = {
+ Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(0)}},
};
auto result = expand_repeats(grammar);
@@ -30,14 +28,12 @@ describe("expand_repeats", []() {
});
it("replaces repeats inside of sequences", [&]() {
- InitialSyntaxGrammar grammar{
- {
- Variable{"rule0", VariableTypeNamed, Rule::seq({
- Symbol::terminal(10),
- Repeat{Symbol::terminal(11)},
- })},
- },
- {}, {}, {}, {}
+ InitialSyntaxGrammar grammar;
+ grammar.variables = {
+ Variable{"rule0", VariableTypeNamed, Rule::seq({
+ Symbol::terminal(10),
+ Repeat{Symbol::terminal(11)},
+ })},
};
auto result = expand_repeats(grammar);
@@ -55,14 +51,12 @@ describe("expand_repeats", []() {
});
it("replaces repeats inside of choices", [&]() {
- InitialSyntaxGrammar grammar{
- {
- Variable{"rule0", VariableTypeNamed, Rule::choice({
- Symbol::terminal(10),
- Repeat{Symbol::terminal(11)}
- })},
- },
- {}, {}, {}, {}
+ InitialSyntaxGrammar grammar;
+ grammar.variables = {
+ Variable{"rule0", VariableTypeNamed, Rule::choice({
+ Symbol::terminal(10),
+ Repeat{Symbol::terminal(11)}
+ })},
};
auto result = expand_repeats(grammar);
@@ -80,18 +74,16 @@ describe("expand_repeats", []() {
});
it("does not create redundant auxiliary rules", [&]() {
- InitialSyntaxGrammar grammar{
- {
- Variable{"rule0", VariableTypeNamed, Rule::choice({
- Rule::seq({ Symbol::terminal(1), Repeat{Symbol::terminal(4)} }),
- Rule::seq({ Symbol::terminal(2), Repeat{Symbol::terminal(4)} }),
- })},
- Variable{"rule1", VariableTypeNamed, Rule::seq({
- Symbol::terminal(3),
- Repeat{Symbol::terminal(4)}
- })},
- },
- {}, {}, {}, {}
+ InitialSyntaxGrammar grammar;
+ grammar.variables = {
+ Variable{"rule0", VariableTypeNamed, Rule::choice({
+ Rule::seq({ Symbol::terminal(1), Repeat{Symbol::terminal(4)} }),
+ Rule::seq({ Symbol::terminal(2), Repeat{Symbol::terminal(4)} }),
+ })},
+ Variable{"rule1", VariableTypeNamed, Rule::seq({
+ Symbol::terminal(3),
+ Repeat{Symbol::terminal(4)}
+ })},
};
auto result = expand_repeats(grammar);
@@ -113,14 +105,12 @@ describe("expand_repeats", []() {
});
it("can replace multiple repeats in the same rule", [&]() {
- InitialSyntaxGrammar grammar{
- {
- Variable{"rule0", VariableTypeNamed, Rule::seq({
- Repeat{Symbol::terminal(10)},
- Repeat{Symbol::terminal(11)},
- })},
- },
- {}, {}, {}, {}
+ InitialSyntaxGrammar grammar;
+ grammar.variables = {
+ Variable{"rule0", VariableTypeNamed, Rule::seq({
+ Repeat{Symbol::terminal(10)},
+ Repeat{Symbol::terminal(11)},
+ })},
};
auto result = expand_repeats(grammar);
@@ -142,12 +132,10 @@ describe("expand_repeats", []() {
});
it("can replace repeats in multiple rules", [&]() {
- InitialSyntaxGrammar grammar{
- {
- Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(10)}},
- Variable{"rule1", VariableTypeNamed, Repeat{Symbol::terminal(11)}},
- },
- {}, {}, {}, {}
+ InitialSyntaxGrammar grammar;
+ grammar.variables = {
+ Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(10)}},
+ Variable{"rule1", VariableTypeNamed, Repeat{Symbol::terminal(11)}},
};
auto result = expand_repeats(grammar);
diff --git a/test/compiler/prepare_grammar/intern_symbols_test.cc b/test/compiler/prepare_grammar/intern_symbols_test.cc
index 65bad45e..7b7f3624 100644
--- a/test/compiler/prepare_grammar/intern_symbols_test.cc
+++ b/test/compiler/prepare_grammar/intern_symbols_test.cc
@@ -11,13 +11,11 @@ using prepare_grammar::intern_symbols;
describe("intern_symbols", []() {
it("replaces named symbols with numerically-indexed symbols", [&]() {
- InputGrammar grammar{
- {
- {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"_z"} })},
- {"y", VariableTypeNamed, NamedSymbol{"_z"}},
- {"_z", VariableTypeNamed, String{"stuff"}}
- },
- {}, {}, {}, {}
+ InputGrammar grammar;
+ grammar.variables = {
+ {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"_z"} })},
+ {"y", VariableTypeNamed, NamedSymbol{"_z"}},
+ {"_z", VariableTypeNamed, String{"stuff"}}
};
auto result = intern_symbols(grammar);
@@ -32,11 +30,9 @@ describe("intern_symbols", []() {
describe("when there are symbols that reference undefined rules", [&]() {
it("returns an error", []() {
- InputGrammar grammar{
- {
- {"x", VariableTypeNamed, NamedSymbol{"y"}},
- },
- {}, {}, {}, {}
+ InputGrammar grammar;
+ grammar.variables = {
+ {"x", VariableTypeNamed, NamedSymbol{"y"}},
};
auto result = intern_symbols(grammar);
@@ -46,16 +42,14 @@ describe("intern_symbols", []() {
});
it("translates the grammar's optional 'extra_tokens' to numerical symbols", [&]() {
- InputGrammar grammar{
- {
- {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
- {"y", VariableTypeNamed, NamedSymbol{"z"}},
- {"z", VariableTypeNamed, String{"stuff"}}
- },
- {
- NamedSymbol{"z"}
- },
- {}, {}, {}
+ InputGrammar grammar;
+ grammar.variables = {
+ {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
+ {"y", VariableTypeNamed, NamedSymbol{"z"}},
+ {"z", VariableTypeNamed, String{"stuff"}}
+ };
+ grammar.extra_tokens = {
+ NamedSymbol{"z"}
};
auto result = intern_symbols(grammar);
@@ -66,19 +60,15 @@ describe("intern_symbols", []() {
});
it("records any rule names that match external token names", [&]() {
- InputGrammar grammar{
- {
- {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
- {"y", VariableTypeNamed, NamedSymbol{"z"}},
- {"z", VariableTypeNamed, String{"stuff"}},
- },
- {},
- {},
- {
- NamedSymbol{"w"},
- NamedSymbol{"z"},
- },
- {}
+ InputGrammar grammar;
+ grammar.variables = {
+ {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
+ {"y", VariableTypeNamed, NamedSymbol{"z"}},
+ {"z", VariableTypeNamed, String{"stuff"}},
+ };
+ grammar.external_tokens = {
+ NamedSymbol{"w"},
+ NamedSymbol{"z"},
};
auto result = intern_symbols(grammar);