diff --git a/docs/_layouts/default.html b/docs/_layouts/default.html index a764485b..a11327b3 100644 --- a/docs/_layouts/default.html +++ b/docs/_layouts/default.html @@ -32,6 +32,7 @@
{% capture whitespace %} {% assign min_header = 2 %} + {% assign max_header = 3 %} {% assign nodes = content | split: " maxHeader %} + {% if header_level < min_header or header_level > max_header %} {% continue %} {% endif %} @@ -127,7 +128,7 @@ } }); - $('h1, h2, h3, h4, h5, h6').filter('[id]').each(function() { + $('h1, h2, h3').filter('[id]').each(function() { $(this).html('' + $(this).text() + ''); }); diff --git a/docs/section-3-creating-parsers.md b/docs/section-3-creating-parsers.md index 268b034a..90411a55 100644 --- a/docs/section-3-creating-parsers.md +++ b/docs/section-3-creating-parsers.md @@ -211,12 +211,13 @@ The following is a complete list of built-in functions you can use to define Tre * **Tokens : `token(rule)`** - This function marks the given rule as producing only a single token. Tree-sitter's default is to treat each String or RegExp literal in the grammar as a separate token. Each token is matched separately by the lexer and returned as its own leaf node in the tree. The `token` function allows you to express a complex rule using the functions described above (rather than as a single regular expression) but still have Tree-sitter treat it as a single token. * **Aliases : `alias(rule, name)`** - This function causes the given rule to *appear* with an alternative name in the syntax tree. It is useful in cases where a language construct needs to be parsed differently in different contexts (and thus needs to be defined using multiple symbols), but should always *appear* as the same type of node. -In addition to the `name` and `rules` fields, grammars have a few other public fields that influence the behavior of the parser. +In addition to the `name` and `rules` fields, grammars have a few other optional public fields that influence the behavior of the parser. * `extras` - an array of tokens that may appear *anywhere* in the language. This is often used for whitespace and comments. 
The default for `extras` in `tree-sitter-cli` is to accept whitespace. To control whitespace explicitly, specify `extras=[]` in the grammar. * `inline` - an array of rule names that should be automatically *removed* from the grammar by replacing all of their usages with a copy of their definition. This is useful for rules that are used in multiple places but for which you *don't* want to create syntax tree nodes at runtime. * `conflicts` - an array of arrays of rule names. Each inner array represents a set of rules that's involved in an *LR(1) conflict* that is *intended to exist* in the grammar. When these conflicts occur at runtime, Tree-sitter will use the GLR algorithm to explore all of the possible interpretations. If *multiple* parses end up succeeding, Tree-sitter will pick the subtree rule with the highest *dynamic precedence*. * `externals` - an array of toen names which can be returned by an *external scanner*. External scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules (e.g. Python's indentation tokens) that cannot be described by regular expressions. +* `word` - the name of a token that will match keywords for the purpose of the [keyword extraction](#keywords) optimization. ## Adjusting existing grammars @@ -355,11 +356,81 @@ For an expression like `a * b * c`, it's not clear whether we mean `a * (b * c)` You may have noticed in the above examples that some of the grammar rule name like `_expression` and `_type` began with an underscore. Starting a rule's name with an underscore causes the rule to be *hidden* in the syntax tree. This is useful for rules like `_expression` in the grammars above, which always just wrap a single child node. If these nodes were not hidden, they would add substantial depth and noise to the syntax tree without making it any easier to understand. -## Dealing with LR conflicts +### Dealing with LR conflicts -TODO +... 
+## Lexical Analysis + +Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and [lexing][lexing] - the process of grouping individual characters into the language's fundamental *tokens*. There are a few important things to know about how Tree-sitter's lexing works. + +### Conflict Resolution + +Grammars often contain multiple tokens that can match the same characters. For example, a grammar might contain the tokens `"if"` and `/[a-z]+/`. Tree-sitter differentiates between these conflicting tokens in a few ways: + +1. **Context-aware lexing** - Tree-sitter performs lexing on-demand, during the parsing process. At any given position in a source document, the lexer only tries to recognize tokens that are *valid* at that position in the document. + +2. **Longest-match** - If multiple valid tokens match the characters at a given position in a document, Tree-sitter will select the token that matches the [longest sequence of characters][longest-match]. + +3. **Lexical Precedence** - When the precedence functions described [above](#using-the-grammar-dsl) are used within the `token` function, the given precedence values serve as instructions to the lexer. If there are two valid tokens that match the same sequence of characters, Tree-sitter will select the one with the higher precedence. + +### Keywords + +If your language has keywords that are also matched by a more general rule (typically `identifier`), you can tell Tree-sitter about that rule with your grammar's `word` property. + +```js +grammar({ + word: $ => $.identifier, + + rules: { + class_declaration: $ => seq( + 'class', + $.identifier, + $.class_body + ), + + break_statement: $ => seq('break', ';'), + + continue_statement: $ => seq('continue', ';'), + + identifier: $ => /[a-z]+/ + } +}) +``` + +In this case, we're specifying `identifier` as our `word`. Tree-sitter will automatically find the set of terminals which are matched by `$.identifier`, and consider them keywords. 
Instead of generating a parser which scans for each keyword individually, Tree-sitter will generate a parser that tries to match the word rule (in this case, `identifier`), and checks to see if the matched word is the necessary keyword. + +This makes the set of parse states smaller, so the parser compiles faster. + +It *also changes behavior*. Consider this grammar: + +```js +grammar({ + rules: { + import: $ => seq( + 'import', + $.identifier, + 'as', + $.identifier + ), + + identifier: $ => /[a-z]+/ + } +}) +``` + +Without the `word` directive, the grammar matches this input: + +``` +import foo asbar +``` + +Which is probably not what you want. If we add `word: $ => $.identifier`, this will no longer parse. When we try to parse `'as'`, we will parse a word — which will be the identifier ``'asbar'``—and then compare it to `'as'`, correctly generating an error. + +[lexing]: https://en.wikipedia.org/wiki/Lexical_analysis +[longest-match]: https://en.wikipedia.org/wiki/Maximal_munch [cst]: https://en.wikipedia.org/wiki/Parse_tree +[dfa]: https://en.wikipedia.org/wiki/Deterministic_finite_automaton [non-terminal]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols [language-spec]: https://en.wikipedia.org/wiki/Programming_language_specification [glr-parsing]: https://en.wikipedia.org/wiki/GLR_parser diff --git a/include/tree_sitter/compiler.h b/include/tree_sitter/compiler.h index ca2a28f7..3db2f7ca 100644 --- a/include/tree_sitter/compiler.h +++ b/include/tree_sitter/compiler.h @@ -19,6 +19,7 @@ typedef enum { TSCompileErrorTypeEpsilonRule, TSCompileErrorTypeInvalidTokenContents, TSCompileErrorTypeInvalidRuleName, + TSCompileErrorTypeInvalidWordRule, } TSCompileErrorType; typedef struct { diff --git a/src/compiler/build_tables/lex_table_builder.cc b/src/compiler/build_tables/lex_table_builder.cc index 178cfb75..d0f363d1 100644 --- a/src/compiler/build_tables/lex_table_builder.cc +++ b/src/compiler/build_tables/lex_table_builder.cc @@ -49,6 +49,19 @@ using 
rules::Symbol; using rules::Metadata; using rules::Seq; +enum ConflictStatus { + DoesNotMatch = 0, + MatchesShorterStringWithinSeparators = 1 << 0, + MatchesSameString = 1 << 1, + MatchesLongerString = 1 << 2, + MatchesLongerStringWithValidNextChar = 1 << 3, + CannotDistinguish = ( + MatchesShorterStringWithinSeparators | + MatchesSameString | + MatchesLongerStringWithValidNextChar + ), +}; + static const std::unordered_set EMPTY; bool CoincidentTokenIndex::contains(Symbol a, Symbol b) const { @@ -65,14 +78,12 @@ const std::unordered_set &CoincidentTokenIndex::states_with(Symbol } } -template -class CharacterAggregator { +class StartingCharacterAggregator { public: void apply(const Rule &rule) { rule.match( [this](const Seq &sequence) { apply(*sequence.left); - if (include_all) apply(*sequence.right); }, [this](const rules::Choice &rule) { @@ -91,9 +102,6 @@ class CharacterAggregator { CharacterSet result; }; -using StartingCharacterAggregator = CharacterAggregator; -using AllCharacterAggregator = CharacterAggregator; - class LexTableBuilderImpl : public LexTableBuilder { LexTable main_lex_table; LexTable keyword_lex_table; @@ -109,7 +117,7 @@ class LexTableBuilderImpl : public LexTableBuilder { vector conflict_matrix; bool conflict_detection_mode; LookaheadSet keyword_symbols; - Symbol keyword_capture_token; + Symbol word_rule; char encoding_buffer[8]; public: @@ -125,7 +133,7 @@ class LexTableBuilderImpl : public LexTableBuilder { parse_table(parse_table), conflict_matrix(lexical_grammar.variables.size() * lexical_grammar.variables.size(), DoesNotMatch), conflict_detection_mode(false), - keyword_capture_token(rules::NONE()) { + word_rule(syntax_grammar.word_rule) { // Compute the possible separator rules and the set of separator characters that can occur // immediately after any token. @@ -141,7 +149,6 @@ class LexTableBuilderImpl : public LexTableBuilder { // characters that can follow each token. 
Also identify all of the tokens that can be // considered 'keywords'. LOG_START("characterizing tokens"); - LookaheadSet potential_keyword_symbols; for (unsigned i = 0, n = grammar.variables.size(); i < n; i++) { Symbol token = Symbol::terminal(i); @@ -158,31 +165,6 @@ class LexTableBuilderImpl : public LexTableBuilder { }); } following_characters_by_token[i] = following_character_aggregator.result; - - AllCharacterAggregator all_character_aggregator; - all_character_aggregator.apply(grammar.variables[i].rule); - - if ( - !starting_character_aggregator.result.includes_all && - !all_character_aggregator.result.includes_all - ) { - bool starts_alpha = true, all_alnum = true; - for (auto character : starting_character_aggregator.result.included_chars) { - if (!iswalpha(character) && character != '_') { - starts_alpha = false; - } - } - for (auto character : all_character_aggregator.result.included_chars) { - if (!iswalnum(character) && character != '_') { - all_alnum = false; - } - } - if (starts_alpha && all_alnum) { - LOG("potential keyword: %s", token_name(token).c_str()); - potential_keyword_symbols.insert(token); - } - } - } LOG_END(); @@ -205,98 +187,83 @@ class LexTableBuilderImpl : public LexTableBuilder { } LOG_END(); - LOG_START("finding keyword capture token"); - for (Symbol::Index i = 0, n = grammar.variables.size(); i < n; i++) { - Symbol candidate = Symbol::terminal(i); + if (word_rule != rules::NONE()) { + identify_keywords(); + } + } - LookaheadSet homonyms; - potential_keyword_symbols.for_each([&](Symbol other_token) { - if (get_conflict_status(other_token, candidate) & MatchesShorterStringWithinSeparators) { - homonyms.clear(); - return false; - } - if (get_conflict_status(candidate, other_token) == MatchesSameString) { - homonyms.insert(other_token); - } - return true; - }); - if (homonyms.empty()) continue; - - LOG_START( - "keyword capture token candidate: %s, homonym count: %lu", - token_name(candidate).c_str(), - homonyms.size() - ); - - 
homonyms.for_each([&](Symbol homonym1) { - homonyms.for_each([&](Symbol homonym2) { - if (get_conflict_status(homonym1, homonym2) & MatchesSameString) { - LOG( - "conflict between homonyms %s %s", - token_name(homonym1).c_str(), - token_name(homonym2).c_str() - ); - homonyms.remove(homonym1); - } - return false; - }); - return true; - }); - - for (Symbol::Index j = 0; j < n; j++) { - Symbol other_token = Symbol::terminal(j); - if (other_token == candidate || homonyms.contains(other_token)) continue; - bool candidate_shadows_other = get_conflict_status(other_token, candidate); - bool other_shadows_candidate = get_conflict_status(candidate, other_token); - - if (candidate_shadows_other || other_shadows_candidate) { - homonyms.for_each([&](Symbol homonym) { - bool other_shadows_homonym = get_conflict_status(homonym, other_token); - - bool candidate_was_already_present = true; - for (ParseStateId state_id : coincident_token_index.states_with(homonym, other_token)) { - if (!parse_table->states[state_id].has_terminal_entry(candidate)) { - candidate_was_already_present = false; - break; - } - } - if (candidate_was_already_present) return true; - - if (candidate_shadows_other) { - homonyms.remove(homonym); - LOG( - "remove %s because candidate would shadow %s", - token_name(homonym).c_str(), - token_name(other_token).c_str() - ); - } else if (other_shadows_candidate && !other_shadows_homonym) { - homonyms.remove(homonym); - LOG( - "remove %s because %s would shadow candidate", - token_name(homonym).c_str(), - token_name(other_token).c_str() - ); - } - return true; - }); - } + void identify_keywords() { + LookaheadSet homonyms; + for (Symbol::Index j = 0, n = grammar.variables.size(); j < n; j++) { + Symbol other_token = Symbol::terminal(j); + if (get_conflict_status(word_rule, other_token) == MatchesSameString) { + homonyms.insert(other_token); } - - if (homonyms.size() > keyword_symbols.size()) { - LOG_START("found capture token. 
homonyms:"); - homonyms.for_each([&](Symbol homonym) { - LOG("%s", token_name(homonym).c_str()); - return true; - }); - LOG_END(); - keyword_symbols = homonyms; - keyword_capture_token = candidate; - } - - LOG_END(); } - LOG_END(); + homonyms.for_each([&](Symbol homonym1) { + homonyms.for_each([&](Symbol homonym2) { + if (get_conflict_status(homonym1, homonym2) & MatchesSameString) { + LOG( + "conflict between homonyms %s %s", + token_name(homonym1).c_str(), + token_name(homonym2).c_str() + ); + homonyms.remove(homonym1); + } + return false; + }); + return true; + }); + + for (Symbol::Index j = 0, n = grammar.variables.size(); j < n; j++) { + Symbol other_token = Symbol::terminal(j); + if (other_token == word_rule || homonyms.contains(other_token)) continue; + bool word_rule_shadows_other = get_conflict_status(other_token, word_rule); + bool other_shadows_word_rule = get_conflict_status(word_rule, other_token); + + if (word_rule_shadows_other || other_shadows_word_rule) { + homonyms.for_each([&](Symbol homonym) { + bool other_shadows_homonym = get_conflict_status(homonym, other_token); + + bool word_rule_was_already_present = true; + for (ParseStateId state_id : coincident_token_index.states_with(homonym, other_token)) { + if (!parse_table->states[state_id].has_terminal_entry(word_rule)) { + word_rule_was_already_present = false; + break; + } + } + if (word_rule_was_already_present) return true; + + if (word_rule_shadows_other) { + homonyms.remove(homonym); + LOG( + "remove %s because word_rule would shadow %s", + token_name(homonym).c_str(), + token_name(other_token).c_str() + ); + } else if (other_shadows_word_rule && !other_shadows_homonym) { + homonyms.remove(homonym); + LOG( + "remove %s because %s would shadow word_rule", + token_name(homonym).c_str(), + token_name(other_token).c_str() + ); + } + return true; + }); + } + } + + if (!homonyms.empty()) { + LOG_START("found keywords:"); + homonyms.for_each([&](Symbol homonym) { + LOG("%s", 
token_name(homonym).c_str()); + return true; + }); + LOG_END(); + keyword_symbols = homonyms; + } } BuildResult build() { @@ -307,8 +274,8 @@ class LexTableBuilderImpl : public LexTableBuilder { for (ParseState &parse_state : parse_table->states) { LookaheadSet token_set; for (auto &entry : parse_state.terminal_entries) { - if (keyword_capture_token.is_terminal() && keyword_symbols.contains(entry.first)) { - token_set.insert(keyword_capture_token); + if (word_rule.is_terminal() && keyword_symbols.contains(entry.first)) { + token_set.insert(word_rule); } else { token_set.insert(entry.first); } @@ -337,7 +304,19 @@ class LexTableBuilderImpl : public LexTableBuilder { mark_fragile_tokens(); remove_duplicate_lex_states(main_lex_table); - return {main_lex_table, keyword_lex_table, keyword_capture_token}; + return {main_lex_table, keyword_lex_table, word_rule}; + } + + bool does_token_shadow_other(Symbol token, Symbol shadowed_token) const { + if (token == word_rule && keyword_symbols.contains(shadowed_token)) return false; + return get_conflict_status(shadowed_token, token) & ( + MatchesShorterStringWithinSeparators | + MatchesLongerStringWithValidNextChar + ); + } + + bool does_token_match_same_string_as_other(Symbol token, Symbol shadowed_token) const { + return get_conflict_status(shadowed_token, token) & MatchesSameString; } ConflictStatus get_conflict_status(Symbol shadowed_token, Symbol other_token) const { @@ -410,12 +389,14 @@ class LexTableBuilderImpl : public LexTableBuilder { advance_symbol, MatchesLongerStringWithValidNextChar )) { - LOG( - "%s shadows %s followed by '%s'", - token_name(advance_symbol).c_str(), - token_name(accept_action.symbol).c_str(), - log_char(*conflicting_following_chars.included_chars.begin()) - ); + if (!conflicting_following_chars.included_chars.empty()) { + LOG( + "%s shadows %s followed by '%s'", + token_name(advance_symbol).c_str(), + token_name(accept_action.symbol).c_str(), + 
log_char(*conflicting_following_chars.included_chars.begin()) + ); + } } } } @@ -665,8 +646,12 @@ LexTableBuilder::BuildResult LexTableBuilder::build() { return static_cast(this)->build(); } -ConflictStatus LexTableBuilder::get_conflict_status(Symbol a, Symbol b) const { - return static_cast(this)->get_conflict_status(a, b); +bool LexTableBuilder::does_token_shadow_other(Symbol a, Symbol b) const { + return static_cast(this)->does_token_shadow_other(a, b); +} + +bool LexTableBuilder::does_token_match_same_string_as_other(Symbol a, Symbol b) const { + return static_cast(this)->does_token_match_same_string_as_other(a, b); } } // namespace build_tables diff --git a/src/compiler/build_tables/lex_table_builder.h b/src/compiler/build_tables/lex_table_builder.h index 4ec4f22b..d69b996b 100644 --- a/src/compiler/build_tables/lex_table_builder.h +++ b/src/compiler/build_tables/lex_table_builder.h @@ -30,19 +30,6 @@ namespace build_tables { class LookaheadSet; -enum ConflictStatus { - DoesNotMatch = 0, - MatchesShorterStringWithinSeparators = 1 << 0, - MatchesSameString = 1 << 1, - MatchesLongerString = 1 << 2, - MatchesLongerStringWithValidNextChar = 1 << 3, - CannotDistinguish = ( - MatchesShorterStringWithinSeparators | - MatchesSameString | - MatchesLongerStringWithValidNextChar - ), -}; - struct CoincidentTokenIndex { std::unordered_map< std::pair, @@ -69,7 +56,8 @@ class LexTableBuilder { BuildResult build(); - ConflictStatus get_conflict_status(rules::Symbol, rules::Symbol) const; + bool does_token_shadow_other(rules::Symbol, rules::Symbol) const; + bool does_token_match_same_string_as_other(rules::Symbol, rules::Symbol) const; protected: LexTableBuilder() = default; diff --git a/src/compiler/build_tables/parse_table_builder.cc b/src/compiler/build_tables/parse_table_builder.cc index 0e6b4247..26dae5b7 100644 --- a/src/compiler/build_tables/parse_table_builder.cc +++ b/src/compiler/build_tables/parse_table_builder.cc @@ -134,11 +134,6 @@ class ParseTableBuilderImpl : 
public ParseTableBuilder { } void build_error_parse_state(ParseStateId state_id) { - unsigned CannotMerge = ( - MatchesShorterStringWithinSeparators | - MatchesLongerStringWithValidNextChar - ); - parse_table.states[state_id].terminal_entries.clear(); // First, identify the conflict-free tokens. @@ -149,7 +144,7 @@ class ParseTableBuilderImpl : public ParseTableBuilder { for (unsigned j = 0; j < lexical_grammar.variables.size(); j++) { Symbol other_token = Symbol::terminal(j); if (!coincident_token_index.contains(token, other_token) && - (lex_table_builder->get_conflict_status(other_token, token) & CannotMerge)) { + lex_table_builder->does_token_shadow_other(token, other_token)) { conflicts_with_other_tokens = true; break; } @@ -171,7 +166,7 @@ class ParseTableBuilderImpl : public ParseTableBuilder { bool conflicts_with_other_tokens = false; conflict_free_tokens.for_each([&](Symbol other_token) { if (!coincident_token_index.contains(token, other_token) && - (lex_table_builder->get_conflict_status(other_token, token) & CannotMerge)) { + lex_table_builder->does_token_shadow_other(token, other_token)) { LOG( "exclude %s: conflicts with %s", symbol_name(token).c_str(), @@ -517,7 +512,8 @@ class ParseTableBuilderImpl : public ParseTableBuilder { // Do not add a token if it conflicts with an existing token. 
if (!new_token.is_built_in()) { for (const auto &entry : state.terminal_entries) { - if (lex_table_builder->get_conflict_status(entry.first, new_token) & CannotDistinguish) { + if (lex_table_builder->does_token_shadow_other(new_token, entry.first) || + lex_table_builder->does_token_match_same_string_as_other(new_token, entry.first)) { LOG_IF( logged_conflict_tokens.insert({entry.first, new_token}).second, "cannot merge parse states due to token conflict: %s and %s", diff --git a/src/compiler/grammar.h b/src/compiler/grammar.h index 6c63340c..5e2212fb 100644 --- a/src/compiler/grammar.h +++ b/src/compiler/grammar.h @@ -32,6 +32,7 @@ struct InputGrammar { std::vector> expected_conflicts; std::vector external_tokens; std::unordered_set variables_to_inline; + rules::NamedSymbol word_rule; }; } // namespace tree_sitter diff --git a/src/compiler/log.cc b/src/compiler/log.cc index c0f3a03c..4b1e3dbf 100644 --- a/src/compiler/log.cc +++ b/src/compiler/log.cc @@ -1,4 +1,5 @@ #include "compiler/log.h" +#include static const char *SPACES = " "; @@ -21,6 +22,7 @@ void _indent_logs() { } void _outdent_logs() { + assert(_indent_level > 0); _indent_level--; } diff --git a/src/compiler/parse_grammar.cc b/src/compiler/parse_grammar.cc index e233cfe0..f589d15a 100644 --- a/src/compiler/parse_grammar.cc +++ b/src/compiler/parse_grammar.cc @@ -229,7 +229,9 @@ ParseGrammarResult parse_grammar(const string &input) { string error_message; string name; InputGrammar grammar; - json_value name_json, rules_json, extras_json, conflicts_json, external_tokens_json, inline_rules_json; + json_value + name_json, rules_json, extras_json, conflicts_json, external_tokens_json, + inline_rules_json, word_rule_json; json_settings settings = { 0, json_enable_comments, 0, 0, 0, 0 }; char parse_error[json_error_max]; @@ -359,6 +361,16 @@ ParseGrammarResult parse_grammar(const string &input) { } } + word_rule_json = grammar_json->operator[]("word"); + if (word_rule_json.type != json_none) { + if 
(word_rule_json.type != json_string) { + error_message = "Invalid word property"; + goto error; + } + + grammar.word_rule = NamedSymbol { word_rule_json.u.string.ptr }; + } + json_value_free(grammar_json); return { name, grammar, "" }; diff --git a/src/compiler/prepare_grammar/expand_repeats.cc b/src/compiler/prepare_grammar/expand_repeats.cc index 46230867..42878376 100644 --- a/src/compiler/prepare_grammar/expand_repeats.cc +++ b/src/compiler/prepare_grammar/expand_repeats.cc @@ -106,6 +106,7 @@ InitialSyntaxGrammar expand_repeats(const InitialSyntaxGrammar &grammar) { expander.aux_rules.end() ); + result.word_rule = grammar.word_rule; return result; } diff --git a/src/compiler/prepare_grammar/extract_tokens.cc b/src/compiler/prepare_grammar/extract_tokens.cc index 93b06be2..c82b3505 100644 --- a/src/compiler/prepare_grammar/extract_tokens.cc +++ b/src/compiler/prepare_grammar/extract_tokens.cc @@ -329,6 +329,18 @@ tuple extract_tokens( } } + syntax_grammar.word_rule = symbol_replacer.replace_symbol(grammar.word_rule); + if (syntax_grammar.word_rule.is_non_terminal()) { + return make_tuple( + syntax_grammar, + lexical_grammar, + CompileError( + TSCompileErrorTypeInvalidWordRule, + "Word rules must be tokens" + ) + ); + } + return make_tuple(syntax_grammar, lexical_grammar, CompileError::none()); } diff --git a/src/compiler/prepare_grammar/flatten_grammar.cc b/src/compiler/prepare_grammar/flatten_grammar.cc index e135ee67..ebfc3ae4 100644 --- a/src/compiler/prepare_grammar/flatten_grammar.cc +++ b/src/compiler/prepare_grammar/flatten_grammar.cc @@ -161,6 +161,8 @@ pair flatten_grammar(const InitialSyntaxGrammar &gr i++; } + result.word_rule = grammar.word_rule; + return {result, CompileError::none()}; } diff --git a/src/compiler/prepare_grammar/initial_syntax_grammar.h b/src/compiler/prepare_grammar/initial_syntax_grammar.h index 881c6396..4f21e3cd 100644 --- a/src/compiler/prepare_grammar/initial_syntax_grammar.h +++ 
b/src/compiler/prepare_grammar/initial_syntax_grammar.h @@ -17,6 +17,7 @@ struct InitialSyntaxGrammar { std::set> expected_conflicts; std::vector external_tokens; std::set variables_to_inline; + rules::Symbol word_rule; }; } // namespace prepare_grammar diff --git a/src/compiler/prepare_grammar/intern_symbols.cc b/src/compiler/prepare_grammar/intern_symbols.cc index 4e610960..dc128779 100644 --- a/src/compiler/prepare_grammar/intern_symbols.cc +++ b/src/compiler/prepare_grammar/intern_symbols.cc @@ -166,6 +166,8 @@ pair intern_symbols(const InputGrammar &grammar) } } + result.word_rule = interner.intern_symbol(grammar.word_rule); + return {result, CompileError::none()}; } diff --git a/src/compiler/prepare_grammar/interned_grammar.h b/src/compiler/prepare_grammar/interned_grammar.h index 83117ced..fc322522 100644 --- a/src/compiler/prepare_grammar/interned_grammar.h +++ b/src/compiler/prepare_grammar/interned_grammar.h @@ -15,8 +15,8 @@ struct InternedGrammar { std::vector extra_tokens; std::set> expected_conflicts; std::vector external_tokens; - std::set blank_external_tokens; std::set variables_to_inline; + rules::Symbol word_rule; }; } // namespace prepare_grammar diff --git a/src/compiler/syntax_grammar.h b/src/compiler/syntax_grammar.h index 2d55686b..7d2b1be1 100644 --- a/src/compiler/syntax_grammar.h +++ b/src/compiler/syntax_grammar.h @@ -60,6 +60,7 @@ struct SyntaxGrammar { std::set> expected_conflicts; std::vector external_tokens; std::set variables_to_inline; + rules::Symbol word_rule; }; } // namespace tree_sitter diff --git a/src/runtime/array.h b/src/runtime/array.h index 45b3adaa..b32487a2 100644 --- a/src/runtime/array.h +++ b/src/runtime/array.h @@ -110,7 +110,7 @@ static inline void array__grow(VoidArray *self, size_t element_size) { static inline void array__splice(VoidArray *self, size_t element_size, uint32_t index, uint32_t old_count, - uint32_t new_count, void *elements) { + uint32_t new_count, const void *elements) { uint32_t new_size = 
self->size + new_count - old_count; uint32_t old_end = index + old_count; uint32_t new_end = index + new_count; diff --git a/src/runtime/node.c b/src/runtime/node.c index 607cf9de..0855ec66 100644 --- a/src/runtime/node.c +++ b/src/runtime/node.c @@ -28,11 +28,11 @@ static inline TSNode ts_node__null() { // TSNode - accessors -uint32_t ts_node_start_byte(const TSNode self) { +uint32_t ts_node_start_byte(TSNode self) { return self.context[0]; } -TSPoint ts_node_start_point(const TSNode self) { +TSPoint ts_node_start_point(TSNode self) { return (TSPoint) {self.context[1], self.context[2]}; } diff --git a/src/runtime/subtree.c b/src/runtime/subtree.c index 9b2d954f..83a1ad4e 100644 --- a/src/runtime/subtree.c +++ b/src/runtime/subtree.c @@ -59,7 +59,7 @@ bool ts_external_scanner_state_eq(const ExternalScannerState *a, const ExternalS // SubtreeArray bool ts_subtree_array_copy(SubtreeArray self, SubtreeArray *dest) { - const Subtree **contents = NULL; + Subtree **contents = NULL; if (self.capacity > 0) { contents = ts_calloc(self.capacity, sizeof(Subtree *)); memcpy(contents, self.contents, self.size * sizeof(Subtree *)); diff --git a/test/compiler/build_tables/parse_item_set_builder_test.cc b/test/compiler/build_tables/parse_item_set_builder_test.cc index 6cf5bb0e..6c41c3ca 100644 --- a/test/compiler/build_tables/parse_item_set_builder_test.cc +++ b/test/compiler/build_tables/parse_item_set_builder_test.cc @@ -25,7 +25,8 @@ describe("ParseItemSetBuilder", []() { LexicalGrammar lexical_grammar{lexical_variables, {}}; it("adds items at the beginnings of referenced rules", [&]() { - SyntaxGrammar grammar{{ + SyntaxGrammar grammar; + grammar.variables = { SyntaxVariable{"rule0", VariableTypeNamed, { Production({ {Symbol::non_terminal(1), 0, AssociativityNone, Alias{}}, @@ -47,7 +48,7 @@ describe("ParseItemSetBuilder", []() { {Symbol::terminal(15), 0, AssociativityNone, Alias{}}, }, 0) }}, - }, {}, {}, {}, {}}; + }; auto production = [&](int variable_index, int 
production_index) -> const Production & {
       return grammar.variables[variable_index].productions[production_index];
@@ -84,7 +85,8 @@ describe("ParseItemSetBuilder", []() {
   });

   it("handles rules with empty productions", [&]() {
-    SyntaxGrammar grammar{{
+    SyntaxGrammar grammar;
+    grammar.variables = {
       SyntaxVariable{"rule0", VariableTypeNamed, {
         Production({
           {Symbol::non_terminal(1), 0, AssociativityNone, Alias{}},
@@ -98,7 +100,7 @@ describe("ParseItemSetBuilder", []() {
         }, 0),
         Production{{}, 0}
       }},
-    }, {}, {}, {}, {}};
+    };

     auto production = [&](int variable_index, int production_index) -> const Production & {
       return grammar.variables[variable_index].productions[production_index];
diff --git a/test/compiler/prepare_grammar/expand_repeats_test.cc b/test/compiler/prepare_grammar/expand_repeats_test.cc
index 250bd59b..f7aaa8fe 100644
--- a/test/compiler/prepare_grammar/expand_repeats_test.cc
+++ b/test/compiler/prepare_grammar/expand_repeats_test.cc
@@ -11,11 +11,9 @@ START_TEST

 describe("expand_repeats", []() {
   it("replaces repeat rules with pairs of recursive rules", [&]() {
-    InitialSyntaxGrammar grammar{
-      {
-        Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(0)}},
-      },
-      {}, {}, {}, {}
+    InitialSyntaxGrammar grammar;
+    grammar.variables = {
+      Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(0)}},
     };

     auto result = expand_repeats(grammar);
@@ -30,14 +28,12 @@ describe("expand_repeats", []() {
   });

   it("replaces repeats inside of sequences", [&]() {
-    InitialSyntaxGrammar grammar{
-      {
-        Variable{"rule0", VariableTypeNamed, Rule::seq({
-          Symbol::terminal(10),
-          Repeat{Symbol::terminal(11)},
-        })},
-      },
-      {}, {}, {}, {}
+    InitialSyntaxGrammar grammar;
+    grammar.variables = {
+      Variable{"rule0", VariableTypeNamed, Rule::seq({
+        Symbol::terminal(10),
+        Repeat{Symbol::terminal(11)},
+      })},
     };

     auto result = expand_repeats(grammar);
@@ -55,14 +51,12 @@ describe("expand_repeats", []() {
   });

   it("replaces repeats inside of choices", [&]() {
-    InitialSyntaxGrammar grammar{
-      {
-        Variable{"rule0", VariableTypeNamed, Rule::choice({
-          Symbol::terminal(10),
-          Repeat{Symbol::terminal(11)}
-        })},
-      },
-      {}, {}, {}, {}
+    InitialSyntaxGrammar grammar;
+    grammar.variables = {
+      Variable{"rule0", VariableTypeNamed, Rule::choice({
+        Symbol::terminal(10),
+        Repeat{Symbol::terminal(11)}
+      })},
     };

     auto result = expand_repeats(grammar);
@@ -80,18 +74,16 @@ describe("expand_repeats", []() {
   });

   it("does not create redundant auxiliary rules", [&]() {
-    InitialSyntaxGrammar grammar{
-      {
-        Variable{"rule0", VariableTypeNamed, Rule::choice({
-          Rule::seq({ Symbol::terminal(1), Repeat{Symbol::terminal(4)} }),
-          Rule::seq({ Symbol::terminal(2), Repeat{Symbol::terminal(4)} }),
-        })},
-        Variable{"rule1", VariableTypeNamed, Rule::seq({
-          Symbol::terminal(3),
-          Repeat{Symbol::terminal(4)}
-        })},
-      },
-      {}, {}, {}, {}
+    InitialSyntaxGrammar grammar;
+    grammar.variables = {
+      Variable{"rule0", VariableTypeNamed, Rule::choice({
+        Rule::seq({ Symbol::terminal(1), Repeat{Symbol::terminal(4)} }),
+        Rule::seq({ Symbol::terminal(2), Repeat{Symbol::terminal(4)} }),
+      })},
+      Variable{"rule1", VariableTypeNamed, Rule::seq({
+        Symbol::terminal(3),
+        Repeat{Symbol::terminal(4)}
+      })},
     };

     auto result = expand_repeats(grammar);
@@ -113,14 +105,14 @@ describe("expand_repeats", []() {
   });

   it("can replace multiple repeats in the same rule", [&]() {
-    InitialSyntaxGrammar grammar{
+    InitialSyntaxGrammar grammar;
+    grammar.variables = {
       {
         Variable{"rule0", VariableTypeNamed, Rule::seq({
           Repeat{Symbol::terminal(10)},
           Repeat{Symbol::terminal(11)},
         })},
-      },
-      {}, {}, {}, {}
+      }
     };

     auto result = expand_repeats(grammar);
@@ -142,12 +134,10 @@ describe("expand_repeats", []() {
   });

   it("can replace repeats in multiple rules", [&]() {
-    InitialSyntaxGrammar grammar{
-      {
-        Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(10)}},
-        Variable{"rule1", VariableTypeNamed, Repeat{Symbol::terminal(11)}},
-      },
-      {}, {}, {}, {}
+    InitialSyntaxGrammar grammar;
+    grammar.variables = {
+      Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(10)}},
+      Variable{"rule1", VariableTypeNamed, Repeat{Symbol::terminal(11)}},
     };

     auto result = expand_repeats(grammar);
diff --git a/test/compiler/prepare_grammar/intern_symbols_test.cc b/test/compiler/prepare_grammar/intern_symbols_test.cc
index 65bad45e..7b7f3624 100644
--- a/test/compiler/prepare_grammar/intern_symbols_test.cc
+++ b/test/compiler/prepare_grammar/intern_symbols_test.cc
@@ -11,13 +11,11 @@ using prepare_grammar::intern_symbols;

 describe("intern_symbols", []() {
   it("replaces named symbols with numerically-indexed symbols", [&]() {
-    InputGrammar grammar{
-      {
-        {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"_z"} })},
-        {"y", VariableTypeNamed, NamedSymbol{"_z"}},
-        {"_z", VariableTypeNamed, String{"stuff"}}
-      },
-      {}, {}, {}, {}
+    InputGrammar grammar;
+    grammar.variables = {
+      {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"_z"} })},
+      {"y", VariableTypeNamed, NamedSymbol{"_z"}},
+      {"_z", VariableTypeNamed, String{"stuff"}}
     };

     auto result = intern_symbols(grammar);
@@ -32,11 +30,9 @@ describe("intern_symbols", []() {

   describe("when there are symbols that reference undefined rules", [&]() {
     it("returns an error", []() {
-      InputGrammar grammar{
-        {
-          {"x", VariableTypeNamed, NamedSymbol{"y"}},
-        },
-        {}, {}, {}, {}
+      InputGrammar grammar;
+      grammar.variables = {
+        {"x", VariableTypeNamed, NamedSymbol{"y"}},
       };

       auto result = intern_symbols(grammar);
@@ -46,16 +42,14 @@ describe("intern_symbols", []() {
   });

   it("translates the grammar's optional 'extra_tokens' to numerical symbols", [&]() {
-    InputGrammar grammar{
-      {
-        {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
-        {"y", VariableTypeNamed, NamedSymbol{"z"}},
-        {"z", VariableTypeNamed, String{"stuff"}}
-      },
-      {
-        NamedSymbol{"z"}
-      },
-      {}, {}, {}
+    InputGrammar grammar;
+    grammar.variables = {
+      {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
+      {"y", VariableTypeNamed, NamedSymbol{"z"}},
+      {"z", VariableTypeNamed, String{"stuff"}}
+    };
+    grammar.extra_tokens = {
+      NamedSymbol{"z"}
     };

     auto result = intern_symbols(grammar);
@@ -66,19 +60,15 @@ describe("intern_symbols", []() {
   });

   it("records any rule names that match external token names", [&]() {
-    InputGrammar grammar{
-      {
-        {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
-        {"y", VariableTypeNamed, NamedSymbol{"z"}},
-        {"z", VariableTypeNamed, String{"stuff"}},
-      },
-      {},
-      {},
-      {
-        NamedSymbol{"w"},
-        NamedSymbol{"z"},
-      },
-      {}
+    InputGrammar grammar;
+    grammar.variables = {
+      {"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
+      {"y", VariableTypeNamed, NamedSymbol{"z"}},
+      {"z", VariableTypeNamed, String{"stuff"}},
+    };
+    grammar.external_tokens = {
+      NamedSymbol{"w"},
+      NamedSymbol{"z"},
     };

     auto result = intern_symbols(grammar);