Merge pull request #176 from tree-sitter/explicit-word-token
Perform keyword optimization using explicitly selected word token
Commit 245052442a
22 changed files with 305 additions and 247 deletions
@@ -32,6 +32,7 @@
<div id="current-page-table-of-contents">
{% capture whitespace %}
{% assign min_header = 2 %}
{% assign max_header = 3 %}
{% assign nodes = content | split: "<h" %}
{% assign first_header = true %}
{% for node in nodes %}

@@ -41,7 +42,7 @@
{% assign header_level = node | replace: '"', '' | slice: 0, 1 | times: 1 %}

{% if header_level < min_header or header_level > maxHeader %}
{% if header_level < min_header or header_level > max_header %}
{% continue %}
{% endif %}

@@ -127,7 +128,7 @@
}
});

$('h1, h2, h3, h4, h5, h6').filter('[id]').each(function() {
$('h1, h2, h3').filter('[id]').each(function() {
$(this).html('<a href="#'+$(this).attr('id')+'">' + $(this).text() + '</a>');
});
</script>

@@ -211,12 +211,13 @@ The following is a complete list of built-in functions you can use to define Tree-sitter grammars:
* **Tokens : `token(rule)`** - This function marks the given rule as producing only a single token. Tree-sitter's default is to treat each String or RegExp literal in the grammar as a separate token. Each token is matched separately by the lexer and returned as its own leaf node in the tree. The `token` function allows you to express a complex rule using the functions described above (rather than as a single regular expression) but still have Tree-sitter treat it as a single token.
* **Aliases : `alias(rule, name)`** - This function causes the given rule to *appear* with an alternative name in the syntax tree. It is useful in cases where a language construct needs to be parsed differently in different contexts (and thus needs to be defined using multiple symbols), but should always *appear* as the same type of node. Both functions are illustrated in the sketch after this list.
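
As a rough illustration (the grammar name and the rule names `arrow`, `member_access`, `property_identifier`, and `identifier` are invented here, not taken from any particular grammar), `token` and `alias` might be used like this:

```js
grammar({
  name: 'token_alias_example',

  rules: {
    // `token` makes the whole sequence lex as a single `=>` token,
    // rather than as separate `=` and `>` tokens.
    arrow: $ => token(seq('=', '>')),

    // In this context the matched identifier appears in the syntax tree
    // as a `property_identifier` node instead of an `identifier` node.
    member_access: $ => seq(
      $.identifier,
      '.',
      alias($.identifier, $.property_identifier)
    ),

    identifier: $ => /[a-z]+/
  }
});
```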

In addition to the `name` and `rules` fields, grammars have a few other public fields that influence the behavior of the parser.
In addition to the `name` and `rules` fields, grammars have a few other optional public fields that influence the behavior of the parser.

* `extras` - an array of tokens that may appear *anywhere* in the language. This is often used for whitespace and comments. The default for `extras` in `tree-sitter-cli` is to accept whitespace. To control whitespace explicitly, specify `extras: $ => []` in the grammar.
* `inline` - an array of rule names that should be automatically *removed* from the grammar by replacing all of their usages with a copy of their definition. This is useful for rules that are used in multiple places but for which you *don't* want to create syntax tree nodes at runtime.
* `conflicts` - an array of arrays of rule names. Each inner array represents a set of rules that's involved in an *LR(1) conflict* that is *intended to exist* in the grammar. When these conflicts occur at runtime, Tree-sitter will use the GLR algorithm to explore all of the possible interpretations. If *multiple* parses end up succeeding, Tree-sitter will pick the subtree whose corresponding rule has the highest *dynamic precedence*.
* `externals` - an array of token names which can be returned by an *external scanner*. External scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules (e.g. Python's indentation tokens) that cannot be described by regular expressions.
* `word` - the name of a token that will match keywords for the purpose of the [keyword extraction](#keyword-extraction) optimization. A sketch using all of these fields follows this list.
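
To make the shape of these fields concrete, here is a minimal sketch of a grammar that uses all of them. The language name and every rule name below (`comment`, `_expression`, `type_expression`, `value_expression`, `indent`, `dedent`, `identifier`) are invented for illustration only, and the rules are deliberately trivial:

```js
grammar({
  name: 'example_language',

  // Whitespace and comments may appear anywhere between tokens.
  extras: $ => [/\s/, $.comment],

  // `_expression` is expanded in place and never appears in the tree.
  inline: $ => [$._expression],

  // An intended LR(1) conflict, resolved at runtime with the GLR algorithm.
  conflicts: $ => [
    [$.type_expression, $.value_expression]
  ],

  // Tokens produced by a hand-written external scanner.
  externals: $ => [$.indent, $.dedent],

  // The token used for the keyword-extraction optimization.
  word: $ => $.identifier,

  rules: {
    source_file: $ => repeat($._expression),
    _expression: $ => choice($.type_expression, $.value_expression),
    type_expression: $ => $.identifier,
    value_expression: $ => $.identifier,
    comment: $ => token(seq('#', /.*/)),
    identifier: $ => /[a-z]+/
  }
});
```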

## Adjusting existing grammars

@@ -355,11 +356,81 @@ For an expression like `a * b * c`, it's not clear whether we mean `a * (b * c)`

You may have noticed in the above examples that some of the grammar rule names like `_expression` and `_type` begin with an underscore. Starting a rule's name with an underscore causes the rule to be *hidden* in the syntax tree. This is useful for rules like `_expression` in the grammars above, which always just wrap a single child node. If these nodes were not hidden, they would add substantial depth and noise to the syntax tree without making it any easier to understand.
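
As a minimal sketch (a fragment of a `rules` object, with hypothetical rule names rather than the grammars referenced above), a hidden wrapper rule looks like this:

```js
rules: {
  // Because its name starts with an underscore, `_expression` is hidden:
  // the tree contains an `identifier`, `number`, or `string` node directly,
  // with no intermediate `_expression` node wrapping it.
  _expression: $ => choice(
    $.identifier,
    $.number,
    $.string
  ),

  identifier: $ => /[a-z]+/,
  number: $ => /\d+/,
  string: $ => /"[^"]*"/
}
```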

## Dealing with LR conflicts
### Dealing with LR conflicts

TODO
...

## Lexical Analysis

Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and [lexing][lexing] - the process of grouping individual characters into the language's fundamental *tokens*. There are a few important things to know about how Tree-sitter's lexing works.

### Conflict Resolution

Grammars often contain multiple tokens that can match the same characters. For example, a grammar might contain the tokens `"if"` and `/[a-z]+/`. Tree-sitter differentiates between these conflicting tokens in a few ways:

1. **Context-aware lexing** - Tree-sitter performs lexing on-demand, during the parsing process. At any given position in a source document, the lexer only tries to recognize tokens that are *valid* at that position in the document.

2. **Longest-match** - If multiple valid tokens match the characters at a given position in a document, Tree-sitter will select the token that matches the [longest sequence of characters][longest-match].

3. **Lexical Precedence** - When the precedence functions described [above](#using-the-grammar-dsl) are used within the `token` function, the given precedence values serve as instructions to the lexer. If there are two valid tokens that match the same sequence of characters, Tree-sitter will select the one with the higher precedence (see the sketch after this list).
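
As a sketch of lexical precedence (a fragment of a `rules` object; the `unit` and `identifier` rules are hypothetical), two tokens that can match exactly the same string can be disambiguated by giving one of them a higher precedence inside `token`:

```js
rules: {
  // Both rules can match the string "em". Because `unit` is wrapped in
  // `token(prec(1, ...))`, the lexer prefers it wherever both tokens are
  // valid and match the same characters.
  unit: $ => token(prec(1, /px|em|rem/)),

  identifier: $ => /[a-z]+/
}
```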

### Keywords

If your language has keywords that would also be matched by another rule (typically `identifier`), you can tell Tree-sitter about this with your grammar's `word` property.

```js
grammar({
  word: $ => $.identifier,

  rules: {
    class_declaration: $ => seq(
      'class',
      $.identifier,
      $.class_body
    ),

    break_statement: $ => seq('break', ';'),

    continue_statement: $ => seq('continue', ';'),

    identifier: $ => /[a-z]+/
  }
})
```

In this case, we're specifying `identifier` as our `word`. Tree-sitter will automatically find the set of terminals that are matched by `$.identifier` and consider them keywords. Instead of generating a parser which scans for each keyword individually, Tree-sitter will generate a parser that tries to match the word rule (in this case, `identifier`) and then checks whether the matched word is the necessary keyword.

This makes the set of parse states smaller, so the parser compiles faster.

It *also changes behavior*. Consider this grammar:

```js
grammar({
  rules: {
    import: $ => seq(
      'import',
      $.identifier,
      'as',
      $.identifier
    ),

    identifier: $ => /[a-z]+/
  }
})
```

Without the `word` property, the grammar matches this input:

```
import foo asbar
```

This is probably not what you want. If we add `word: $ => $.identifier`, this input will no longer parse: when the parser expects `'as'`, it will lex a whole word (the identifier `'asbar'`), compare it to `'as'`, and correctly generate an error.

[lexing]: https://en.wikipedia.org/wiki/Lexical_analysis
[longest-match]: https://en.wikipedia.org/wiki/Maximal_munch
[cst]: https://en.wikipedia.org/wiki/Parse_tree
[dfa]: https://en.wikipedia.org/wiki/Deterministic_finite_automaton
[non-terminal]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols
[language-spec]: https://en.wikipedia.org/wiki/Programming_language_specification
[glr-parsing]: https://en.wikipedia.org/wiki/GLR_parser
|
||||
|
|
|
|||
|
|
@ -19,6 +19,7 @@ typedef enum {
|
|||
TSCompileErrorTypeEpsilonRule,
|
||||
TSCompileErrorTypeInvalidTokenContents,
|
||||
TSCompileErrorTypeInvalidRuleName,
|
||||
TSCompileErrorTypeInvalidWordRule,
|
||||
} TSCompileErrorType;
|
||||
|
||||
typedef struct {
|
||||
|
|
|
|||
|
|
@ -49,6 +49,19 @@ using rules::Symbol;
|
|||
using rules::Metadata;
|
||||
using rules::Seq;
|
||||
|
||||
enum ConflictStatus {
|
||||
DoesNotMatch = 0,
|
||||
MatchesShorterStringWithinSeparators = 1 << 0,
|
||||
MatchesSameString = 1 << 1,
|
||||
MatchesLongerString = 1 << 2,
|
||||
MatchesLongerStringWithValidNextChar = 1 << 3,
|
||||
CannotDistinguish = (
|
||||
MatchesShorterStringWithinSeparators |
|
||||
MatchesSameString |
|
||||
MatchesLongerStringWithValidNextChar
|
||||
),
|
||||
};
|
||||
|
||||
static const std::unordered_set<ParseStateId> EMPTY;
|
||||
|
||||
bool CoincidentTokenIndex::contains(Symbol a, Symbol b) const {
|
||||
|
|
@ -65,14 +78,12 @@ const std::unordered_set<ParseStateId> &CoincidentTokenIndex::states_with(Symbol
|
|||
}
|
||||
}
|
||||
|
||||
template <bool include_all>
|
||||
class CharacterAggregator {
|
||||
class StartingCharacterAggregator {
|
||||
public:
|
||||
void apply(const Rule &rule) {
|
||||
rule.match(
|
||||
[this](const Seq &sequence) {
|
||||
apply(*sequence.left);
|
||||
if (include_all) apply(*sequence.right);
|
||||
},
|
||||
|
||||
[this](const rules::Choice &rule) {
|
||||
|
|
@ -91,9 +102,6 @@ class CharacterAggregator {
|
|||
CharacterSet result;
|
||||
};
|
||||
|
||||
using StartingCharacterAggregator = CharacterAggregator<false>;
|
||||
using AllCharacterAggregator = CharacterAggregator<true>;
|
||||
|
||||
class LexTableBuilderImpl : public LexTableBuilder {
|
||||
LexTable main_lex_table;
|
||||
LexTable keyword_lex_table;
|
||||
|
|
@ -109,7 +117,7 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
vector<ConflictStatus> conflict_matrix;
|
||||
bool conflict_detection_mode;
|
||||
LookaheadSet keyword_symbols;
|
||||
Symbol keyword_capture_token;
|
||||
Symbol word_rule;
|
||||
char encoding_buffer[8];
|
||||
|
||||
public:
|
||||
|
|
@ -125,7 +133,7 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
parse_table(parse_table),
|
||||
conflict_matrix(lexical_grammar.variables.size() * lexical_grammar.variables.size(), DoesNotMatch),
|
||||
conflict_detection_mode(false),
|
||||
keyword_capture_token(rules::NONE()) {
|
||||
word_rule(syntax_grammar.word_rule) {
|
||||
|
||||
// Compute the possible separator rules and the set of separator characters that can occur
|
||||
// immediately after any token.
|
||||
|
|
@ -141,7 +149,6 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
// characters that can follow each token. Also identify all of the tokens that can be
|
||||
// considered 'keywords'.
|
||||
LOG_START("characterizing tokens");
|
||||
LookaheadSet potential_keyword_symbols;
|
||||
for (unsigned i = 0, n = grammar.variables.size(); i < n; i++) {
|
||||
Symbol token = Symbol::terminal(i);
|
||||
|
||||
|
|
@ -158,31 +165,6 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
});
|
||||
}
|
||||
following_characters_by_token[i] = following_character_aggregator.result;
|
||||
|
||||
AllCharacterAggregator all_character_aggregator;
|
||||
all_character_aggregator.apply(grammar.variables[i].rule);
|
||||
|
||||
if (
|
||||
!starting_character_aggregator.result.includes_all &&
|
||||
!all_character_aggregator.result.includes_all
|
||||
) {
|
||||
bool starts_alpha = true, all_alnum = true;
|
||||
for (auto character : starting_character_aggregator.result.included_chars) {
|
||||
if (!iswalpha(character) && character != '_') {
|
||||
starts_alpha = false;
|
||||
}
|
||||
}
|
||||
for (auto character : all_character_aggregator.result.included_chars) {
|
||||
if (!iswalnum(character) && character != '_') {
|
||||
all_alnum = false;
|
||||
}
|
||||
}
|
||||
if (starts_alpha && all_alnum) {
|
||||
LOG("potential keyword: %s", token_name(token).c_str());
|
||||
potential_keyword_symbols.insert(token);
|
||||
}
|
||||
}
|
||||
|
||||
}
|
||||
LOG_END();
|
||||
|
||||
|
|
@ -205,98 +187,83 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
}
|
||||
LOG_END();
|
||||
|
||||
LOG_START("finding keyword capture token");
|
||||
for (Symbol::Index i = 0, n = grammar.variables.size(); i < n; i++) {
|
||||
Symbol candidate = Symbol::terminal(i);
|
||||
if (word_rule != rules::NONE()) {
|
||||
identify_keywords();
|
||||
}
|
||||
}
|
||||
|
||||
LookaheadSet homonyms;
|
||||
potential_keyword_symbols.for_each([&](Symbol other_token) {
|
||||
if (get_conflict_status(other_token, candidate) & MatchesShorterStringWithinSeparators) {
|
||||
homonyms.clear();
|
||||
return false;
|
||||
}
|
||||
if (get_conflict_status(candidate, other_token) == MatchesSameString) {
|
||||
homonyms.insert(other_token);
|
||||
}
|
||||
return true;
|
||||
});
|
||||
if (homonyms.empty()) continue;
|
||||
|
||||
LOG_START(
|
||||
"keyword capture token candidate: %s, homonym count: %lu",
|
||||
token_name(candidate).c_str(),
|
||||
homonyms.size()
|
||||
);
|
||||
|
||||
homonyms.for_each([&](Symbol homonym1) {
|
||||
homonyms.for_each([&](Symbol homonym2) {
|
||||
if (get_conflict_status(homonym1, homonym2) & MatchesSameString) {
|
||||
LOG(
|
||||
"conflict between homonyms %s %s",
|
||||
token_name(homonym1).c_str(),
|
||||
token_name(homonym2).c_str()
|
||||
);
|
||||
homonyms.remove(homonym1);
|
||||
}
|
||||
return false;
|
||||
});
|
||||
return true;
|
||||
});
|
||||
|
||||
for (Symbol::Index j = 0; j < n; j++) {
|
||||
Symbol other_token = Symbol::terminal(j);
|
||||
if (other_token == candidate || homonyms.contains(other_token)) continue;
|
||||
bool candidate_shadows_other = get_conflict_status(other_token, candidate);
|
||||
bool other_shadows_candidate = get_conflict_status(candidate, other_token);
|
||||
|
||||
if (candidate_shadows_other || other_shadows_candidate) {
|
||||
homonyms.for_each([&](Symbol homonym) {
|
||||
bool other_shadows_homonym = get_conflict_status(homonym, other_token);
|
||||
|
||||
bool candidate_was_already_present = true;
|
||||
for (ParseStateId state_id : coincident_token_index.states_with(homonym, other_token)) {
|
||||
if (!parse_table->states[state_id].has_terminal_entry(candidate)) {
|
||||
candidate_was_already_present = false;
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (candidate_was_already_present) return true;
|
||||
|
||||
if (candidate_shadows_other) {
|
||||
homonyms.remove(homonym);
|
||||
LOG(
|
||||
"remove %s because candidate would shadow %s",
|
||||
token_name(homonym).c_str(),
|
||||
token_name(other_token).c_str()
|
||||
);
|
||||
} else if (other_shadows_candidate && !other_shadows_homonym) {
|
||||
homonyms.remove(homonym);
|
||||
LOG(
|
||||
"remove %s because %s would shadow candidate",
|
||||
token_name(homonym).c_str(),
|
||||
token_name(other_token).c_str()
|
||||
);
|
||||
}
|
||||
return true;
|
||||
});
|
||||
}
|
||||
void identify_keywords() {
|
||||
LookaheadSet homonyms;
|
||||
for (Symbol::Index j = 0, n = grammar.variables.size(); j < n; j++) {
|
||||
Symbol other_token = Symbol::terminal(j);
|
||||
if (get_conflict_status(word_rule, other_token) == MatchesSameString) {
|
||||
homonyms.insert(other_token);
|
||||
}
|
||||
|
||||
if (homonyms.size() > keyword_symbols.size()) {
|
||||
LOG_START("found capture token. homonyms:");
|
||||
homonyms.for_each([&](Symbol homonym) {
|
||||
LOG("%s", token_name(homonym).c_str());
|
||||
return true;
|
||||
});
|
||||
LOG_END();
|
||||
keyword_symbols = homonyms;
|
||||
keyword_capture_token = candidate;
|
||||
}
|
||||
|
||||
LOG_END();
|
||||
}
|
||||
|
||||
LOG_END();
|
||||
homonyms.for_each([&](Symbol homonym1) {
|
||||
homonyms.for_each([&](Symbol homonym2) {
|
||||
if (get_conflict_status(homonym1, homonym2) & MatchesSameString) {
|
||||
LOG(
|
||||
"conflict between homonyms %s %s",
|
||||
token_name(homonym1).c_str(),
|
||||
token_name(homonym2).c_str()
|
||||
);
|
||||
homonyms.remove(homonym1);
|
||||
}
|
||||
return false;
|
||||
});
|
||||
return true;
|
||||
});
|
||||
|
||||
for (Symbol::Index j = 0, n = grammar.variables.size(); j < n; j++) {
|
||||
Symbol other_token = Symbol::terminal(j);
|
||||
if (other_token == word_rule || homonyms.contains(other_token)) continue;
|
||||
bool word_rule_shadows_other = get_conflict_status(other_token, word_rule);
|
||||
bool other_shadows_word_rule = get_conflict_status(word_rule, other_token);
|
||||
|
||||
if (word_rule_shadows_other || other_shadows_word_rule) {
|
||||
homonyms.for_each([&](Symbol homonym) {
|
||||
bool other_shadows_homonym = get_conflict_status(homonym, other_token);
|
||||
|
||||
bool word_rule_was_already_present = true;
|
||||
for (ParseStateId state_id : coincident_token_index.states_with(homonym, other_token)) {
|
||||
if (!parse_table->states[state_id].has_terminal_entry(word_rule)) {
|
||||
word_rule_was_already_present = false;
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (word_rule_was_already_present) return true;
|
||||
|
||||
if (word_rule_shadows_other) {
|
||||
homonyms.remove(homonym);
|
||||
LOG(
|
||||
"remove %s because word_rule would shadow %s",
|
||||
token_name(homonym).c_str(),
|
||||
token_name(other_token).c_str()
|
||||
);
|
||||
} else if (other_shadows_word_rule && !other_shadows_homonym) {
|
||||
homonyms.remove(homonym);
|
||||
LOG(
|
||||
"remove %s because %s would shadow word_rule",
|
||||
token_name(homonym).c_str(),
|
||||
token_name(other_token).c_str()
|
||||
);
|
||||
}
|
||||
return true;
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
if (!homonyms.empty()) {
|
||||
LOG_START("found keywords:");
|
||||
homonyms.for_each([&](Symbol homonym) {
|
||||
LOG("%s", token_name(homonym).c_str());
|
||||
return true;
|
||||
});
|
||||
LOG_END();
|
||||
keyword_symbols = homonyms;
|
||||
}
|
||||
}
|
||||
|
||||
BuildResult build() {
|
||||
|
|
@ -307,8 +274,8 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
for (ParseState &parse_state : parse_table->states) {
|
||||
LookaheadSet token_set;
|
||||
for (auto &entry : parse_state.terminal_entries) {
|
||||
if (keyword_capture_token.is_terminal() && keyword_symbols.contains(entry.first)) {
|
||||
token_set.insert(keyword_capture_token);
|
||||
if (word_rule.is_terminal() && keyword_symbols.contains(entry.first)) {
|
||||
token_set.insert(word_rule);
|
||||
} else {
|
||||
token_set.insert(entry.first);
|
||||
}
|
||||
|
|
@ -337,7 +304,19 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
|
||||
mark_fragile_tokens();
|
||||
remove_duplicate_lex_states(main_lex_table);
|
||||
return {main_lex_table, keyword_lex_table, keyword_capture_token};
|
||||
return {main_lex_table, keyword_lex_table, word_rule};
|
||||
}
|
||||
|
||||
bool does_token_shadow_other(Symbol token, Symbol shadowed_token) const {
|
||||
if (token == word_rule && keyword_symbols.contains(shadowed_token)) return false;
|
||||
return get_conflict_status(shadowed_token, token) & (
|
||||
MatchesShorterStringWithinSeparators |
|
||||
MatchesLongerStringWithValidNextChar
|
||||
);
|
||||
}
|
||||
|
||||
bool does_token_match_same_string_as_other(Symbol token, Symbol shadowed_token) const {
|
||||
return get_conflict_status(shadowed_token, token) & MatchesSameString;
|
||||
}
|
||||
|
||||
ConflictStatus get_conflict_status(Symbol shadowed_token, Symbol other_token) const {
|
||||
|
|
@ -410,12 +389,14 @@ class LexTableBuilderImpl : public LexTableBuilder {
|
|||
advance_symbol,
|
||||
MatchesLongerStringWithValidNextChar
|
||||
)) {
|
||||
LOG(
|
||||
"%s shadows %s followed by '%s'",
|
||||
token_name(advance_symbol).c_str(),
|
||||
token_name(accept_action.symbol).c_str(),
|
||||
log_char(*conflicting_following_chars.included_chars.begin())
|
||||
);
|
||||
if (!conflicting_following_chars.included_chars.empty()) {
|
||||
LOG(
|
||||
"%s shadows %s followed by '%s'",
|
||||
token_name(advance_symbol).c_str(),
|
||||
token_name(accept_action.symbol).c_str(),
|
||||
log_char(*conflicting_following_chars.included_chars.begin())
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@ -665,8 +646,12 @@ LexTableBuilder::BuildResult LexTableBuilder::build() {
|
|||
return static_cast<LexTableBuilderImpl *>(this)->build();
|
||||
}
|
||||
|
||||
ConflictStatus LexTableBuilder::get_conflict_status(Symbol a, Symbol b) const {
|
||||
return static_cast<const LexTableBuilderImpl *>(this)->get_conflict_status(a, b);
|
||||
bool LexTableBuilder::does_token_shadow_other(Symbol a, Symbol b) const {
|
||||
return static_cast<const LexTableBuilderImpl *>(this)->does_token_shadow_other(a, b);
|
||||
}
|
||||
|
||||
bool LexTableBuilder::does_token_match_same_string_as_other(Symbol a, Symbol b) const {
|
||||
return static_cast<const LexTableBuilderImpl *>(this)->does_token_match_same_string_as_other(a, b);
|
||||
}
|
||||
|
||||
} // namespace build_tables
|
||||
|
|
|
|||
|
|
@ -30,19 +30,6 @@ namespace build_tables {
|
|||
|
||||
class LookaheadSet;
|
||||
|
||||
enum ConflictStatus {
|
||||
DoesNotMatch = 0,
|
||||
MatchesShorterStringWithinSeparators = 1 << 0,
|
||||
MatchesSameString = 1 << 1,
|
||||
MatchesLongerString = 1 << 2,
|
||||
MatchesLongerStringWithValidNextChar = 1 << 3,
|
||||
CannotDistinguish = (
|
||||
MatchesShorterStringWithinSeparators |
|
||||
MatchesSameString |
|
||||
MatchesLongerStringWithValidNextChar
|
||||
),
|
||||
};
|
||||
|
||||
struct CoincidentTokenIndex {
|
||||
std::unordered_map<
|
||||
std::pair<rules::Symbol::Index, rules::Symbol::Index>,
|
||||
|
|
@ -69,7 +56,8 @@ class LexTableBuilder {
|
|||
|
||||
BuildResult build();
|
||||
|
||||
ConflictStatus get_conflict_status(rules::Symbol, rules::Symbol) const;
|
||||
bool does_token_shadow_other(rules::Symbol, rules::Symbol) const;
|
||||
bool does_token_match_same_string_as_other(rules::Symbol, rules::Symbol) const;
|
||||
|
||||
protected:
|
||||
LexTableBuilder() = default;
|
||||
|
|
|
|||
|
|
@ -134,11 +134,6 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
|
|||
}
|
||||
|
||||
void build_error_parse_state(ParseStateId state_id) {
|
||||
unsigned CannotMerge = (
|
||||
MatchesShorterStringWithinSeparators |
|
||||
MatchesLongerStringWithValidNextChar
|
||||
);
|
||||
|
||||
parse_table.states[state_id].terminal_entries.clear();
|
||||
|
||||
// First, identify the conflict-free tokens.
|
||||
|
|
@ -149,7 +144,7 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
|
|||
for (unsigned j = 0; j < lexical_grammar.variables.size(); j++) {
|
||||
Symbol other_token = Symbol::terminal(j);
|
||||
if (!coincident_token_index.contains(token, other_token) &&
|
||||
(lex_table_builder->get_conflict_status(other_token, token) & CannotMerge)) {
|
||||
lex_table_builder->does_token_shadow_other(token, other_token)) {
|
||||
conflicts_with_other_tokens = true;
|
||||
break;
|
||||
}
|
||||
|
|
@ -171,7 +166,7 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
|
|||
bool conflicts_with_other_tokens = false;
|
||||
conflict_free_tokens.for_each([&](Symbol other_token) {
|
||||
if (!coincident_token_index.contains(token, other_token) &&
|
||||
(lex_table_builder->get_conflict_status(other_token, token) & CannotMerge)) {
|
||||
lex_table_builder->does_token_shadow_other(token, other_token)) {
|
||||
LOG(
|
||||
"exclude %s: conflicts with %s",
|
||||
symbol_name(token).c_str(),
|
||||
|
|
@ -517,7 +512,8 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
|
|||
// Do not add a token if it conflicts with an existing token.
|
||||
if (!new_token.is_built_in()) {
|
||||
for (const auto &entry : state.terminal_entries) {
|
||||
if (lex_table_builder->get_conflict_status(entry.first, new_token) & CannotDistinguish) {
|
||||
if (lex_table_builder->does_token_shadow_other(new_token, entry.first) ||
|
||||
lex_table_builder->does_token_match_same_string_as_other(new_token, entry.first)) {
|
||||
LOG_IF(
|
||||
logged_conflict_tokens.insert({entry.first, new_token}).second,
|
||||
"cannot merge parse states due to token conflict: %s and %s",
|
||||
|
|
|
|||
|
|
@ -32,6 +32,7 @@ struct InputGrammar {
|
|||
std::vector<std::unordered_set<rules::NamedSymbol>> expected_conflicts;
|
||||
std::vector<rules::Rule> external_tokens;
|
||||
std::unordered_set<rules::NamedSymbol> variables_to_inline;
|
||||
rules::NamedSymbol word_rule;
|
||||
};
|
||||
|
||||
} // namespace tree_sitter
|
||||
|
|
|
|||
|
|
@ -1,4 +1,5 @@
|
|||
#include "compiler/log.h"
|
||||
#include <cassert>
|
||||
|
||||
static const char *SPACES = " ";
|
||||
|
||||
|
|
@ -21,6 +22,7 @@ void _indent_logs() {
|
|||
}
|
||||
|
||||
void _outdent_logs() {
|
||||
assert(_indent_level > 0);
|
||||
_indent_level--;
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -229,7 +229,9 @@ ParseGrammarResult parse_grammar(const string &input) {
|
|||
string error_message;
|
||||
string name;
|
||||
InputGrammar grammar;
|
||||
json_value name_json, rules_json, extras_json, conflicts_json, external_tokens_json, inline_rules_json;
|
||||
json_value
|
||||
name_json, rules_json, extras_json, conflicts_json, external_tokens_json,
|
||||
inline_rules_json, word_rule_json;
|
||||
|
||||
json_settings settings = { 0, json_enable_comments, 0, 0, 0, 0 };
|
||||
char parse_error[json_error_max];
|
||||
|
|
@ -359,6 +361,16 @@ ParseGrammarResult parse_grammar(const string &input) {
|
|||
}
|
||||
}
|
||||
|
||||
word_rule_json = grammar_json->operator[]("word");
|
||||
if (word_rule_json.type != json_none) {
|
||||
if (word_rule_json.type != json_string) {
|
||||
error_message = "Invalid word property";
|
||||
goto error;
|
||||
}
|
||||
|
||||
grammar.word_rule = NamedSymbol { word_rule_json.u.string.ptr };
|
||||
}
|
||||
|
||||
json_value_free(grammar_json);
|
||||
return { name, grammar, "" };
|
||||
|
||||
|
|
|
|||
|
|
@ -106,6 +106,7 @@ InitialSyntaxGrammar expand_repeats(const InitialSyntaxGrammar &grammar) {
|
|||
expander.aux_rules.end()
|
||||
);
|
||||
|
||||
result.word_rule = grammar.word_rule;
|
||||
return result;
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -329,6 +329,18 @@ tuple<InitialSyntaxGrammar, LexicalGrammar, CompileError> extract_tokens(
|
|||
}
|
||||
}
|
||||
|
||||
syntax_grammar.word_rule = symbol_replacer.replace_symbol(grammar.word_rule);
|
||||
if (syntax_grammar.word_rule.is_non_terminal()) {
|
||||
return make_tuple(
|
||||
syntax_grammar,
|
||||
lexical_grammar,
|
||||
CompileError(
|
||||
TSCompileErrorTypeInvalidWordRule,
|
||||
"Word rules must be tokens"
|
||||
)
|
||||
);
|
||||
}
|
||||
|
||||
return make_tuple(syntax_grammar, lexical_grammar, CompileError::none());
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -161,6 +161,8 @@ pair<SyntaxGrammar, CompileError> flatten_grammar(const InitialSyntaxGrammar &gr
|
|||
i++;
|
||||
}
|
||||
|
||||
result.word_rule = grammar.word_rule;
|
||||
|
||||
return {result, CompileError::none()};
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -17,6 +17,7 @@ struct InitialSyntaxGrammar {
|
|||
std::set<std::set<rules::Symbol>> expected_conflicts;
|
||||
std::vector<ExternalToken> external_tokens;
|
||||
std::set<rules::Symbol> variables_to_inline;
|
||||
rules::Symbol word_rule;
|
||||
};
|
||||
|
||||
} // namespace prepare_grammar
|
||||
|
|
|
|||
|
|
@ -166,6 +166,8 @@ pair<InternedGrammar, CompileError> intern_symbols(const InputGrammar &grammar)
|
|||
}
|
||||
}
|
||||
|
||||
result.word_rule = interner.intern_symbol(grammar.word_rule);
|
||||
|
||||
return {result, CompileError::none()};
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -15,8 +15,8 @@ struct InternedGrammar {
|
|||
std::vector<rules::Rule> extra_tokens;
|
||||
std::set<std::set<rules::Symbol>> expected_conflicts;
|
||||
std::vector<Variable> external_tokens;
|
||||
std::set<rules::Symbol> blank_external_tokens;
|
||||
std::set<rules::Symbol> variables_to_inline;
|
||||
rules::Symbol word_rule;
|
||||
};
|
||||
|
||||
} // namespace prepare_grammar
|
||||
|
|
|
|||
|
|
@ -60,6 +60,7 @@ struct SyntaxGrammar {
|
|||
std::set<std::set<rules::Symbol>> expected_conflicts;
|
||||
std::vector<ExternalToken> external_tokens;
|
||||
std::set<rules::Symbol> variables_to_inline;
|
||||
rules::Symbol word_rule;
|
||||
};
|
||||
|
||||
} // namespace tree_sitter
|
||||
|
|
|
|||
|
|
@ -110,7 +110,7 @@ static inline void array__grow(VoidArray *self, size_t element_size) {
|
|||
|
||||
static inline void array__splice(VoidArray *self, size_t element_size,
|
||||
uint32_t index, uint32_t old_count,
|
||||
uint32_t new_count, void *elements) {
|
||||
uint32_t new_count, const void *elements) {
|
||||
uint32_t new_size = self->size + new_count - old_count;
|
||||
uint32_t old_end = index + old_count;
|
||||
uint32_t new_end = index + new_count;
|
||||
|
|
|
|||
|
|
@ -28,11 +28,11 @@ static inline TSNode ts_node__null() {
|
|||
|
||||
// TSNode - accessors
|
||||
|
||||
uint32_t ts_node_start_byte(const TSNode self) {
|
||||
uint32_t ts_node_start_byte(TSNode self) {
|
||||
return self.context[0];
|
||||
}
|
||||
|
||||
TSPoint ts_node_start_point(const TSNode self) {
|
||||
TSPoint ts_node_start_point(TSNode self) {
|
||||
return (TSPoint) {self.context[1], self.context[2]};
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -59,7 +59,7 @@ bool ts_external_scanner_state_eq(const ExternalScannerState *a, const ExternalS
|
|||
// SubtreeArray
|
||||
|
||||
bool ts_subtree_array_copy(SubtreeArray self, SubtreeArray *dest) {
|
||||
const Subtree **contents = NULL;
|
||||
Subtree **contents = NULL;
|
||||
if (self.capacity > 0) {
|
||||
contents = ts_calloc(self.capacity, sizeof(Subtree *));
|
||||
memcpy(contents, self.contents, self.size * sizeof(Subtree *));
|
||||
|
|
|
|||
|
|
@ -25,7 +25,8 @@ describe("ParseItemSetBuilder", []() {
|
|||
LexicalGrammar lexical_grammar{lexical_variables, {}};
|
||||
|
||||
it("adds items at the beginnings of referenced rules", [&]() {
|
||||
SyntaxGrammar grammar{{
|
||||
SyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
SyntaxVariable{"rule0", VariableTypeNamed, {
|
||||
Production({
|
||||
{Symbol::non_terminal(1), 0, AssociativityNone, Alias{}},
|
||||
|
|
@ -47,7 +48,7 @@ describe("ParseItemSetBuilder", []() {
|
|||
{Symbol::terminal(15), 0, AssociativityNone, Alias{}},
|
||||
}, 0)
|
||||
}},
|
||||
}, {}, {}, {}, {}};
|
||||
};
|
||||
|
||||
auto production = [&](int variable_index, int production_index) -> const Production & {
|
||||
return grammar.variables[variable_index].productions[production_index];
|
||||
|
|
@ -84,7 +85,8 @@ describe("ParseItemSetBuilder", []() {
|
|||
});
|
||||
|
||||
it("handles rules with empty productions", [&]() {
|
||||
SyntaxGrammar grammar{{
|
||||
SyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
SyntaxVariable{"rule0", VariableTypeNamed, {
|
||||
Production({
|
||||
{Symbol::non_terminal(1), 0, AssociativityNone, Alias{}},
|
||||
|
|
@ -98,7 +100,7 @@ describe("ParseItemSetBuilder", []() {
|
|||
}, 0),
|
||||
Production{{}, 0}
|
||||
}},
|
||||
}, {}, {}, {}, {}};
|
||||
};
|
||||
|
||||
auto production = [&](int variable_index, int production_index) -> const Production & {
|
||||
return grammar.variables[variable_index].productions[production_index];
|
||||
|
|
|
|||
|
|
@ -11,11 +11,9 @@ START_TEST
|
|||
|
||||
describe("expand_repeats", []() {
|
||||
it("replaces repeat rules with pairs of recursive rules", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(0)}},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(0)}},
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
@ -30,14 +28,12 @@ describe("expand_repeats", []() {
|
|||
});
|
||||
|
||||
it("replaces repeats inside of sequences", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Rule::seq({
|
||||
Symbol::terminal(10),
|
||||
Repeat{Symbol::terminal(11)},
|
||||
})},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
Variable{"rule0", VariableTypeNamed, Rule::seq({
|
||||
Symbol::terminal(10),
|
||||
Repeat{Symbol::terminal(11)},
|
||||
})},
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
@ -55,14 +51,12 @@ describe("expand_repeats", []() {
|
|||
});
|
||||
|
||||
it("replaces repeats inside of choices", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Rule::choice({
|
||||
Symbol::terminal(10),
|
||||
Repeat{Symbol::terminal(11)}
|
||||
})},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
Variable{"rule0", VariableTypeNamed, Rule::choice({
|
||||
Symbol::terminal(10),
|
||||
Repeat{Symbol::terminal(11)}
|
||||
})},
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
@ -80,18 +74,16 @@ describe("expand_repeats", []() {
|
|||
});
|
||||
|
||||
it("does not create redundant auxiliary rules", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Rule::choice({
|
||||
Rule::seq({ Symbol::terminal(1), Repeat{Symbol::terminal(4)} }),
|
||||
Rule::seq({ Symbol::terminal(2), Repeat{Symbol::terminal(4)} }),
|
||||
})},
|
||||
Variable{"rule1", VariableTypeNamed, Rule::seq({
|
||||
Symbol::terminal(3),
|
||||
Repeat{Symbol::terminal(4)}
|
||||
})},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
Variable{"rule0", VariableTypeNamed, Rule::choice({
|
||||
Rule::seq({ Symbol::terminal(1), Repeat{Symbol::terminal(4)} }),
|
||||
Rule::seq({ Symbol::terminal(2), Repeat{Symbol::terminal(4)} }),
|
||||
})},
|
||||
Variable{"rule1", VariableTypeNamed, Rule::seq({
|
||||
Symbol::terminal(3),
|
||||
Repeat{Symbol::terminal(4)}
|
||||
})},
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
@ -113,14 +105,14 @@ describe("expand_repeats", []() {
|
|||
});
|
||||
|
||||
it("can replace multiple repeats in the same rule", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Rule::seq({
|
||||
Repeat{Symbol::terminal(10)},
|
||||
Repeat{Symbol::terminal(11)},
|
||||
})},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
}
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
@ -142,12 +134,10 @@ describe("expand_repeats", []() {
|
|||
});
|
||||
|
||||
it("can replace repeats in multiple rules", [&]() {
|
||||
InitialSyntaxGrammar grammar{
|
||||
{
|
||||
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(10)}},
|
||||
Variable{"rule1", VariableTypeNamed, Repeat{Symbol::terminal(11)}},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InitialSyntaxGrammar grammar;
|
||||
grammar.variables = {
|
||||
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(10)}},
|
||||
Variable{"rule1", VariableTypeNamed, Repeat{Symbol::terminal(11)}},
|
||||
};
|
||||
|
||||
auto result = expand_repeats(grammar);
|
||||
|
|
|
|||
|
|
@ -11,13 +11,11 @@ using prepare_grammar::intern_symbols;
|
|||
|
||||
describe("intern_symbols", []() {
|
||||
it("replaces named symbols with numerically-indexed symbols", [&]() {
|
||||
InputGrammar grammar{
|
||||
{
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"_z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"_z"}},
|
||||
{"_z", VariableTypeNamed, String{"stuff"}}
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InputGrammar grammar;
|
||||
grammar.variables = {
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"_z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"_z"}},
|
||||
{"_z", VariableTypeNamed, String{"stuff"}}
|
||||
};
|
||||
|
||||
auto result = intern_symbols(grammar);
|
||||
|
|
@ -32,11 +30,9 @@ describe("intern_symbols", []() {
|
|||
|
||||
describe("when there are symbols that reference undefined rules", [&]() {
|
||||
it("returns an error", []() {
|
||||
InputGrammar grammar{
|
||||
{
|
||||
{"x", VariableTypeNamed, NamedSymbol{"y"}},
|
||||
},
|
||||
{}, {}, {}, {}
|
||||
InputGrammar grammar;
|
||||
grammar.variables = {
|
||||
{"x", VariableTypeNamed, NamedSymbol{"y"}},
|
||||
};
|
||||
|
||||
auto result = intern_symbols(grammar);
|
||||
|
|
@ -46,16 +42,14 @@ describe("intern_symbols", []() {
|
|||
});
|
||||
|
||||
it("translates the grammar's optional 'extra_tokens' to numerical symbols", [&]() {
|
||||
InputGrammar grammar{
|
||||
{
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"z"}},
|
||||
{"z", VariableTypeNamed, String{"stuff"}}
|
||||
},
|
||||
{
|
||||
NamedSymbol{"z"}
|
||||
},
|
||||
{}, {}, {}
|
||||
InputGrammar grammar;
|
||||
grammar.variables = {
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"z"}},
|
||||
{"z", VariableTypeNamed, String{"stuff"}}
|
||||
};
|
||||
grammar.extra_tokens = {
|
||||
NamedSymbol{"z"}
|
||||
};
|
||||
|
||||
auto result = intern_symbols(grammar);
|
||||
|
|
@ -66,19 +60,15 @@ describe("intern_symbols", []() {
|
|||
});
|
||||
|
||||
it("records any rule names that match external token names", [&]() {
|
||||
InputGrammar grammar{
|
||||
{
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"z"}},
|
||||
{"z", VariableTypeNamed, String{"stuff"}},
|
||||
},
|
||||
{},
|
||||
{},
|
||||
{
|
||||
NamedSymbol{"w"},
|
||||
NamedSymbol{"z"},
|
||||
},
|
||||
{}
|
||||
InputGrammar grammar;
|
||||
grammar.variables = {
|
||||
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
|
||||
{"y", VariableTypeNamed, NamedSymbol{"z"}},
|
||||
{"z", VariableTypeNamed, String{"stuff"}},
|
||||
};
|
||||
grammar.external_tokens = {
|
||||
NamedSymbol{"w"},
|
||||
NamedSymbol{"z"},
|
||||
};
|
||||
|
||||
auto result = intern_symbols(grammar);
|
||||
|
|
|
|||