Merge pull request #176 from tree-sitter/explicit-word-token

Perform keyword optimization using explicitly selected word token
Max Brunsfeld 2018-06-14 13:19:11 -07:00 committed by GitHub
commit 245052442a
22 changed files with 305 additions and 247 deletions


@ -32,6 +32,7 @@
<div id="current-page-table-of-contents">
{% capture whitespace %}
{% assign min_header = 2 %}
{% assign max_header = 3 %}
{% assign nodes = content | split: "<h" %}
{% assign first_header = true %}
{% for node in nodes %}
@ -41,7 +42,7 @@
{% assign header_level = node | replace: '"', '' | slice: 0, 1 | times: 1 %}
{% if header_level < min_header or header_level > maxHeader %}
{% if header_level < min_header or header_level > max_header %}
{% continue %}
{% endif %}
@ -127,7 +128,7 @@
}
});
$('h1, h2, h3, h4, h5, h6').filter('[id]').each(function() {
$('h1, h2, h3').filter('[id]').each(function() {
$(this).html('<a href="#'+$(this).attr('id')+'">' + $(this).text() + '</a>');
});
</script>


@ -211,12 +211,13 @@ The following is a complete list of built-in functions you can use to define Tre
* **Tokens : `token(rule)`** - This function marks the given rule as producing only a single token. Tree-sitter's default is to treat each String or RegExp literal in the grammar as a separate token. Each token is matched separately by the lexer and returned as its own leaf node in the tree. The `token` function allows you to express a complex rule using the functions described above (rather than as a single regular expression) but still have Tree-sitter treat it as a single token.
* **Aliases : `alias(rule, name)`** - This function causes the given rule to *appear* with an alternative name in the syntax tree. It is useful in cases where a language construct needs to be parsed differently in different contexts (and thus needs to be defined using multiple symbols), but should always *appear* as the same type of node.
In addition to the `name` and `rules` fields, grammars have a few other public fields that influence the behavior of the parser.
In addition to the `name` and `rules` fields, grammars have a few other optional public fields that influence the behavior of the parser.
* `extras` - an array of tokens that may appear *anywhere* in the language. This is often used for whitespace and comments. The default for `extras` in `tree-sitter-cli` is to accept whitespace. To control whitespace explicitly, specify `extras: $ => []` in the grammar.
* `inline` - an array of rule names that should be automatically *removed* from the grammar by replacing all of their usages with a copy of their definition. This is useful for rules that are used in multiple places but for which you *don't* want to create syntax tree nodes at runtime.
* `conflicts` - an array of arrays of rule names. Each inner array represents a set of rules that's involved in an *LR(1) conflict* that is *intended to exist* in the grammar. When these conflicts occur at runtime, Tree-sitter will use the GLR algorithm to explore all of the possible interpretations. If *multiple* parses end up succeeding, Tree-sitter will pick the subtree rule with the highest *dynamic precedence*.
* `externals` - an array of token names which can be returned by an *external scanner*. External scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules (e.g. Python's indentation tokens) that cannot be described by regular expressions.
* `word` - the name of a token that will match keywords for the purpose of the [keyword extraction](#keyword-extraction) optimization.
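Taken together, these optional fields can be sketched in a single grammar definition. This is an illustrative sketch only; the rule names used here (`comment`, `_statement`, `type_name`, `variable_name`, `indent`, `dedent`) are hypothetical:

```js
grammar({
  name: 'my_language',

  // Whitespace and comments may appear anywhere in the language.
  extras: $ => [/\s/, $.comment],

  // Replace every use of `_statement` with a copy of its definition.
  inline: $ => [$._statement],

  // An LR(1) conflict that is intended to exist, resolved at runtime via GLR.
  conflicts: $ => [[$.type_name, $.variable_name]],

  // Tokens produced by a custom external scanner written in C.
  externals: $ => [$.indent, $.dedent],

  // The token used for the keyword-extraction optimization.
  word: $ => $.identifier,

  rules: {
    // ...
  }
})
```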
## Adjusting existing grammars
@ -355,11 +356,81 @@ For an expression like `a * b * c`, it's not clear whether we mean `a * (b * c)`
You may have noticed in the above examples that some of the grammar rule names like `_expression` and `_type` began with an underscore. Starting a rule's name with an underscore causes the rule to be *hidden* in the syntax tree. This is useful for rules like `_expression` in the grammars above, which always just wrap a single child node. If these nodes were not hidden, they would add substantial depth and noise to the syntax tree without making it any easier to understand.
## Dealing with LR conflicts
### Dealing with LR conflicts
TODO
...
## Lexical Analysis
Tree-sitter's parsing process is divided into two phases: parsing (which is described above) and [lexing](lexing) - the process of grouping individual characters into the language's fundamental *tokens*. There are a few important things to know about how Tree-sitter's lexing works.
### Conflict Resolution
Grammars often contain multiple tokens that can match the same characters. For example, a grammar might contain the tokens `"if"` and `/[a-z]+/`. Tree-sitter differentiates between these conflicting tokens in a few ways:
1. **Context-aware lexing** - Tree-sitter performs lexing on-demand, during the parsing process. At any given position in a source document, the lexer only tries to recognize tokens that are *valid* at that position in the document.
2. **Longest-match** - If multiple valid tokens match the characters at a given position in a document, Tree-sitter will select the token that matches the [longest sequence of characters](longest-match).
3. **Lexical Precedence** - When the precedence functions described [above](#using-the-grammar-dsl) are used within the `token` function, the given precedence values serve as instructions to the lexer. If there are two valid tokens that match the same sequence of characters, Tree-sitter will select the one with the higher precedence.
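For example, lexical precedence can break a tie between two tokens that match the same string. In this hypothetical sketch, an input like `beef` matches both `identifier` and `hex_digits` with the same length, and the precedence given inside `token` makes the lexer prefer `identifier`:

```js
grammar({
  name: 'example',
  rules: {
    // ...
    // Both tokens can match the string 'beef' with the same length;
    // the lexical precedence of 1 makes the lexer prefer `identifier`.
    identifier: $ => token(prec(1, /[a-z]+/)),
    hex_digits: $ => /[0-9a-f]+/
  }
})
```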
### Keywords
If your language has keywords that are matched by a more general rule (typically `identifier`), you can tell Tree-sitter about them with your grammar's `word` property.
```js
grammar({
word: $ => $.identifier,
rules: {
class_declaration: $ => seq(
'class',
$.identifier,
$.class_body
),
break_statement: $ => seq('break', ';'),
continue_statement: $ => seq('continue', ';'),
identifier: $ => /[a-z]+/
}
})
```
In this case, we're specifying `identifier` as our `word`. Tree-sitter will automatically find the set of terminals which are matched by `$.identifier`, and consider them keywords. Instead of generating a parser which scans for each keyword individually, Tree-sitter will generate a parser that tries to match the word rule (in this case, `identifier`), and checks to see if the matched word is the necessary keyword.
This makes the set of parse states smaller, so the parser compiles faster.
It *also changes behavior*. Consider this grammar:
```js
grammar({
rules: {
import: $ => seq(
'import',
$.identifier,
'as',
$.identifier
),
identifier: $ => /[a-z]+/
}
})
```
Without the `word` directive, the grammar matches this input:
```
import foo asbar
```
Which is probably not what you want. If we add `word: $ => $.identifier`, this input will no longer parse. When we try to parse `'as'`, we will first lex a whole word, which will be the identifier `'asbar'`, and then compare it to `'as'`, correctly generating an error.
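Adding the directive to the grammar above gives:

```js
grammar({
  word: $ => $.identifier,

  rules: {
    import: $ => seq(
      'import',
      $.identifier,
      'as',
      $.identifier
    ),
    identifier: $ => /[a-z]+/
  }
})
```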
[lexing]: https://en.wikipedia.org/wiki/Lexical_analysis
[longest-match]: https://en.wikipedia.org/wiki/Maximal_munch
[cst]: https://en.wikipedia.org/wiki/Parse_tree
[dfa]: https://en.wikipedia.org/wiki/Deterministic_finite_automaton
[non-terminal]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols
[language-spec]: https://en.wikipedia.org/wiki/Programming_language_specification
[glr-parsing]: https://en.wikipedia.org/wiki/GLR_parser


@ -19,6 +19,7 @@ typedef enum {
TSCompileErrorTypeEpsilonRule,
TSCompileErrorTypeInvalidTokenContents,
TSCompileErrorTypeInvalidRuleName,
TSCompileErrorTypeInvalidWordRule,
} TSCompileErrorType;
typedef struct {


@ -49,6 +49,19 @@ using rules::Symbol;
using rules::Metadata;
using rules::Seq;
enum ConflictStatus {
DoesNotMatch = 0,
MatchesShorterStringWithinSeparators = 1 << 0,
MatchesSameString = 1 << 1,
MatchesLongerString = 1 << 2,
MatchesLongerStringWithValidNextChar = 1 << 3,
CannotDistinguish = (
MatchesShorterStringWithinSeparators |
MatchesSameString |
MatchesLongerStringWithValidNextChar
),
};
static const std::unordered_set<ParseStateId> EMPTY;
bool CoincidentTokenIndex::contains(Symbol a, Symbol b) const {
@ -65,14 +78,12 @@ const std::unordered_set<ParseStateId> &CoincidentTokenIndex::states_with(Symbol
}
}
template <bool include_all>
class CharacterAggregator {
class StartingCharacterAggregator {
public:
void apply(const Rule &rule) {
rule.match(
[this](const Seq &sequence) {
apply(*sequence.left);
if (include_all) apply(*sequence.right);
},
[this](const rules::Choice &rule) {
@ -91,9 +102,6 @@ class CharacterAggregator {
CharacterSet result;
};
using StartingCharacterAggregator = CharacterAggregator<false>;
using AllCharacterAggregator = CharacterAggregator<true>;
class LexTableBuilderImpl : public LexTableBuilder {
LexTable main_lex_table;
LexTable keyword_lex_table;
@ -109,7 +117,7 @@ class LexTableBuilderImpl : public LexTableBuilder {
vector<ConflictStatus> conflict_matrix;
bool conflict_detection_mode;
LookaheadSet keyword_symbols;
Symbol keyword_capture_token;
Symbol word_rule;
char encoding_buffer[8];
public:
@ -125,7 +133,7 @@ class LexTableBuilderImpl : public LexTableBuilder {
parse_table(parse_table),
conflict_matrix(lexical_grammar.variables.size() * lexical_grammar.variables.size(), DoesNotMatch),
conflict_detection_mode(false),
keyword_capture_token(rules::NONE()) {
word_rule(syntax_grammar.word_rule) {
// Compute the possible separator rules and the set of separator characters that can occur
// immediately after any token.
@ -141,7 +149,6 @@ class LexTableBuilderImpl : public LexTableBuilder {
// characters that can follow each token. Also identify all of the tokens that can be
// considered 'keywords'.
LOG_START("characterizing tokens");
LookaheadSet potential_keyword_symbols;
for (unsigned i = 0, n = grammar.variables.size(); i < n; i++) {
Symbol token = Symbol::terminal(i);
@ -158,31 +165,6 @@ class LexTableBuilderImpl : public LexTableBuilder {
});
}
following_characters_by_token[i] = following_character_aggregator.result;
AllCharacterAggregator all_character_aggregator;
all_character_aggregator.apply(grammar.variables[i].rule);
if (
!starting_character_aggregator.result.includes_all &&
!all_character_aggregator.result.includes_all
) {
bool starts_alpha = true, all_alnum = true;
for (auto character : starting_character_aggregator.result.included_chars) {
if (!iswalpha(character) && character != '_') {
starts_alpha = false;
}
}
for (auto character : all_character_aggregator.result.included_chars) {
if (!iswalnum(character) && character != '_') {
all_alnum = false;
}
}
if (starts_alpha && all_alnum) {
LOG("potential keyword: %s", token_name(token).c_str());
potential_keyword_symbols.insert(token);
}
}
}
LOG_END();
@ -205,98 +187,83 @@ class LexTableBuilderImpl : public LexTableBuilder {
}
LOG_END();
LOG_START("finding keyword capture token");
for (Symbol::Index i = 0, n = grammar.variables.size(); i < n; i++) {
Symbol candidate = Symbol::terminal(i);
if (word_rule != rules::NONE()) {
identify_keywords();
}
}
LookaheadSet homonyms;
potential_keyword_symbols.for_each([&](Symbol other_token) {
if (get_conflict_status(other_token, candidate) & MatchesShorterStringWithinSeparators) {
homonyms.clear();
return false;
}
if (get_conflict_status(candidate, other_token) == MatchesSameString) {
homonyms.insert(other_token);
}
return true;
});
if (homonyms.empty()) continue;
LOG_START(
"keyword capture token candidate: %s, homonym count: %lu",
token_name(candidate).c_str(),
homonyms.size()
);
homonyms.for_each([&](Symbol homonym1) {
homonyms.for_each([&](Symbol homonym2) {
if (get_conflict_status(homonym1, homonym2) & MatchesSameString) {
LOG(
"conflict between homonyms %s %s",
token_name(homonym1).c_str(),
token_name(homonym2).c_str()
);
homonyms.remove(homonym1);
}
return false;
});
return true;
});
for (Symbol::Index j = 0; j < n; j++) {
Symbol other_token = Symbol::terminal(j);
if (other_token == candidate || homonyms.contains(other_token)) continue;
bool candidate_shadows_other = get_conflict_status(other_token, candidate);
bool other_shadows_candidate = get_conflict_status(candidate, other_token);
if (candidate_shadows_other || other_shadows_candidate) {
homonyms.for_each([&](Symbol homonym) {
bool other_shadows_homonym = get_conflict_status(homonym, other_token);
bool candidate_was_already_present = true;
for (ParseStateId state_id : coincident_token_index.states_with(homonym, other_token)) {
if (!parse_table->states[state_id].has_terminal_entry(candidate)) {
candidate_was_already_present = false;
break;
}
}
if (candidate_was_already_present) return true;
if (candidate_shadows_other) {
homonyms.remove(homonym);
LOG(
"remove %s because candidate would shadow %s",
token_name(homonym).c_str(),
token_name(other_token).c_str()
);
} else if (other_shadows_candidate && !other_shadows_homonym) {
homonyms.remove(homonym);
LOG(
"remove %s because %s would shadow candidate",
token_name(homonym).c_str(),
token_name(other_token).c_str()
);
}
return true;
});
}
void identify_keywords() {
LookaheadSet homonyms;
for (Symbol::Index j = 0, n = grammar.variables.size(); j < n; j++) {
Symbol other_token = Symbol::terminal(j);
if (get_conflict_status(word_rule, other_token) == MatchesSameString) {
homonyms.insert(other_token);
}
if (homonyms.size() > keyword_symbols.size()) {
LOG_START("found capture token. homonyms:");
homonyms.for_each([&](Symbol homonym) {
LOG("%s", token_name(homonym).c_str());
return true;
});
LOG_END();
keyword_symbols = homonyms;
keyword_capture_token = candidate;
}
LOG_END();
}
LOG_END();
homonyms.for_each([&](Symbol homonym1) {
homonyms.for_each([&](Symbol homonym2) {
if (get_conflict_status(homonym1, homonym2) & MatchesSameString) {
LOG(
"conflict between homonyms %s %s",
token_name(homonym1).c_str(),
token_name(homonym2).c_str()
);
homonyms.remove(homonym1);
}
return false;
});
return true;
});
for (Symbol::Index j = 0, n = grammar.variables.size(); j < n; j++) {
Symbol other_token = Symbol::terminal(j);
if (other_token == word_rule || homonyms.contains(other_token)) continue;
bool word_rule_shadows_other = get_conflict_status(other_token, word_rule);
bool other_shadows_word_rule = get_conflict_status(word_rule, other_token);
if (word_rule_shadows_other || other_shadows_word_rule) {
homonyms.for_each([&](Symbol homonym) {
bool other_shadows_homonym = get_conflict_status(homonym, other_token);
bool word_rule_was_already_present = true;
for (ParseStateId state_id : coincident_token_index.states_with(homonym, other_token)) {
if (!parse_table->states[state_id].has_terminal_entry(word_rule)) {
word_rule_was_already_present = false;
break;
}
}
if (word_rule_was_already_present) return true;
if (word_rule_shadows_other) {
homonyms.remove(homonym);
LOG(
"remove %s because word_rule would shadow %s",
token_name(homonym).c_str(),
token_name(other_token).c_str()
);
} else if (other_shadows_word_rule && !other_shadows_homonym) {
homonyms.remove(homonym);
LOG(
"remove %s because %s would shadow word_rule",
token_name(homonym).c_str(),
token_name(other_token).c_str()
);
}
return true;
});
}
}
if (!homonyms.empty()) {
LOG_START("found keywords:");
homonyms.for_each([&](Symbol homonym) {
LOG("%s", token_name(homonym).c_str());
return true;
});
LOG_END();
keyword_symbols = homonyms;
}
}
BuildResult build() {
@ -307,8 +274,8 @@ class LexTableBuilderImpl : public LexTableBuilder {
for (ParseState &parse_state : parse_table->states) {
LookaheadSet token_set;
for (auto &entry : parse_state.terminal_entries) {
if (keyword_capture_token.is_terminal() && keyword_symbols.contains(entry.first)) {
token_set.insert(keyword_capture_token);
if (word_rule.is_terminal() && keyword_symbols.contains(entry.first)) {
token_set.insert(word_rule);
} else {
token_set.insert(entry.first);
}
@ -337,7 +304,19 @@ class LexTableBuilderImpl : public LexTableBuilder {
mark_fragile_tokens();
remove_duplicate_lex_states(main_lex_table);
return {main_lex_table, keyword_lex_table, keyword_capture_token};
return {main_lex_table, keyword_lex_table, word_rule};
}
bool does_token_shadow_other(Symbol token, Symbol shadowed_token) const {
if (token == word_rule && keyword_symbols.contains(shadowed_token)) return false;
return get_conflict_status(shadowed_token, token) & (
MatchesShorterStringWithinSeparators |
MatchesLongerStringWithValidNextChar
);
}
bool does_token_match_same_string_as_other(Symbol token, Symbol shadowed_token) const {
return get_conflict_status(shadowed_token, token) & MatchesSameString;
}
ConflictStatus get_conflict_status(Symbol shadowed_token, Symbol other_token) const {
@ -410,12 +389,14 @@ class LexTableBuilderImpl : public LexTableBuilder {
advance_symbol,
MatchesLongerStringWithValidNextChar
)) {
LOG(
"%s shadows %s followed by '%s'",
token_name(advance_symbol).c_str(),
token_name(accept_action.symbol).c_str(),
log_char(*conflicting_following_chars.included_chars.begin())
);
if (!conflicting_following_chars.included_chars.empty()) {
LOG(
"%s shadows %s followed by '%s'",
token_name(advance_symbol).c_str(),
token_name(accept_action.symbol).c_str(),
log_char(*conflicting_following_chars.included_chars.begin())
);
}
}
}
}
@ -665,8 +646,12 @@ LexTableBuilder::BuildResult LexTableBuilder::build() {
return static_cast<LexTableBuilderImpl *>(this)->build();
}
ConflictStatus LexTableBuilder::get_conflict_status(Symbol a, Symbol b) const {
return static_cast<const LexTableBuilderImpl *>(this)->get_conflict_status(a, b);
bool LexTableBuilder::does_token_shadow_other(Symbol a, Symbol b) const {
return static_cast<const LexTableBuilderImpl *>(this)->does_token_shadow_other(a, b);
}
bool LexTableBuilder::does_token_match_same_string_as_other(Symbol a, Symbol b) const {
return static_cast<const LexTableBuilderImpl *>(this)->does_token_match_same_string_as_other(a, b);
}
} // namespace build_tables


@ -30,19 +30,6 @@ namespace build_tables {
class LookaheadSet;
enum ConflictStatus {
DoesNotMatch = 0,
MatchesShorterStringWithinSeparators = 1 << 0,
MatchesSameString = 1 << 1,
MatchesLongerString = 1 << 2,
MatchesLongerStringWithValidNextChar = 1 << 3,
CannotDistinguish = (
MatchesShorterStringWithinSeparators |
MatchesSameString |
MatchesLongerStringWithValidNextChar
),
};
struct CoincidentTokenIndex {
std::unordered_map<
std::pair<rules::Symbol::Index, rules::Symbol::Index>,
@ -69,7 +56,8 @@ class LexTableBuilder {
BuildResult build();
ConflictStatus get_conflict_status(rules::Symbol, rules::Symbol) const;
bool does_token_shadow_other(rules::Symbol, rules::Symbol) const;
bool does_token_match_same_string_as_other(rules::Symbol, rules::Symbol) const;
protected:
LexTableBuilder() = default;


@ -134,11 +134,6 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
}
void build_error_parse_state(ParseStateId state_id) {
unsigned CannotMerge = (
MatchesShorterStringWithinSeparators |
MatchesLongerStringWithValidNextChar
);
parse_table.states[state_id].terminal_entries.clear();
// First, identify the conflict-free tokens.
@ -149,7 +144,7 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
for (unsigned j = 0; j < lexical_grammar.variables.size(); j++) {
Symbol other_token = Symbol::terminal(j);
if (!coincident_token_index.contains(token, other_token) &&
(lex_table_builder->get_conflict_status(other_token, token) & CannotMerge)) {
lex_table_builder->does_token_shadow_other(token, other_token)) {
conflicts_with_other_tokens = true;
break;
}
@ -171,7 +166,7 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
bool conflicts_with_other_tokens = false;
conflict_free_tokens.for_each([&](Symbol other_token) {
if (!coincident_token_index.contains(token, other_token) &&
(lex_table_builder->get_conflict_status(other_token, token) & CannotMerge)) {
lex_table_builder->does_token_shadow_other(token, other_token)) {
LOG(
"exclude %s: conflicts with %s",
symbol_name(token).c_str(),
@ -517,7 +512,8 @@ class ParseTableBuilderImpl : public ParseTableBuilder {
// Do not add a token if it conflicts with an existing token.
if (!new_token.is_built_in()) {
for (const auto &entry : state.terminal_entries) {
if (lex_table_builder->get_conflict_status(entry.first, new_token) & CannotDistinguish) {
if (lex_table_builder->does_token_shadow_other(new_token, entry.first) ||
lex_table_builder->does_token_match_same_string_as_other(new_token, entry.first)) {
LOG_IF(
logged_conflict_tokens.insert({entry.first, new_token}).second,
"cannot merge parse states due to token conflict: %s and %s",


@ -32,6 +32,7 @@ struct InputGrammar {
std::vector<std::unordered_set<rules::NamedSymbol>> expected_conflicts;
std::vector<rules::Rule> external_tokens;
std::unordered_set<rules::NamedSymbol> variables_to_inline;
rules::NamedSymbol word_rule;
};
} // namespace tree_sitter


@ -1,4 +1,5 @@
#include "compiler/log.h"
#include <cassert>
static const char *SPACES = " ";
@ -21,6 +22,7 @@ void _indent_logs() {
}
void _outdent_logs() {
assert(_indent_level > 0);
_indent_level--;
}


@ -229,7 +229,9 @@ ParseGrammarResult parse_grammar(const string &input) {
string error_message;
string name;
InputGrammar grammar;
json_value name_json, rules_json, extras_json, conflicts_json, external_tokens_json, inline_rules_json;
json_value
name_json, rules_json, extras_json, conflicts_json, external_tokens_json,
inline_rules_json, word_rule_json;
json_settings settings = { 0, json_enable_comments, 0, 0, 0, 0 };
char parse_error[json_error_max];
@ -359,6 +361,16 @@ ParseGrammarResult parse_grammar(const string &input) {
}
}
word_rule_json = grammar_json->operator[]("word");
if (word_rule_json.type != json_none) {
if (word_rule_json.type != json_string) {
error_message = "Invalid word property";
goto error;
}
grammar.word_rule = NamedSymbol { word_rule_json.u.string.ptr };
}
json_value_free(grammar_json);
return { name, grammar, "" };


@ -106,6 +106,7 @@ InitialSyntaxGrammar expand_repeats(const InitialSyntaxGrammar &grammar) {
expander.aux_rules.end()
);
result.word_rule = grammar.word_rule;
return result;
}


@ -329,6 +329,18 @@ tuple<InitialSyntaxGrammar, LexicalGrammar, CompileError> extract_tokens(
}
}
syntax_grammar.word_rule = symbol_replacer.replace_symbol(grammar.word_rule);
if (syntax_grammar.word_rule.is_non_terminal()) {
return make_tuple(
syntax_grammar,
lexical_grammar,
CompileError(
TSCompileErrorTypeInvalidWordRule,
"Word rules must be tokens"
)
);
}
return make_tuple(syntax_grammar, lexical_grammar, CompileError::none());
}


@ -161,6 +161,8 @@ pair<SyntaxGrammar, CompileError> flatten_grammar(const InitialSyntaxGrammar &gr
i++;
}
result.word_rule = grammar.word_rule;
return {result, CompileError::none()};
}


@ -17,6 +17,7 @@ struct InitialSyntaxGrammar {
std::set<std::set<rules::Symbol>> expected_conflicts;
std::vector<ExternalToken> external_tokens;
std::set<rules::Symbol> variables_to_inline;
rules::Symbol word_rule;
};
} // namespace prepare_grammar


@ -166,6 +166,8 @@ pair<InternedGrammar, CompileError> intern_symbols(const InputGrammar &grammar)
}
}
result.word_rule = interner.intern_symbol(grammar.word_rule);
return {result, CompileError::none()};
}


@ -15,8 +15,8 @@ struct InternedGrammar {
std::vector<rules::Rule> extra_tokens;
std::set<std::set<rules::Symbol>> expected_conflicts;
std::vector<Variable> external_tokens;
std::set<rules::Symbol> blank_external_tokens;
std::set<rules::Symbol> variables_to_inline;
rules::Symbol word_rule;
};
} // namespace prepare_grammar


@ -60,6 +60,7 @@ struct SyntaxGrammar {
std::set<std::set<rules::Symbol>> expected_conflicts;
std::vector<ExternalToken> external_tokens;
std::set<rules::Symbol> variables_to_inline;
rules::Symbol word_rule;
};
} // namespace tree_sitter


@ -110,7 +110,7 @@ static inline void array__grow(VoidArray *self, size_t element_size) {
static inline void array__splice(VoidArray *self, size_t element_size,
uint32_t index, uint32_t old_count,
uint32_t new_count, void *elements) {
uint32_t new_count, const void *elements) {
uint32_t new_size = self->size + new_count - old_count;
uint32_t old_end = index + old_count;
uint32_t new_end = index + new_count;


@ -28,11 +28,11 @@ static inline TSNode ts_node__null() {
// TSNode - accessors
uint32_t ts_node_start_byte(const TSNode self) {
uint32_t ts_node_start_byte(TSNode self) {
return self.context[0];
}
TSPoint ts_node_start_point(const TSNode self) {
TSPoint ts_node_start_point(TSNode self) {
return (TSPoint) {self.context[1], self.context[2]};
}


@ -59,7 +59,7 @@ bool ts_external_scanner_state_eq(const ExternalScannerState *a, const ExternalS
// SubtreeArray
bool ts_subtree_array_copy(SubtreeArray self, SubtreeArray *dest) {
const Subtree **contents = NULL;
Subtree **contents = NULL;
if (self.capacity > 0) {
contents = ts_calloc(self.capacity, sizeof(Subtree *));
memcpy(contents, self.contents, self.size * sizeof(Subtree *));


@ -25,7 +25,8 @@ describe("ParseItemSetBuilder", []() {
LexicalGrammar lexical_grammar{lexical_variables, {}};
it("adds items at the beginnings of referenced rules", [&]() {
SyntaxGrammar grammar{{
SyntaxGrammar grammar;
grammar.variables = {
SyntaxVariable{"rule0", VariableTypeNamed, {
Production({
{Symbol::non_terminal(1), 0, AssociativityNone, Alias{}},
@ -47,7 +48,7 @@ describe("ParseItemSetBuilder", []() {
{Symbol::terminal(15), 0, AssociativityNone, Alias{}},
}, 0)
}},
}, {}, {}, {}, {}};
};
auto production = [&](int variable_index, int production_index) -> const Production & {
return grammar.variables[variable_index].productions[production_index];
@ -84,7 +85,8 @@ describe("ParseItemSetBuilder", []() {
});
it("handles rules with empty productions", [&]() {
SyntaxGrammar grammar{{
SyntaxGrammar grammar;
grammar.variables = {
SyntaxVariable{"rule0", VariableTypeNamed, {
Production({
{Symbol::non_terminal(1), 0, AssociativityNone, Alias{}},
@ -98,7 +100,7 @@ describe("ParseItemSetBuilder", []() {
}, 0),
Production{{}, 0}
}},
}, {}, {}, {}, {}};
};
auto production = [&](int variable_index, int production_index) -> const Production & {
return grammar.variables[variable_index].productions[production_index];


@ -11,11 +11,9 @@ START_TEST
describe("expand_repeats", []() {
it("replaces repeat rules with pairs of recursive rules", [&]() {
InitialSyntaxGrammar grammar{
{
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(0)}},
},
{}, {}, {}, {}
InitialSyntaxGrammar grammar;
grammar.variables = {
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(0)}},
};
auto result = expand_repeats(grammar);
@ -30,14 +28,12 @@ describe("expand_repeats", []() {
});
it("replaces repeats inside of sequences", [&]() {
InitialSyntaxGrammar grammar{
{
Variable{"rule0", VariableTypeNamed, Rule::seq({
Symbol::terminal(10),
Repeat{Symbol::terminal(11)},
})},
},
{}, {}, {}, {}
InitialSyntaxGrammar grammar;
grammar.variables = {
Variable{"rule0", VariableTypeNamed, Rule::seq({
Symbol::terminal(10),
Repeat{Symbol::terminal(11)},
})},
};
auto result = expand_repeats(grammar);
@ -55,14 +51,12 @@ describe("expand_repeats", []() {
});
it("replaces repeats inside of choices", [&]() {
InitialSyntaxGrammar grammar{
{
Variable{"rule0", VariableTypeNamed, Rule::choice({
Symbol::terminal(10),
Repeat{Symbol::terminal(11)}
})},
},
{}, {}, {}, {}
InitialSyntaxGrammar grammar;
grammar.variables = {
Variable{"rule0", VariableTypeNamed, Rule::choice({
Symbol::terminal(10),
Repeat{Symbol::terminal(11)}
})},
};
auto result = expand_repeats(grammar);
@ -80,18 +74,16 @@ describe("expand_repeats", []() {
});
it("does not create redundant auxiliary rules", [&]() {
InitialSyntaxGrammar grammar{
{
Variable{"rule0", VariableTypeNamed, Rule::choice({
Rule::seq({ Symbol::terminal(1), Repeat{Symbol::terminal(4)} }),
Rule::seq({ Symbol::terminal(2), Repeat{Symbol::terminal(4)} }),
})},
Variable{"rule1", VariableTypeNamed, Rule::seq({
Symbol::terminal(3),
Repeat{Symbol::terminal(4)}
})},
},
{}, {}, {}, {}
InitialSyntaxGrammar grammar;
grammar.variables = {
Variable{"rule0", VariableTypeNamed, Rule::choice({
Rule::seq({ Symbol::terminal(1), Repeat{Symbol::terminal(4)} }),
Rule::seq({ Symbol::terminal(2), Repeat{Symbol::terminal(4)} }),
})},
Variable{"rule1", VariableTypeNamed, Rule::seq({
Symbol::terminal(3),
Repeat{Symbol::terminal(4)}
})},
};
auto result = expand_repeats(grammar);
@ -113,14 +105,14 @@ describe("expand_repeats", []() {
});
it("can replace multiple repeats in the same rule", [&]() {
InitialSyntaxGrammar grammar{
InitialSyntaxGrammar grammar;
grammar.variables = {
{
Variable{"rule0", VariableTypeNamed, Rule::seq({
Repeat{Symbol::terminal(10)},
Repeat{Symbol::terminal(11)},
})},
},
{}, {}, {}, {}
}
};
auto result = expand_repeats(grammar);
@ -142,12 +134,10 @@ describe("expand_repeats", []() {
});
it("can replace repeats in multiple rules", [&]() {
InitialSyntaxGrammar grammar{
{
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(10)}},
Variable{"rule1", VariableTypeNamed, Repeat{Symbol::terminal(11)}},
},
{}, {}, {}, {}
InitialSyntaxGrammar grammar;
grammar.variables = {
Variable{"rule0", VariableTypeNamed, Repeat{Symbol::terminal(10)}},
Variable{"rule1", VariableTypeNamed, Repeat{Symbol::terminal(11)}},
};
auto result = expand_repeats(grammar);


@ -11,13 +11,11 @@ using prepare_grammar::intern_symbols;
describe("intern_symbols", []() {
it("replaces named symbols with numerically-indexed symbols", [&]() {
InputGrammar grammar{
{
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"_z"} })},
{"y", VariableTypeNamed, NamedSymbol{"_z"}},
{"_z", VariableTypeNamed, String{"stuff"}}
},
{}, {}, {}, {}
InputGrammar grammar;
grammar.variables = {
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"_z"} })},
{"y", VariableTypeNamed, NamedSymbol{"_z"}},
{"_z", VariableTypeNamed, String{"stuff"}}
};
auto result = intern_symbols(grammar);
@ -32,11 +30,9 @@ describe("intern_symbols", []() {
describe("when there are symbols that reference undefined rules", [&]() {
it("returns an error", []() {
InputGrammar grammar{
{
{"x", VariableTypeNamed, NamedSymbol{"y"}},
},
{}, {}, {}, {}
InputGrammar grammar;
grammar.variables = {
{"x", VariableTypeNamed, NamedSymbol{"y"}},
};
auto result = intern_symbols(grammar);
@ -46,16 +42,14 @@ describe("intern_symbols", []() {
});
it("translates the grammar's optional 'extra_tokens' to numerical symbols", [&]() {
InputGrammar grammar{
{
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
{"y", VariableTypeNamed, NamedSymbol{"z"}},
{"z", VariableTypeNamed, String{"stuff"}}
},
{
NamedSymbol{"z"}
},
{}, {}, {}
InputGrammar grammar;
grammar.variables = {
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
{"y", VariableTypeNamed, NamedSymbol{"z"}},
{"z", VariableTypeNamed, String{"stuff"}}
};
grammar.extra_tokens = {
NamedSymbol{"z"}
};
auto result = intern_symbols(grammar);
@ -66,19 +60,15 @@ describe("intern_symbols", []() {
});
it("records any rule names that match external token names", [&]() {
InputGrammar grammar{
{
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
{"y", VariableTypeNamed, NamedSymbol{"z"}},
{"z", VariableTypeNamed, String{"stuff"}},
},
{},
{},
{
NamedSymbol{"w"},
NamedSymbol{"z"},
},
{}
InputGrammar grammar;
grammar.variables = {
{"x", VariableTypeNamed, Rule::choice({ NamedSymbol{"y"}, NamedSymbol{"z"} })},
{"y", VariableTypeNamed, NamedSymbol{"z"}},
{"z", VariableTypeNamed, String{"stuff"}},
};
grammar.external_tokens = {
NamedSymbol{"w"},
NamedSymbol{"z"},
};
auto result = intern_symbols(grammar);