Commit graph

619 commits

Author SHA1 Message Date
Max Brunsfeld
d4264d6191 Fix parsing of quantifiers with no upper bound 2018-08-06 13:47:26 -07:00
Max Brunsfeld
126f84aa73 Avoid unnecessary suffixes on external symbol identifiers 2018-08-01 16:11:21 -07:00
Max Brunsfeld
cb784975a4 Add IMMEDIATE_TOKEN rule type, for enforcing no preceding extras 2018-08-01 14:00:57 -07:00
Max Brunsfeld
e88dd223b2 Support {} quantifier syntax in regexes 2018-07-25 11:29:41 -07:00
Max Brunsfeld
e130c4ddb5 Add word property to grammar JSON schema 2018-06-15 13:32:41 -07:00
Max Brunsfeld
0dd41f0d74 Restore logic for restricting keyword tokens
Removing this restriction created problems for the Rust grammar, and
possibly others. The proper fix would be to ensure that the 'word
token' matches *every* possible string that a 'keyword token'
matches, as opposed to just matching *some* of the same strings.
This would require us to gather a little more information
about how tokens conflict. For now, I'm just going to put back the
hard-coded logic that we had.
2018-06-15 13:15:02 -07:00
Max Brunsfeld
c39f0e9ef9 Rename word_rule -> word_token 2018-06-15 09:15:12 -07:00
Max Brunsfeld
91e3bc3e55 Update parse state merging logic for explicit word tokens
Co-Authored-By: Ashi Krishnan <queerviolet@github.com>
2018-06-14 12:32:27 -07:00
Max Brunsfeld
190456d7ec Fix logging during lex table construction
Co-Authored-By: Ashi Krishnan <queerviolet@github.com>
2018-06-14 12:03:40 -07:00
Max Brunsfeld
6e72c2943d Avoid missing field initializer warnings w/o default field syntax
The default field syntax aint working on windows
2018-06-14 11:12:04 -07:00
Max Brunsfeld
e17cd42e47 Perform keyword optimization using explicitly selected word token
rather than trying to infer the word token automatically.

Co-Authored-By: Ashi Krishnan <queerviolet@github.com>
2018-06-14 09:35:54 -07:00
Max Brunsfeld
8120e61d8d Remove blank lines from log messages 2018-05-25 21:37:25 -07:00
Max Brunsfeld
45c52f9459 Allow keywords to contain numbers, as long as they start w/ a letter 2018-05-25 21:28:47 -07:00
Max Brunsfeld
356d5e0221 Generalize logic for finding a keyword capture token 2018-05-25 15:29:15 -07:00
Max Brunsfeld
406a85a166 Log when lexical conflicts prevents parse state merging 2018-05-24 16:46:43 -07:00
Max Brunsfeld
915978aa9d Avoid redundant logging of conflicting tokens 2018-05-24 16:22:16 -07:00
Max Brunsfeld
6fca8f2f4d Make ts_compile_grammar take an optional log file, start logging to it 2018-05-24 16:01:14 -07:00
Max Brunsfeld
6bb63f549f Fix unused lambda captures 2018-05-11 14:59:59 -07:00
Max Brunsfeld
e8cfb9ced0 Remove incorrect return statement
This prevented conflicts between some tokens from being recorded
properly. In the case of JavaScript, it prevented tree-sitter from
recognizing the conflict between the forward slash operator and the
regex token, allowing regexes to be merged into parse states containing
'/' incorrectly.

Refs tree-sitter/tree-sitter-javascript#71
2018-04-17 17:14:36 -07:00
Max Brunsfeld
379a2fd121 Incrementally build a tree of skipped tokens
Rather than pushing them to the stack individually
2018-04-09 12:29:22 -07:00
Max Brunsfeld
1109a565fc
Merge pull request #159 from Pike/test-escape-brackets
Tests for issue 158
2018-04-06 13:19:36 -07:00
Max Brunsfeld
1ca261c79b Fix some regex parsing bugs
* Allow escape sequences to be used in ranges
* Don't give special meaning to dashes outside of character classes
2018-04-06 12:46:06 -07:00
Max Brunsfeld
65e654ea9b Remove overly conservative check for the validity of keyword capture tokens 2018-04-05 13:25:16 -07:00
Pieter Goetschalckx
3d0ca31cf1 Add error message for TSCompileErrorTypeInvalidTokenContents 2018-03-30 21:12:09 +02:00
Max Brunsfeld
fb348c0f1e Fix signed/unsigned comparison warning 2018-03-28 11:04:49 -07:00
Max Brunsfeld
e917756ad1 Remove depends_on_lookahead field from parse table entries
This simplifies the logic for determining whether a token is reusable
and makes it more conservative. It should fix some incremental parsing
bugs that are being caught by the randomized tests on CI.
2018-03-28 10:58:33 -07:00
Max Brunsfeld
186f70649c Consolidate the unify for detecting conflicting tokens 2018-03-28 10:03:09 -07:00
Max Brunsfeld
a8bc67ac42 Allow LookaheadSet::for_each to terminate early 2018-03-28 10:03:09 -07:00
Max Brunsfeld
43e14332ed Avoid creating duplicate metadata rules 2018-03-28 10:03:09 -07:00
Max Brunsfeld
b7d0606fbd Be less conservative in merging parse states with external tokens
Also, clean up the internal representation of external tokens
2018-03-16 16:00:40 -07:00
Max Brunsfeld
7183f8d3e7 Fix unit reduction elimination bugs
* Handle 'chains' of unit reductions starting in a single state
* Avoid eliminating rules which will later receive aliases
2018-03-12 07:54:18 -07:00
Max Brunsfeld
72849787b1 Fix logic for identifying keyword capture token 2018-03-12 07:52:57 -07:00
Max Brunsfeld
128edbebd6 Eliminate non-user-visible unit reductions from parse tables 2018-03-08 12:53:32 -08:00
Max Brunsfeld
53cd89c614 Ensure keyword capture tokens aren't too loosely defined 2018-03-07 14:46:11 -08:00
Max Brunsfeld
c0cc35ff07 Create separate lexer function for keywords 2018-03-07 12:00:26 -08:00
Max Brunsfeld
52087de4f0 Remove the concept of fragile reductions
They were a vestige of when Tree-sitter did sentential form-based
incremental parsing (as opposed to simply state matching). This was
elegant but not compatible with GLR as far as I could tell.
2018-03-02 14:51:54 -08:00
Max Brunsfeld
32ef3e001a Account for epsilon external tokens when merging parse states
Do not merge a token T into a parse state S if S contains
external tokens that can be *followed* by tokens that could
be shadowed by T.

At this point, the only automated test for this logic is via
the bash grammar, in which the `]` token should not be merged
into states in which `_concat` is valid, because `_concat`
can be followed by a `_special_characters` token, and `]`
would shadow `_special_characters`.
2018-02-28 14:47:04 -08:00
Max Brunsfeld
10a3cbd814 Move grammar schema to src folder
Now that there's a docs folder that contains actual docs.
2018-02-26 00:40:20 -08:00
Max Brunsfeld
2daae48fe0 Handle conflicts in repeat rules after external tokens
Signed-off-by: Rick Winfrey <rewinfrey@github.com>
2018-02-14 11:24:51 -08:00
Max Brunsfeld
8c29841adf Represent repetitions with associative structure 2018-02-12 11:41:56 -08:00
Max Brunsfeld
2e4f76c164 Don't allow an epsilon start rule if it is used in other rules 2018-01-23 17:05:28 -08:00
Max Brunsfeld
532bbeca0d Remove wrong handling of \a in a regex 2017-12-12 16:50:53 -08:00
Max Brunsfeld
493db39363 Never move the start rule of a grammar into the lexical grammar
This preserves a useful invariant that the root node of the AST is never
a token.
2017-12-07 11:50:27 -08:00
Max Brunsfeld
ba607a1f84 Optimize lex state merging 2017-09-18 13:40:37 -07:00
Max Brunsfeld
b0fdc33f73 Remove 'extra' and 'structural' booleans from symbol metadata 2017-09-14 12:07:46 -07:00
Max Brunsfeld
91456d7a17 Avoid duplicate error state entries for tokens that are both internal & external 2017-09-14 10:54:13 -07:00
Max Brunsfeld
99d048e016 Simplify error recovery; eliminate recovery states
The previous approach to error recovery relied on special error-recovery
states in the parse table. For each token T, there was an error recovery
state in which the parser looked for *any* token that could follow T.
Unfortunately, sometimes the set of tokens that could follow T contained
conflicts. For example, in JS, the token '}' can be followed by the
open-ended 'template_chars' token, but also by ordinary tokens like
'identifier'. So with the old algorithm, when recovering from an
unexpected '}' token, the lexer had no way to distinguish identifiers
from template_chars.

This commit drops the error recovery states. Instead, when we encounter
an unexpected token T, we recover from the error by finding a previous
state S in the stack in which T would be valid, popping all of the nodes
after S, and wrapping them in an error.

This way, the lexer is always invoked in a normal parse state, in which
it is looking for a non-conflicting set of tokens. Eliminating the error
recovery states also shrinks the lex state machine significantly.

Signed-off-by: Rick Winfrey <rewinfrey@github.com>
2017-09-11 15:22:52 -07:00
Max Brunsfeld
4c9c05806a Merge compatible starting token states before constructing lex table 2017-09-05 13:21:53 -07:00
Max Brunsfeld
9d668c5004 Move incompatible token map into LexTableBuilder 2017-08-31 15:46:37 -07:00
Max Brunsfeld
f8649824fa Remove unused function 2017-08-31 15:30:44 -07:00