The `Emoji` property alias is already present, but the actual property
is not available since it lives in a new file. This adds that file to
`generate-unicode-categories-json`.
The `emoji-data` file follows the same format as the ones we already
consume in `generate-unicode-categories-json`, so adding emoji support
is fairly easy. Without this, grammars would need to hard-code a set of
Unicode ranges in their own regex. The JavaScript library `emoji-regex`
cannot be used because of #451.
For unclear reasons, the characters #, *, and 0-9 are marked as
`Emoji=Yes` by `emoji-data.txt`. Because of this, a grammar that wishes
to use emojis is likely to want to exclude those characters. For that
reason, this change also adds support for binary operations in regexes,
e.g. `[\p{Emoji}&&[^#*0-9]]`.
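The meaning of `[\p{Emoji}&&[^#*0-9]]` can be read as a plain set intersection: a character matches only if it belongs to both classes. Below is a minimal C sketch of that reading. All function names are hypothetical, and `in_emoji_subset` is an assumption that covers only a tiny hand-picked subset of `Emoji=Yes` code points (the characters listed above plus the U+1F600..U+1F64F emoticons), not the real table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for \p{Emoji}: a tiny illustrative subset only,
 * NOT the full table from emoji-data.txt. */
static bool in_emoji_subset(int32_t c) {
  return c == '#' || c == '*' || (c >= '0' && c <= '9') ||
         (c >= 0x1F600 && c <= 0x1F64F);
}

/* The negated class [^#*0-9]. */
static bool not_keycap_base(int32_t c) {
  return !(c == '#' || c == '*' || (c >= '0' && c <= '9'));
}

/* [\p{Emoji}&&[^#*0-9]]: the intersection of the two classes above. */
static bool matches_emoji_class(int32_t c) {
  return in_emoji_subset(c) && not_keycap_base(c);
}
```

With this reading, `matches_emoji_class(0x1F600)` is true, while `'#'`, `'7'`, and `'a'` are all rejected: the first two by the right-hand class, the last by the left-hand one.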
Lastly (and perhaps controversially), this change introduces new
variables available at grammar compile time, for the major, minor, and
patch versions of the tree-sitter CLI used to compile the grammar. This
will allow grammars to conditionally adopt these new regex features
while remaining backward compatible with older versions of the CLI.
Without this part of the change, grammar authors who do not precompile
and check in their `grammar.json` would need to wait for downstream
systems to adopt a newer tree-sitter CLI version before they could begin
to use these features.
For some ABI changes, we may need to make changes to the parser.h in order
to restore a previous binary format, but for the current range of supported
ABI versions (13 + 14), the current parser.h is fine.
Refs #1599
Due to an oversight in #1589, I added `primary_state_ids` into the
`TSLanguage` struct in a place that wasn't the end. This is not actually
backwards compatible and causes downstream failures :(
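To illustrate why the position matters, here is a small C sketch. The structs are hypothetical, heavily trimmed stand-ins (not the real `TSLanguage` layout): inserting a field in the middle shifts the offset of every later field, so code compiled against the old layout reads the wrong memory, whereas appending leaves existing offsets untouched.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout that already-compiled binaries were built against. */
typedef struct {
  uint32_t version;
  const char *const *symbol_names;
} LanguageV1;

/* New field inserted before an existing one: later fields shift. */
typedef struct {
  uint32_t version;
  const uint16_t *primary_state_ids;
  const char *const *symbol_names;
} LanguageInserted;

/* New field appended at the end: existing offsets are preserved. */
typedef struct {
  uint32_t version;
  const char *const *symbol_names;
  const uint16_t *primary_state_ids;
} LanguageAppended;

/* True if old code can still find symbol_names in the appended layout. */
static bool appended_layout_is_compatible(void) {
  return offsetof(LanguageAppended, symbol_names) ==
         offsetof(LanguageV1, symbol_names);
}

/* True if old code can still find symbol_names in the inserted layout. */
static bool inserted_layout_is_compatible(void) {
  return offsetof(LanguageInserted, symbol_names) ==
         offsetof(LanguageV1, symbol_names);
}
```

Only the appended layout keeps `symbol_names` where older consumers expect it.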
This change exposes a new `primary_state_ids` field on the `TSLanguage`
struct, and populates it by tracking the first encountered state with a
given `core_id`. (For posterity: the initial change just exposed
`core_id` and deduplicated within `ts_analyze_query`).
With this `primary_state_ids` field in place, the
`ts_query__analyze_patterns` function only needs to populate its
subgraphs with starting states that are _primary_, since non-primary
states behave identically to primary ones. This leads to large savings
across the board, since most states are not primary.
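The "first encountered state with a given `core_id`" rule can be sketched roughly as follows. This is an illustrative C sketch under assumptions, not the generator's actual code: the function names, the flat `core_ids` array, and the `core_count` parameter are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NO_STATE UINT16_MAX

/* Hypothetical helper. core_ids[s] is the core id of parse state s, and
 * every core id is assumed to be < core_count. On success,
 * primary_state_ids[s] holds the first state that shares s's core id.
 * Returns false if the scratch allocation fails. */
static bool compute_primary_state_ids(const uint16_t *core_ids,
                                      uint16_t state_count,
                                      uint16_t core_count,
                                      uint16_t *primary_state_ids) {
  uint16_t *first_state_for_core =
      malloc((size_t)core_count * sizeof(uint16_t));
  if (core_count > 0 && first_state_for_core == NULL) return false;
  for (uint16_t c = 0; c < core_count; c++) first_state_for_core[c] = NO_STATE;

  for (uint16_t s = 0; s < state_count; s++) {
    uint16_t core = core_ids[s];
    /* The first state seen for a core becomes that core's primary state. */
    if (first_state_for_core[core] == NO_STATE) first_state_for_core[core] = s;
    primary_state_ids[s] = first_state_for_core[core];
  }

  free(first_state_for_core);
  return true;
}

/* A state is primary when it maps to itself. */
static bool is_primary_state(const uint16_t *primary_state_ids, uint16_t s) {
  return primary_state_ids[s] == s;
}
```

A consumer such as the query analysis could then skip any starting state for which `is_primary_state` is false.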
These tests are easier to write and maintain if the grammars are just JS,
like grammars normally are. It doesn't slow the tests down significantly
to shell out to `node` for each of these grammars.
This patch updates the CLI to use anyhow and thiserror for error
management. The main feature that our custom `Error` type was providing
was a _list_ of messages, which would allow us to annotate "lower-level"
errors with more contextual information. This is exactly what's
provided by anyhow's `Context` trait.
(This is setup work for a future PR that will pull the `config` and
`loader` modules out into separate crates; by using `anyhow` we wouldn't
have to deal with a circular dependency between the new crates.)
This is a follow-up to my previous commit 1badd131f9.
I've made this an extra patch as it requires a minor
API change in <tree_sitter/parser.h>.
This commit moves the remaining generated tables into
the read-only segment.
Before:
$ for f in bash c cpp go html java javascript jsdoc json php python ruby rust; do \
gcc -o $f.o -O2 -Ilib/include -c test/fixtures/grammars/$f/src/parser.c; \
done
$ size --totals *.o
text data bss dec hex filename
5353477 24472 0 5377949 520f9d (TOTALS)
After:
$ for f in bash c cpp go html java javascript jsdoc json php python ruby rust; do \
gcc -o $f.o -O2 -Ilib/include -c test/fixtures/grammars/$f/src/parser.c; \
done
$ size --totals *.o
text data bss dec hex filename
5378147 0 0 5378147 521063 (TOTALS)
This moves most of the generated tables from the data segment into
the text segment (read-only memory) so that they can be shared between
different processes.
As a bonus side effect we can also remove all casts in the generated parsers.
Before:
size --totals target/scratch/*.so
text data bss dec hex filename
853623 4684560 2160 5540343 5489f7 (TOTALS)
After:
size --totals target/scratch/*.so
text data bss dec hex filename
5472086 68616 480 5541182 548d3e (TOTALS)
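For context, the mechanism behind this kind of shift is `const` qualification. The C sketch below is illustrative only: the table names and contents are made up, and exact section placement depends on the compiler, the linker, and whether the code is position-independent.

```c
#include <stddef.h>
#include <stdint.h>

/* Without const, the table is writable and typically lands in .data. */
static uint16_t writable_table[] = {1, 2, 3, 5, 8};

/* With const, the compiler can typically place it in a read-only section. */
static const uint16_t readonly_table[] = {1, 2, 3, 5, 8};

/* For an array of strings, the pointers themselves must be const as well;
 * `const char *names[]` alone leaves the pointer array writable. */
static const char *const symbol_names[] = {"end", "identifier", "number"};

static uint16_t lookup_writable(size_t i) { return writable_table[i]; }
static uint16_t lookup_readonly(size_t i) { return readonly_table[i]; }
static const char *symbol_name(size_t i) { return symbol_names[i]; }
```

Arrays of pointers in position-independent code still need load-time relocations, which is one plausible reason a shared object keeps a small non-zero data size even after everything is `const`.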
tree-sitter 0.19.0 bumped the language version from 12 to 13. `npm install tree-sitter-cli` gets a recent version of tree-sitter, which generates languages with language version 13. However, the Cargo.toml generated from `tree-sitter generate` still has an old tree-sitter as a dependency. This causes the Rust bindings to not work out of the box, because the older tree-sitter library expects language version 12.
It would be nice to add a test for this in CI. `tree-sitter generate` already creates a test for the Rust binding, and that test fails out of the box due to the language mismatch.
No need to restrict it to char sets used in multiple places.
This is important because the helper functions, which use a binary
search, are now more efficient than the inline comparisons.
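A binary-search membership helper of this kind might look like the C sketch below. The `CharRange` type, the function names, and the example set are hypothetical and are not taken from the generated parsers; the ranges are assumed to be sorted by code point and non-overlapping.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inclusive code point range. */
typedef struct {
  int32_t start;
  int32_t end;
} CharRange;

/* Example set [0-9A-Z_a-z], sorted by code point. */
static const CharRange identifier_chars[] = {
  {'0', '9'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
};

/* Binary search over sorted, non-overlapping ranges: O(log n) comparisons
 * instead of one inline comparison pair per range. */
static bool char_set_contains(const CharRange *ranges, size_t len, int32_t c) {
  size_t lo = 0, hi = len;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (c < ranges[mid].start) {
      hi = mid;
    } else if (c > ranges[mid].end) {
      lo = mid + 1;
    } else {
      return true;
    }
  }
  return false;
}

static bool is_identifier_char(int32_t c) {
  return char_set_contains(
      identifier_chars,
      sizeof identifier_chars / sizeof identifier_chars[0], c);
}
```

For a four-range set the gain is negligible, but for large sets such as Unicode property classes the search touches only a handful of ranges per lookup.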