lib: remove utf8proc dependency (#436)

* Remove dependency on utf8proc

This removes the dependency on utf8proc, which was used for UTF-8 decoding, by implementing a dedicated UTF-8 decoder. The new decoder is faster and has a simpler API.

 * .gitmodules: remove utf8proc submodule
 * docs/section-2-using-parsers.md: remove requirement for utf8proc submodule
 * docs/section-6-contributing.md: likewise
 * lib/Cargo.toml: remove utf8proc subdirectory package include
 * lib/README.md: remove utf8proc subdirectory description
 * lib/binding_rust/build.rs: remove utf8proc compiler include directory
 * lib/src/lexer.c: remove utf8proc dependencies and types
 * lib/src/lib.c: remove utf8proc dependency
 * lib/src/unicode.h: define types for Unicode decoders
 * lib/src/utf16.{c,h}: implement more readable UTF-16 decoder
 * lib/src/utf8.{c,h}: implement fast UTF-8 decoder
 * lib/utf8proc: remove utf8proc submodule directory
 * script/build-lib: remove utf8proc compiler include directory
 * script/build-wasm: likewise
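
For readers unfamiliar with what a small standalone UTF-8 decoder looks like, here is a minimal sketch. It is not the decoder added in lib/src/utf8.c. The name `sketch_decode_utf8` is hypothetical, and the calling shape (byte pointer, remaining length, output code point, returned byte count with 0 meaning failure) is only assumed from how `ts_decode_utf8` is called in the diff.

```c
#include <stdint.h>

// Hypothetical helper for illustration; not the decoder from this commit.
// Decodes one code point from `bytes`, reading at most `length` bytes.
// Returns the number of bytes consumed, or 0 if the input is empty,
// truncated, or malformed. On success the result is stored in *code_point.
static uint32_t sketch_decode_utf8(
  const uint8_t *bytes,
  uint32_t length,
  int32_t *code_point
) {
  if (length == 0) return 0;

  uint8_t lead = bytes[0];
  uint32_t size;
  int32_t value;
  int32_t min_value;  // smallest code point that legitimately needs `size` bytes

  if (lead < 0x80) {
    *code_point = lead;  // ASCII fast path
    return 1;
  } else if ((lead & 0xE0) == 0xC0) {
    size = 2; value = lead & 0x1F; min_value = 0x80;
  } else if ((lead & 0xF0) == 0xE0) {
    size = 3; value = lead & 0x0F; min_value = 0x800;
  } else if ((lead & 0xF8) == 0xF0) {
    size = 4; value = lead & 0x07; min_value = 0x10000;
  } else {
    return 0;  // stray continuation byte or invalid lead byte
  }

  if (length < size) return 0;  // sequence is cut off by the end of the buffer

  for (uint32_t i = 1; i < size; i++) {
    if ((bytes[i] & 0xC0) != 0x80) return 0;  // not a continuation byte
    value = (value << 6) | (bytes[i] & 0x3F);
  }

  if (value < min_value) return 0;                    // overlong encoding
  if (value > 0x10FFFF) return 0;                     // beyond the Unicode range
  if (value >= 0xD800 && value <= 0xDFFF) return 0;   // surrogate code point

  *code_point = value;
  return size;
}
```

Returning the consumed byte count lets a caller advance its input pointer by that amount and treat 0 as "stop", which is the pattern `stream_advance` follows in the diff.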

* Optimize ts_lexer__get_lookahead.

Favor the non-failure code path and assign lookahead values directly to the lexer.

 * lib/src/lexer.c: optimize for non-failure code path

* Fix some compiler errors

 * lib/src/lexer.c: cast from signed to unsigned for decode_next result
 * lib/src/utf16.c: fix non-constant initializers for older compilers

* Remove some missed remnants of utf8proc

 * docs/section-2-using-parsers.md: only two include paths necessary now
 * lib/src/lib.c: no need to define UTF8PROC_STATIC

* Use ICU's utf8 and utf16 decoding routines

* Remove unnecessary casts when calling icu macros

* Check buffer length before attempting to decode a unicode character
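
The length check matters most when a character spans several code units and the buffer ends partway through it. The sketch below illustrates this for UTF-16 surrogate pairs. It is a hand-written example, not the ICU-based code this commit uses. The name `sketch_decode_utf16` and its signature are hypothetical, and rejecting unpaired surrogates is a choice made for this example only.

```c
#include <stdint.h>

// Hypothetical helper for illustration; not the routine used by this commit.
// Decodes one code point from `units`, where `length` is the number of
// 16-bit code units available. Returns the number of units consumed
// (1 or 2), or 0 if the buffer is empty or holds an unpaired surrogate.
static uint32_t sketch_decode_utf16(
  const uint16_t *units,
  uint32_t length,
  int32_t *code_point
) {
  if (length == 0) return 0;  // check the buffer before reading anything

  uint16_t first = units[0];
  if (first < 0xD800 || first > 0xDFFF) {
    *code_point = first;  // ordinary BMP character, one unit
    return 1;
  }

  if (first >= 0xDC00) return 0;  // low surrogate with no preceding high one
  if (length < 2) return 0;       // high surrogate at buffer end: do not read past it

  uint16_t second = units[1];
  if (second < 0xDC00 || second > 0xDFFF) return 0;  // high surrogate not followed by low

  *code_point = 0x10000
    + ((int32_t)(first - 0xD800) << 10)
    + (int32_t)(second - 0xDC00);
  return 2;
}
```

Without the `length < 2` guard, a high surrogate in the final position would cause a read one unit past the end of the buffer.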

* Use new unicode function when parsing Queries

Co-Authored-By: Matthew Krupcale <mkrupcale@matthewkrupcale.com>

* Mark libicu files as vendored for GitHub's stats
Matthew Krupcale 2019-10-14 14:18:39 -04:00 committed by Max Brunsfeld
parent dc7997fdbb
commit ee9a3c0ebb
25 changed files with 2585 additions and 104 deletions

@@ -4,7 +4,6 @@
 #include "./bits.h"
 #include "./point.h"
 #include "./tree_cursor.h"
-#include "utf8proc.h"
 #include <wctype.h>
 /*
@@ -149,16 +148,22 @@ static const uint16_t MAX_STATE_COUNT = 32;
 // Advance to the next unicode code point in the stream.
 static bool stream_advance(Stream *self) {
-  if (self->input >= self->end) return false;
   self->input += self->next_size;
-  int size = utf8proc_iterate(
-    (const uint8_t *)self->input,
-    self->end - self->input,
-    &self->next
-  );
-  if (size <= 0) return false;
-  self->next_size = size;
-  return true;
+  if (self->input < self->end) {
+    uint32_t size = ts_decode_utf8(
+      (const uint8_t *)self->input,
+      self->end - self->input,
+      &self->next
+    );
+    if (size > 0) {
+      self->next_size = size;
+      return true;
+    }
+  } else {
+    self->next_size = 0;
+    self->next = '\0';
+  }
+  return false;
 }
 // Reset the stream to the given input position, represented as a pointer