diff --git a/docs/assets/css/style.scss b/docs/assets/css/style.scss
index 2fe29224..75ec2f4f 100644
--- a/docs/assets/css/style.scss
+++ b/docs/assets/css/style.scss
@@ -8,7 +8,8 @@
 }

 #table-of-contents {
-  border-right: 1px solid #ddd;
+  border-right: 1px solid #ccc;
+  border-bottom: 1px solid #ccc;
 }

 .nav-link.active {
@@ -21,7 +22,7 @@
   display: block;
 }

-.toc-section, .logo {
+.toc-section:not(:last-child), .logo {
   border-bottom: 1px solid #ccc;
 }

diff --git a/docs/index.md b/docs/index.md
index d11d34dd..9e8a720c 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,11 +4,11 @@ title: Introduction

 # Introduction

-Tree-sitter is a library for parsing source code. It aims to be:
+Tree-sitter is an incremental parsing library. It can be used to build a concrete syntax tree for a source file and to efficiently update the syntax tree as the source file is edited. Tree-sitter aims to be:

-* **Fast** and incremental so that it can be used in a text editor
-* **Robust** enough to provide useful results even in the presence of syntax errors
 * **General** enough to parse any programming language
+* **Fast** enough to parse on every keystroke in a text editor
+* **Robust** enough to provide useful results even in the presence of syntax errors
 * **Dependency-free** (and written in pure C) so that it can be embedded in any application

 ### Language Bindings
@@ -48,3 +48,14 @@ There are parsers in development for these languages:

 * [FOSDEM 2018](https://www.youtube.com/watch?v=0CGzC_iss-8)
 * [GitHub Universe 2017](https://www.youtube.com/watch?v=a1rC79DHpmY)
+
+### Underlying Research
+
+The design of Tree-sitter was greatly influenced by the following research papers:
+
+- [Practical Algorithms for Incremental Software Development Environments](https://www2.eecs.berkeley.edu/Pubs/TechRpts/1997/CSD-97-946.pdf)
+- [Context Aware Scanning for Parsing Extensible Languages](http://www.umsec.umn.edu/publications/Context-Aware-Scanning-Parsing-Extensible)
+- [Efficient and Flexible Incremental Parsing](http://ftp.cs.berkeley.edu/sggs/toplas-parsing.ps)
+- [Incremental Analysis of Real Programming Languages](https://pdfs.semanticscholar.org/ca69/018c29cc415820ed207d7e1d391e2da1656f.pdf)
+- [Error Detection and Recovery in LR Parsers](http://what-when-how.com/compiler-writing/bottom-up-parsing-compiler-writing-part-13)
+- [Error Recovery for LR Parsers](http://www.dtic.mil/dtic/tr/fulltext/u2/a043470.pdf)
diff --git a/docs/section-2-architecture.md b/docs/section-2-architecture.md
index ad007cce..a1101d44 100644
--- a/docs/section-2-architecture.md
+++ b/docs/section-2-architecture.md
@@ -9,10 +9,16 @@ Tree-sitter consists of two separate libraries, both of which expose C APIs.

 The first library, `libcompiler`, is used to generate a parser for a language by supplying a [context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar) describing the
-language. `libcompiler` is a build tool; once the parser has been generated, it is no longer needed. Its public interface is specified in the header file [`compiler.h`](https://github.com/tree-sitter/tree-sitter/blob/master/include/tree_sitter/compiler.h).
+language. `libcompiler` is a build tool; it is no longer needed once a parser has been generated. Its public interface is specified in the header file [`compiler.h`](https://github.com/tree-sitter/tree-sitter/blob/master/include/tree_sitter/compiler.h).

 The second library, `libruntime`, is used in combination with the parsers generated by `libcompiler`, to produce syntax trees from source code and keep the syntax trees up-to-date as the source code changes. `libruntime` is designed to be embedded in applications. Its interface is specified in the header file [`runtime.h`](https://github.com/tree-sitter/tree-sitter/blob/master/include/tree_sitter/runtime.h).

 ## The Compiler
+
+WIP
+
+## The Runtime
+
+WIP
diff --git a/docs/section-3-creating-parsers.md b/docs/section-3-creating-parsers.md
index 1de02833..5d09670e 100644
--- a/docs/section-3-creating-parsers.md
+++ b/docs/section-3-creating-parsers.md
@@ -357,6 +357,8 @@ You may have noticed in the above examples that some of the grammar rule name li

 ## Dealing with LR conflicts

+TODO
+
 [cst]: https://en.wikipedia.org/wiki/Parse_tree
 [non-terminal]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols
 [language-spec]: https://en.wikipedia.org/wiki/Programming_language_specification
diff --git a/docs/section-4-using-parsers.md b/docs/section-4-using-parsers.md
index 903b4527..054e0e2e 100644
--- a/docs/section-4-using-parsers.md
+++ b/docs/section-4-using-parsers.md
@@ -5,4 +5,77 @@ permalink: using-parsers

 # Using Parsers

-WIP
+A Tree-sitter parser consists of a single C source file which exports one function with the naming scheme `tree_sitter_${LANGUAGE_NAME}`. This function returns a pointer to a `TSLanguage` struct, which can be used in conjunction with a `TSParser` to produce syntax trees.
+
+## The Raw C API
+
+Here's an example of a simple C program that uses the Tree-sitter [JSON parser](https://github.com/tree-sitter/tree-sitter-json).
+
+```c
+// Filename - test-json-parser.c
+
+#include <assert.h>
+#include <string.h>
+#include <stdio.h>
+#include "tree_sitter/runtime.h"
+
+TSLanguage *tree_sitter_json();
+
+int main() {
+  // Create a parser with the JSON language.
+  TSParser *parser = ts_parser_new();
+  ts_parser_set_language(parser, tree_sitter_json());
+
+  // Parse some source code.
+  const char *source_code = "[1, null]";
+  TSTree *tree = ts_parser_parse_string(parser, NULL, source_code, strlen(source_code));
+
+  // Find some syntax tree nodes.
+  TSNode root_node = ts_tree_root_node(tree);
+  TSNode array_node = ts_node_named_child(root_node, 0);
+  TSNode number_node = ts_node_named_child(array_node, 0);
+
+  // Check that the nodes have the expected types.
+  assert(!strcmp(ts_node_type(root_node), "value"));
+  assert(!strcmp(ts_node_type(array_node), "array"));
+  assert(!strcmp(ts_node_type(number_node), "number"));
+
+  // Check that the nodes have the expected child counts.
+  assert(ts_node_child_count(root_node) == 1);
+  assert(ts_node_child_count(array_node) == 5); // "[", number, ",", null, "]"
+  assert(ts_node_named_child_count(array_node) == 2);
+  assert(ts_node_child_count(number_node) == 0);
+
+  // Print the syntax tree as an S-expression.
+  char *string = ts_node_string(root_node);
+  printf("Syntax tree: %s\n", string);
+
+  // Free all of the heap allocations.
+  free(string);
+  ts_tree_delete(tree);
+  ts_parser_delete(parser);
+  return 0;
+}
+```
+
+This program uses the Tree-sitter C API, which is declared in the header file `tree_sitter/runtime.h`, so we need to add the `tree-sitter/include` directory to the include path. We also need to link `libruntime.a` into the binary.
+
+```sh
+clang \
+  -I tree-sitter/include \
+  test-json-parser.c \
+  tree-sitter-json/src/parser.c \
+  tree-sitter/out/Release/libruntime.a \
+  -o test-json-parser
+
+./test-json-parser
+```
+
+### Providing the text to parse
+
+Text input is provided to a Tree-sitter parser via a `TSInput` struct, which contains function pointers for seeking to positions in the text, and for reading chunks of text. The text can be encoded in either UTF-8 or UTF-16. This interface allows you to efficiently parse text that is stored in your own data structure.
+
+### Querying the syntax tree
+
+Tree-sitter provides a DOM-style interface for inspecting syntax trees. Functions like `ts_node_child(node, index)` and `ts_node_next_sibling(node)` expose every node in the concrete syntax tree. This is useful for operations like syntax highlighting, which operate on a token-by-token basis. You can also traverse the tree in a more abstract way by using functions like
+`ts_node_named_child(node, index)` and `ts_node_next_named_sibling(node)`. These functions don't expose nodes that were specified in the grammar as anonymous tokens, like `:` and `{`. This is useful when analyzing the meaning of a document.
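
The named versus anonymous distinction described in the last added paragraph can be illustrated with a small self-contained sketch. It does not call the Tree-sitter API: `ToyNode`, `toy_child_count`, `toy_named_child_count`, `toy_named_child`, and `TOY_ARRAY` are hypothetical stand-ins, and the hard-coded children model the `array` node for `[1, null]` under the assumption that the brackets and the comma are anonymous tokens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical stand-in for a syntax node. This is NOT Tree-sitter's TSNode.
typedef struct ToyNode {
  const char *type;                // e.g. "array", "number", or "," for a token
  bool is_named;                   // false for anonymous tokens such as "[" and ","
  const struct ToyNode *children;  // contiguous array of children, or NULL
  size_t child_count;
} ToyNode;

// Concrete view: counts every child, named or anonymous.
size_t toy_child_count(const ToyNode *node) {
  assert(node != NULL);
  return node->child_count;
}

// Abstract view: counts only the named children.
size_t toy_named_child_count(const ToyNode *node) {
  assert(node != NULL);
  size_t count = 0;
  for (size_t i = 0; i < node->child_count; i++) {
    if (node->children[i].is_named) count++;
  }
  return count;
}

// Returns the index-th named child, skipping anonymous tokens,
// or NULL when there are not that many named children.
const ToyNode *toy_named_child(const ToyNode *node, size_t index) {
  assert(node != NULL);
  for (size_t i = 0; i < node->child_count; i++) {
    if (!node->children[i].is_named) continue;
    if (index == 0) return &node->children[i];
    index--;
  }
  return NULL;
}

// Assumed shape of the array node for the source text "[1, null]".
const ToyNode TOY_ARRAY_CHILDREN[] = {
  {"[", false, NULL, 0},
  {"number", true, NULL, 0},
  {",", false, NULL, 0},
  {"null", true, NULL, 0},
  {"]", false, NULL, 0},
};

const ToyNode TOY_ARRAY = {"array", true, TOY_ARRAY_CHILDREN, 5};
```

In this model, `toy_child_count(&TOY_ARRAY)` sees all five children, while `toy_named_child_count(&TOY_ARRAY)` sees only `number` and `null`, which mirrors the two views of the tree that the paragraph describes.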