From e3e7c8ed9d5bd99d8f67049097400d6a4333e686 Mon Sep 17 00:00:00 2001
From: Max Brunsfeld <maxbrunsfeld@gmail.com>
Date: Sat, 24 Feb 2018 23:47:37 -0800
Subject: [PATCH] Start work on grammar writing doc

[ci skip]
---
 docs/creating-parsers.md | 69 ++++++++++++++++++++++++++++++++++++++++
 docs/index.md            |  4 +++
 2 files changed, 73 insertions(+)
 create mode 100644 docs/creating-parsers.md

diff --git a/docs/creating-parsers.md b/docs/creating-parsers.md
new file mode 100644
index 00000000..d29d83d1
--- /dev/null
+++ b/docs/creating-parsers.md
@@ -0,0 +1,69 @@
+# Creating parsers
+
+Developing Tree-sitter parsers can have a difficult learning curve, but once you get the hang of it, it can be fun and even zen-like. This document should help you to build an effective mental model for parser development.
+
+## Introduction
+
+Writing a grammar requires creativity. There are an infinite number of context-free grammars that can be used to describe any given language. In order to create a good Tree-sitter parser, you need to create a grammar with two important properties:
+
+  1. **An intuitive structure** - Tree-sitter's output is a [concrete syntax tree][cst]; each node in the tree corresponds directly to a [terminal or non-terminal symbol][non-terminal] in the grammar. So in order to produce an easy-to-analyze tree, there should be a direct correspondence between the symbols in your grammar and the recognizable constructs in the language. This might seem obvious, but it is very different from the way that context-free grammars are often written for [language specifications][language-spec] or [Yacc][yacc] parsers.
+
+  2. **A close adherence to LR** - Tree-sitter is based on the [GLR parsing][glr-parsing] algorithm. This means that while it can handle any context-free grammar, it works most efficiently with a class of context-free grammars called [LR Grammars][lr-grammars]. In this respect, Tree-sitter's grammars are similar to (but less restrictive than) Yacc grammars, but very different from [ANTLR grammars][antlr], [Parsing Expression Grammars][peg], or the [ambiguous grammars][ambiguous-grammar] commonly used in language specifications.
+
+It's unlikely that you'll be able to satisfy these two properties by translating an existing context-free grammar directly into Tree-sitter's grammar format. There are a few kinds of adjustments that are often required. The following sections will explain these adjustments in more depth.
+
+## Producing an intuitive tree
+
+Imagine that you were just starting work on the [Tree-sitter JavaScript parser][tree-sitter-javascript]. You might try to directly mirror the structure of the [ECMAScript Language Spec][ecmascript-spec]. To illustrate the problem with this approach, consider the following line of code:
+
+```js
+return x + y;
+```
+
+According to the specification, this line is a `ReturnStatement`, the string `x + y` is an `AdditiveExpression`, and `x` and `y` are both `IdentifierReference`s. The relationship between these constructs is captured by a complex series of production rules:
+
+```
+ReturnStatement          ->  'return' Expression
+Expression               ->  AssignmentExpression
+AssignmentExpression     ->  ConditionalExpression
+ConditionalExpression    ->  LogicalORExpression
+LogicalORExpression      ->  LogicalANDExpression
+LogicalANDExpression     ->  BitwiseORExpression
+BitwiseORExpression      ->  BitwiseXORExpression
+BitwiseXORExpression     ->  BitwiseANDExpression
+BitwiseANDExpression     ->  EqualityExpression
+EqualityExpression       ->  RelationalExpression
+RelationalExpression     ->  ShiftExpression
+ShiftExpression          ->  AdditiveExpression
+AdditiveExpression       ->  MultiplicativeExpression
+MultiplicativeExpression ->  ExponentiationExpression
+ExponentiationExpression ->  UnaryExpression
+UnaryExpression          ->  UpdateExpression
+UpdateExpression         ->  LeftHandSideExpression
+LeftHandSideExpression   ->  NewExpression
+NewExpression            ->  MemberExpression
+MemberExpression         ->  PrimaryExpression
+PrimaryExpression        ->  IdentifierReference
+```
+
+The language spec encodes the 20 different precedence levels of JavaScript expressions using 20 different non-terminal symbols. If we were to create a concrete syntax tree representing this statement according to the language spec, it would have 20 levels of nesting, and it would contain nodes with names like `BitwiseXORExpression`, which have nothing to do with the code that was actually written.
+
+### Precedence Annotations
+
+Clearly, we need a different way of modeling JavaScript expressions.
+
+...
+
+## Dealing with LR conflicts
+
+[cst]: https://en.wikipedia.org/wiki/Parse_tree
+[non-terminal]: https://en.wikipedia.org/wiki/Terminal_and_nonterminal_symbols
+[language-spec]: https://en.wikipedia.org/wiki/Programming_language_specification
+[glr-parsing]: https://en.wikipedia.org/wiki/GLR_parser
+[lr-grammars]: https://en.wikipedia.org/wiki/LR_parser
+[yacc]: https://en.wikipedia.org/wiki/Yacc
+[antlr]: http://www.antlr.org/
+[peg]: https://en.wikipedia.org/wiki/Parsing_expression_grammar
+[ambiguous-grammar]: https://en.wikipedia.org/wiki/Ambiguous_grammar
+[tree-sitter-javascript]: https://github.com/tree-sitter/tree-sitter-javascript
+[ecmascript-spec]: https://www.ecma-international.org/ecma-262/6.0
diff --git a/docs/index.md b/docs/index.md
index 7ebc768b..7ad56def 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,3 +4,7 @@ Tree-sitter is a library for parsing source code. It aims to be:
 * **Dependency-free** and written in pure C so that it can be embedded in any application
 * **Fast** and incremental so that it can be used in a text editor
 * **Robust** enough to provide useful results even in the presence of syntax errors
+
+## Table of contents
+
+1. [Creating parsers](creating-parsers.md)