tree-sitter/docs/creating-parsers.md
2018-02-26 11:19:45 -08:00


Creating parsers

Developing Tree-sitter parsers can have a difficult learning curve, but once you get the hang of it, it can be fun and even zen-like. This document should help you build an effective mental model for parser development.

Understanding the problem

Writing a grammar requires creativity. There are an infinite number of CFGs (context-free grammars) that can be used to describe any given language. In order to produce a good Tree-sitter parser, you need to create a grammar with two important properties:

  1. An intuitive structure - Tree-sitter's output is a concrete syntax tree; each node in the tree corresponds directly to a terminal or non-terminal symbol in the grammar. So in order to produce an easy-to-analyze tree, there should be a direct correspondence between the symbols in your grammar and the recognizable constructs in the language. This might seem obvious, but it is very different from the way that context-free grammars are often written in contexts like language specifications or Yacc parsers.

  2. A close adherence to LR(1) - Tree-sitter is based on the GLR parsing algorithm. This means that while it can handle any context-free grammar, it works most efficiently with a class of context-free grammars called LR(1) grammars. In this respect, Tree-sitter's grammars are similar to (but less restrictive than) Yacc grammars, but different from ANTLR grammars, Parsing Expression Grammars, or the ambiguous grammars commonly used in language specifications.

It's unlikely that you'll be able to satisfy these two properties just by translating an existing context-free grammar directly into Tree-sitter's grammar format. There are a few kinds of adjustments that are often required. The following sections will explain these adjustments in more depth.

Installing the tools

The best way to create a Tree-sitter parser is with the Tree-sitter CLI, which is distributed as a Node.js module. To install it, first install node and its package manager npm on your system. Then create a new directory for your parser, with a package.json file inside the directory. Add tree-sitter-cli to the dependencies section of package.json and run the command npm install. This will install the CLI and its dependencies into the node_modules folder in your directory. An executable program called tree-sitter will be created at the path ./node_modules/.bin/tree-sitter. You may want to follow the Node.js convention of adding ./node_modules/.bin to your PATH so that you can easily run this program when working in this directory.
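In a shell, the setup described above might look like the following. The directory name tree-sitter-mylang and the package.json contents are just placeholders, and the npm install step requires network access:

```shell
# Create a directory for the new parser (the name is a placeholder).
mkdir tree-sitter-mylang
cd tree-sitter-mylang

# Create a minimal package.json that lists tree-sitter-cli as a dependency.
cat > package.json << 'EOF'
{
  "name": "tree-sitter-mylang",
  "version": "0.0.1",
  "dependencies": {
    "tree-sitter-cli": "*"
  }
}
EOF

# Install the CLI and its dependencies into node_modules (needs network access).
npm install

# Optionally, put ./node_modules/.bin on your PATH while working in this directory.
export PATH=$PWD/node_modules/.bin:$PATH
```

After this, the tree-sitter executable can be run directly from the shell in this directory.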

Once you have the CLI installed, create a file called grammar.js with the following skeleton:

module.exports = grammar({
  name: 'the_language_name',

  rules: {
    // the production rules of the context-free grammar
  }
});

Starting to define the grammar

It's usually a good idea to find a formal specification for the language you're trying to parse. This specification will most likely contain a context-free grammar. As you read through the rules of this CFG, you will probably discover a complex and cyclic graph of relationships. It might be unclear how you should navigate this graph as you define your grammar.

Although languages have very different constructs, their constructs can often be categorized into similar groups like Declarations, Definitions, Statements, Expressions, Types, and Patterns. In writing your grammar, a good first step is to create just enough structure to include all of these basic groups of rules. For an imaginary C-like language, this might look something like this:

rules: {
  source_file: $ => repeat($._definition),

  _definition: $ => choice(
    $.function_definition
    // TODO: other kinds of definitions
  ),

  function_definition: $ => seq(
    'func',
    $.identifier,
    $.parameter_list,
    $._type,
    $.block
  ),

  parameter_list: $ => seq(
    '(',
     // TODO: parameters
    ')'
  ),

  _type: $ => choice(
    'bool'
    // TODO: other kinds of types
  ),

  block: $ => seq(
    '{',
    repeat($._statement),
    '}'
  ),

  _statement: $ => choice(
    $.return_statement
    // TODO: other kinds of statements
  ),

  return_statement: $ => seq(
    'return',
    $._expression,
    ';'
  ),

  _expression: $ => choice(
    $.identifier,
    $.number
    // TODO: other kinds of expressions
  ),

  identifier: $ => /[a-z]+/,

  number: $ => /\d+/
}
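The identifier and number rules at the bottom are ordinary JavaScript regular expressions. As a quick sanity check of what those patterns match, here is a plain JavaScript snippet (not part of the grammar; the ^ and $ anchors are added here only so each check covers the whole string):

```javascript
// The same patterns used by the identifier and number rules above,
// anchored so that test() checks the entire input string.
const identifier = /^[a-z]+$/;
const number = /^\d+$/;

console.log(identifier.test("foo"));   // true
console.log(identifier.test("Foo"));   // false: only lowercase letters are allowed
console.log(number.test("123"));       // true
```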

Some of the details of this grammar will be explained in more depth later on, but if you focus on the TODO comments, you can see that the overall strategy is a breadth-first approach. With this structure in place, you can now freely decide what part of the grammar to flesh out next.

For example, you might decide to start with types. One-by-one, you could define the rules for writing basic types and composing them into more complex types:

_type: $ => choice(
  $.primitive_type,
  $.array_type,
  $.pointer_type
),

primitive_type: $ => choice(
  'bool',
  'int'
),

array_type: $ => seq(
  '[',
  ']',
  $._type
),

pointer_type: $ => seq(
  '*',
  $._type
),
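Because each compound type wraps another type, these rules compose naturally. For example, the type *[]bool (a pointer to an array of bool) nests one node inside another. A corpus test for it might look like this (the function wrapper is needed because, in this skeleton grammar, types only appear inside function definitions):

```
==================
Pointer and array types
==================

func f() *[]bool {
  return x;
}

---

(source_file
  (function_definition
    (identifier)
    (parameter_list)
    (pointer_type (array_type (primitive_type)))
    (block
      (return_statement (identifier)))))
```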

Unit Tests

For each rule that you add to the grammar, you should first create a test that describes how the syntax trees should look when parsing that rule. These tests are written using specially-formatted text files in a corpus directory in your parser's root folder. Here is an example of how these tests should look:

==================
Return statements
==================

func x() int {
  return 1;
}

---

(source_file
  (function_definition
    (identifier)
    (parameter_list)
    (primitive_type)
    (block
      (return_statement (number)))))

The name of the test is written between two lines containing only = characters. Then the source code is written, followed by a line containing three or more - characters. Then, the expected syntax tree is written as an S-expression. Note that the S-expression does not show syntax nodes like func, ( and ;, which are expressed as strings and regexps in the grammar. It only shows syntax nodes that have been given names.

Adjusting existing grammars to produce better trees

Imagine that you were just starting work on the Tree-sitter JavaScript parser. You might try to directly mirror the structure of the ECMAScript Language Spec. To illustrate the problem with this approach, consider the following line of code:

return x + y;

According to the specification, this line is a ReturnStatement, the fragment x + y is an AdditiveExpression, and x and y are both IdentifierReferences. The relationship between these constructs is captured by a complex series of production rules:

ReturnStatement          ->  'return' Expression
Expression               ->  AssignmentExpression
AssignmentExpression     ->  ConditionalExpression
ConditionalExpression    ->  LogicalORExpression
LogicalORExpression      ->  LogicalANDExpression
LogicalANDExpression     ->  BitwiseORExpression
BitwiseORExpression      ->  BitwiseXORExpression
BitwiseXORExpression     ->  BitwiseANDExpression
BitwiseANDExpression     ->  EqualityExpression
EqualityExpression       ->  RelationalExpression
RelationalExpression     ->  ShiftExpression
ShiftExpression          ->  AdditiveExpression
AdditiveExpression       ->  MultiplicativeExpression
MultiplicativeExpression ->  ExponentiationExpression
ExponentiationExpression ->  UnaryExpression
UnaryExpression          ->  UpdateExpression
UpdateExpression         ->  LeftHandSideExpression
LeftHandSideExpression   ->  NewExpression
NewExpression            ->  MemberExpression
MemberExpression         ->  PrimaryExpression
PrimaryExpression        ->  IdentifierReference

The language spec encodes the 20 precedence levels of JavaScript expressions using 20 different non-terminal symbols. If we were to create a concrete syntax tree representing this statement according to the language spec, it would have twenty levels of nesting and it would contain nodes with names like BitwiseXORExpression, which are unrelated to the actual code.

Precedence Annotations

Clearly, we need a different way of modeling JavaScript expressions.
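One way out, sketched below, is to collapse the expression hierarchy into a single expression rule and encode the precedence levels with tree-sitter's prec.left annotations instead of separate non-terminals. This fragment uses the imaginary C-like grammar from earlier, not the real JavaScript grammar, and the numeric levels are illustrative:

```javascript
_expression: $ => choice(
  $.binary_expression,
  $.identifier,
  $.number
),

// One rule covers all binary operators; prec.left gives each operator
// a numeric, left-associative precedence level instead of a dedicated
// non-terminal symbol per level.
binary_expression: $ => choice(
  prec.left(1, seq($._expression, '+', $._expression)),
  prec.left(1, seq($._expression, '-', $._expression)),
  prec.left(2, seq($._expression, '*', $._expression)),
  prec.left(2, seq($._expression, '/', $._expression))
),
```

With a shape like this, a statement such as return x + y; would parse to something like (return_statement (binary_expression (identifier) (identifier))): one level of nesting for the expression instead of twenty.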

...

Dealing with LR conflicts