From 03b776027597d2ab6cb292a0b689d337a2ce133c Mon Sep 17 00:00:00 2001
From: Amaan Qureshi
Date: Tue, 24 Dec 2024 21:42:46 -0500
Subject: [PATCH] docs(scanner): add overview to the `scan` function

Co-authored-by: David Baynard
---
 .../creating-parsers/4-external-scanners.md | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/docs/src/creating-parsers/4-external-scanners.md b/docs/src/creating-parsers/4-external-scanners.md
index 13e622e9..e1d9f9ad 100644
--- a/docs/src/creating-parsers/4-external-scanners.md
+++ b/docs/src/creating-parsers/4-external-scanners.md
@@ -68,7 +68,7 @@ void tree_sitter_my_language_external_scanner_destroy(void *payload) {
 
 This function should free any memory used by your scanner. It is called once when a parser is deleted or assigned a different
 language. It receives as an argument the same pointer that was returned from the _create_ function. If your _create_ function
-didn't allocate any memory, this function can be a noop.
+didn't allocate any memory, this function can be a no-op.
 
 ## Serialize
 
@@ -110,6 +110,20 @@ their values from the byte buffer.
 
 ## Scan
 
+Typically, one will
+
+- Call `lexer->advance` several times, if the characters are valid for the token being lexed.
+
+- Optionally, call `lexer->mark_end` to mark the end of the token, and "peek ahead"
+to check if the next character (or set of characters) invalidates the token.
+
+- Set `lexer->result_symbol` to the token type.
+
+- Return `true` from the scanning function, indicating that a token was successfully lexed.
+
+Tree-sitter will then push resulting node to the parse stack, and the input position will remain where it reached at the
+point `lexer->mark_end` was called.
+
 ```c
 bool tree_sitter_my_language_external_scanner_scan(
   void *payload,
@@ -120,8 +134,7 @@ bool tree_sitter_my_language_external_scanner_scan(
 }
 ```
 
-This function is responsible for recognizing external tokens. It should return `true` if a token was recognized, and `false`
-otherwise. It is called with a "lexer" struct with the following fields:
+The second parameter to this function is the lexer, of type `TSLexer`. The `TSLexer` struct has the following fields:
 
 - **`int32_t lookahead`** — The current next character in the input stream, represented as a 32-bit unicode code point.