Description
In the not-too-distant future, I'd like to bump Tree-sitter's version to 1.0, indicating a greater degree of stability and completeness. After that, I'd like to regenerate all of the parsers in the tree-sitter GitHub org, and bump them to 1.0 as well. Before doing this, there are several important problems with the framework that I think should be fixed.
Tasks
- Unicode character properties - Support ECMAScript Unicode property escapes in regexes.
  - Implement basic support for this construct (Handle simple unicode property escapes in regexes #906)
  - Regenerate all parsers to use Unicode property escapes, and fix any bugs that surface
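For readers unfamiliar with the construct, here is a minimal sketch of ECMAScript Unicode property escapes in plain JavaScript. The `isIdentifier` helper and the choice of character classes are assumptions made for illustration; which escapes Tree-sitter's own regex handling accepts is not specified here.

```javascript
// Unicode property escapes are only valid when the regex has the `u` flag.
const identifierStart = /[\p{L}_$]/u;      // any Unicode letter, `_`, or `$`
const identifierPart = /[\p{L}\p{Nd}_$]/u; // letters plus decimal digits

// Hypothetical helper (not part of Tree-sitter): checks a whole string
// against the two character classes above.
function isIdentifier(text) {
  const chars = Array.from(text); // iterate by code point, not by UTF-16 unit
  if (chars.length === 0) return false;
  return (
    identifierStart.test(chars[0]) &&
    chars.slice(1).every((c) => identifierPart.test(c))
  );
}
```

Without property escapes, the same classes have to be spelled out as long lists of explicit code point ranges, which is hard to review and easy to get wrong.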
- Partial Precedence Orderings - The integer precedence system makes some grammars shockingly difficult to maintain.
  - Enhance the precedence system to allow precedences to be expressed in a pairwise partial ordering instead of requiring a total ordering based on integers. (Allow precedences to be specified using strings and a partial ordering relation #939)
  - Update `tree-sitter-javascript` and `tree-sitter-typescript` to use this more flexible precedence scheme. Right now, the integer precedence system is making it very difficult to continue development of `tree-sitter-typescript` in particular, because of the mix of different conflicts between types and expressions.
  - Dynamic precedence should probably stay integer-only, for simplicity.
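To illustrate what a pairwise partial ordering buys over integers, the sketch below treats named precedences as a graph and compares two names by reachability. Everything in it (`buildOrdering`, `outranks`, `compare`, and the precedence names) is hypothetical and is not the API proposed in #939.

```javascript
// Each inner array lists precedence names from highest to lowest.
// Names in different arrays stay unrelated unless some name links them.
function buildOrdering(chains) {
  const lower = new Map(); // name -> Set of names directly below it
  for (const chain of chains) {
    for (let i = 0; i < chain.length; i++) {
      if (!lower.has(chain[i])) lower.set(chain[i], new Set());
      if (i + 1 < chain.length) lower.get(chain[i]).add(chain[i + 1]);
    }
  }
  return lower;
}

// True if `b` is reachable from `a`, i.e. `a` has higher precedence than `b`.
function outranks(lower, a, b) {
  const stack = [a];
  const seen = new Set();
  while (stack.length > 0) {
    const name = stack.pop();
    for (const next of lower.get(name) || []) {
      if (next === b) return true;
      if (!seen.has(next)) {
        seen.add(next);
        stack.push(next);
      }
    }
  }
  return false;
}

// Returns 1, -1, or 0; 0 means the two names are equal or unrelated.
function compare(lower, a, b) {
  if (outranks(lower, a, b)) return 1;
  if (outranks(lower, b, a)) return -1;
  return 0;
}

// Two independent groups: expression precedences and type precedences.
const ordering = buildOrdering([
  ["member", "call", "unary"],
  ["type_intersection", "type_union"],
]);
```

With integers, every expression precedence would be forced into some order relative to every type precedence. Here `compare(ordering, "call", "type_union")` is `0`, so the two groups can evolve independently.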
- Grammars with many fields, aliases - By historical accident, generated parsers use too small an integer type (`uint8_t`) for storing nodes' field and alias information. Parsers with large numbers of fields can cause integer overflows (Tree-sitter generates invalid code for grammars with large numbers of fields and/or aliases #511)
  - Start representing nodes' `production_id` as a `uint16_t` (Clean up parse table representation, use 16 bits for production_id #943)
  - Strategy - Decide whether we're going to bother to maintain backward compatibility with old generated parsers. If so, the library code will need to become a bit more complicated in order to consume both binary formats.
  - Grammars - Regenerate all the parsers with the new representation.
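The effect of the narrow integer type can be illustrated with JavaScript typed arrays, which wrap values the same way C's `uint8_t` and `uint16_t` do. This is only an illustration of the arithmetic: the `storeId` helper and the value `300` are made up, and the real generated parsers are C code.

```javascript
// Store `id` in a one-element typed array and read it back.
// Out-of-range values wrap modulo 2^bits, as with unsigned C integers.
function storeId(ArrayType, id) {
  const slot = new ArrayType(1);
  slot[0] = id;
  return slot[0];
}

const productionId = 300; // hypothetical id, larger than 255

const as8 = storeId(Uint8Array, productionId);   // 44, since 300 mod 256 = 44
const as16 = storeId(Uint16Array, productionId); // 300, fits in 16 bits
```

An id that silently becomes 44 would point at the wrong field and alias information, which is why widening the type is on this list.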
- Fix issues with the `get_column` external scanner API (Fix the behavior of Lexer.get_column #978)
- CLI Ergonomics
  - Generate Rust bindings for parsers, and structure the Node.js bindings more consistently with the Rust ones (In the generate command, create rust binding files #948)
  - In the `parse` command, auto-detect UTF-16 files and decode them accordingly. This will help Windows users who currently trip over the suggested `echo` command in the docs. (feat: add encoding flag and automatically check if a file might be utf16 #2368)
  - Support grammars defined as ECMAScript modules instead of CommonJS modules.
  - Reduce Coupling to Node - Introduce some Tree-sitter specific `GRAMMAR_PATH` setting where the CLI will search for grammar modules, instead of relying on `node_modules` and `npm`.
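As a rough sketch of what UTF-16 auto-detection could look like, the function below checks for a byte order mark and then falls back to a simple NUL-byte heuristic. This is an assumption about one reasonable approach, not the logic used in #2368; `detectEncoding` and its thresholds are invented for illustration.

```javascript
// Guess the encoding of a byte buffer (Uint8Array or Node Buffer).
// Returns "utf-16le", "utf-16be", or "utf-8" as the fallback.
function detectEncoding(bytes) {
  // A byte order mark is the most reliable signal.
  if (bytes.length >= 2) {
    if (bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
    if (bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  }

  // Heuristic: mostly-ASCII UTF-16 text has a NUL in every other byte.
  const limit = Math.min(bytes.length, 1024);
  let evenNuls = 0;
  let oddNuls = 0;
  for (let i = 0; i < limit; i++) {
    if (bytes[i] === 0) {
      if (i % 2 === 0) evenNuls++;
      else oddNuls++;
    }
  }
  const pairs = Math.floor(limit / 2);
  if (pairs > 0 && oddNuls > pairs / 2 && evenNuls === 0) return "utf-16le";
  if (pairs > 0 && evenNuls > pairs / 2 && oddNuls === 0) return "utf-16be";
  return "utf-8";
}

// "hi" encoded as UTF-16 LE with a byte order mark.
const sample = Uint8Array.of(0xff, 0xfe, 0x68, 0x00, 0x69, 0x00);
const encoding = detectEncoding(sample);
const text = new TextDecoder(encoding).decode(sample);
```

The heuristic only handles text that is mostly ASCII, so an explicit encoding flag remains useful as an override.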
- Mergeable Git Repos - Make it easier to collaborate on grammars by removing generated files from version control.
  - CLI commands - Add new `pack` and `publish` subcommands to the Tree-sitter CLI, for uploading tarballs and compiled `.wasm` files to the GitHub releases API. Store generated files as GH release artifacts instead of checking them into git repositories #730 (comment)
  - Cleanup - Remove generated files from all the grammar repos in the tree-sitter org
- Documentation
  - Document the ability to match against supertypes in queries with the `expression/identifier` syntax.
  - Add more thorough explanations of LR conflicts, precedence, and dynamic conflict-resolution with GLR.
  - Make it clear how to use Tree-sitter for basic syntax highlighting without the `tree-sitter-highlight` Rust crate (just using tree queries directly).
  - Document the `tags.scm` queries used for code navigation on GitHub. Document `queries/tags.scm` #660
  - Create a CHANGELOG file and start maintaining it. Wish: CHANGELOG #527
Stretch Goals
I'm recording these here even though they are a bit less urgent.
- Incremental Parsing Perf - Enhance the external scanner API to allow for looser state comparisons, avoiding the catastrophic node-reuse failures seen in the HTML parser (Incremental parsing is ineffective when a new tag is opened tree-sitter-html#23)
  - Figure out if the new scanner function can be made optional (with the parser generator inspecting `scanner.c` to decide whether to link against a `_compare` function).
  - Update `tree-sitter-html` to use this API, improving its incremental performance
- Native Library, WASM parsers - Add a compile-time option to link the C library against a standard WASM engine (V8, wasmtime, or wasmer). When this feature is enabled, allow the native library to load WASM parsers, marshaling the parse table into native memory, and using WASM execution only for the lexing phase. This will make it more useful to distribute parsers as pre-compiled `.wasm` files, instead of as C code. The performance cost should be small, because all of the expensive parsing operations will still be native. Add optional WASM feature to the native library, allowing it to run wasm-compiled parsers via wasmtime #1864