
Tree-sitter 1.0 Checklist #930

@maxbrunsfeld

In the not-too-distant future, I'd like to bump Tree-sitter's version to 1.0, indicating a greater degree of stability and completeness. After that, I'd like to regenerate all of the parsers in the tree-sitter GitHub org and bump them to 1.0 as well. Before doing this, there are several important problems with the framework that I think should be fixed.

Tasks

  • Unicode character properties - Support ECMAScript Unicode property escapes (such as \p{L}) in regexes.

  • Partial Precedence Orderings - The integer precedence system makes some grammars shockingly difficult to maintain.

    • Enhance the precedence system to allow precedences to be expressed as a pairwise partial ordering, instead of requiring a total ordering based on integers. (Allow precedences to be specified using strings and a partial ordering relation #939)
    • Update tree-sitter-javascript and tree-sitter-typescript to use this more flexible precedence scheme. Right now, the integer precedence system is making it very difficult to continue development of tree-sitter-typescript in particular, because of the mix of different conflicts between types and expressions.
    • Dynamic precedence should probably stay integer-only, for simplicity.
  • Grammars with many fields and aliases - By historical accident, generated parsers use too small an integer type (uint8_t) for storing nodes' field and alias information. Parsers with large numbers of fields can cause integer overflows. (Tree-sitter generates invalid code for grammars with large numbers of fields and/or aliases #511)

    • Start representing nodes' production_id as a uint16_t. (Clean up parse table representation, use 16 bits for production_id #943)
    • Strategy - Decide whether we're going to bother to maintain backward compatibility with old generated parsers. If so, the library code will need to become a bit more complicated in order to consume both binary formats.
    • Grammars - Regenerate all the parsers with the new representation.
  • Fix issues with the get_column external scanner API. (Fix the behavior of Lexer.get_column #978)

  • CLI Ergonomics

  • Mergeable Git Repos - Make it easier to collaborate on grammars by removing generated files from version control.

  • Documentation

    • Document the ability to match against supertypes in queries with the expression/identifier syntax.
    • Add more thorough explanations of LR conflicts, precedence, and dynamic conflict-resolution with GLR.
    • Make it clear how to use Tree-sitter for basic syntax highlighting without the tree-sitter-highlight Rust crate (just using tree queries directly).
    • Document the tags.scm queries used for code navigation on GitHub. (Document queries/tags.scm #660)
    • Create a CHANGELOG file and start maintaining it. (Wish: CHANGELOG #527)
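
To illustrate the partial precedence orderings task above, here is a minimal Python sketch of how string-named precedences with pairwise relations might be resolved. This is not the design from #939 or any Tree-sitter API: the function names (build_order, higher_precedence) and the precedence names ("member", "call", "binary", "type_annotation", "union_type") are hypothetical, chosen only to show the idea that some pairs are ordered and others are deliberately left unordered.

```python
from itertools import product


def build_order(pairs):
    """Return every (higher, lower) pair implied by the declared pairs.

    `pairs` is an iterable of (higher, lower) precedence names. The result
    is the transitive closure of that relation. Hypothetical helper, not
    part of Tree-sitter.
    """
    closure = set(pairs)
    names = {name for pair in closure for name in pair}
    # Warshall-style closure: the intermediate name k varies slowest.
    for k, i, j in product(names, repeat=3):
        if (i, k) in closure and (k, j) in closure:
            closure.add((i, j))
    # A name that outranks itself means the declared relations form a cycle.
    for name in names:
        if (name, name) in closure:
            raise ValueError(f"cyclic precedence involving {name!r}")
    return closure


def higher_precedence(order, a, b):
    """Return the name that wins, or None if a and b are unordered."""
    if (a, b) in order:
        return a
    if (b, a) in order:
        return b
    return None


# Expression precedences and type precedences are declared separately,
# so no relation between the two groups is ever implied.
order = build_order([
    ("member", "call"),
    ("call", "binary"),
    ("type_annotation", "union_type"),
])

print(higher_precedence(order, "binary", "member"))    # member
print(higher_precedence(order, "call", "union_type"))  # None
```

With integer precedences, every pair of rules is implicitly comparable, so adding a type-level precedence can silently change how it interacts with expression-level ones. In this sketch, an unordered pair returns None, which leaves it to the grammar author to declare the relation explicitly if a conflict between those two rules actually arises.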

Stretch Goals

I'm recording these here even though they are a bit less urgent.

  • Incremental Parsing Perf - Enhance the external scanner API to allow for looser state comparisons, avoiding the catastrophic node-reuse failures seen in the HTML parser (Incremental parsing is ineffective when a new tag is opened tree-sitter-html#23)

    • Figure out if the new scanner function can be made optional (with the parser generator inspecting scanner.c to decide whether to link against a _compare function).
    • Update tree-sitter-html to use this API, improving its incremental performance.
  • Native Library, WASM parsers - Add a compile-time option to link the C library against a standard WASM engine (V8, wasmtime, or wasmer). When this feature is enabled, allow the native library to load WASM parsers, marshaling the parse table into native memory and using WASM execution only for the lexing phase. This will make it more useful to distribute parsers as pre-compiled .wasm files instead of as C code. The performance cost should be small, because all of the expensive parsing operations will still be native. (Add optional WASM feature to the native library, allowing it to run wasm-compiled parsers via wasmtime #1864)
