Tree-sitter vs. Language Servers

▲ Tree-sitter vs. Language Servers (lambdaland.org)

103 points by ashton314 3 hours ago | 31 comments

▲ thramp 43 minutes ago | parent | next [-]

(Hi, I’m on the rust-analyzer team, but I’ve been less active for reasons that are clear in my bio.)

> Language servers are powerful because they can hook into the language’s runtime and compiler toolchain to get semantically correct answers to user queries. For example, suppose you have two versions of a pop function, one imported from a stack library, and another from a heap library. If you use a tool like the dumb-jump package in Emacs and you use it to jump to the definition for a call to pop, it might get confused as to where to go because it’s not sure what module is in scope at the point. A language server, on the other hand, should have access to this information and would not get confused.

You are correct that a language server will generally provide correct navigation/autocomplete, but a language server doesn’t necessarily need to hook into an existing compiler: a language server might be a latency-sensitive re-implementation of an existing compiler toolchain (rust-analyzer is the one I’m most familiar with, but the recent crop of new language servers tend to take this direction if the language’s compiler isn’t query-oriented).

> It is possible to use the language server for syntax highlighting. I am not aware of any particularly strong reasons why one would want to (or not want to) do this.

Since I spend a lot of time writing Rust, I’ll use Rust as an example: you can highlight a binding if it’s mutable or style an enum/struct differently. It’s one of those small things that makes a big impact once you get used to it: editors without semantic syntax highlighting (as it is called in the LSP specification) feel like they’re naked to me.
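
For readers wondering what the server actually sends for this: in the LSP specification, semantic tokens arrive as a flat integer array with five numbers per token (line delta, start-character delta, length, token type index, modifier bit set). Below is a minimal decoding sketch. The legend lists are assumptions made up for illustration; a real server announces its own legend, and rust-analyzer's differs from this one.

```python
from dataclasses import dataclass

# Hypothetical legend for illustration only; a real server sends its own
# token type and modifier lists when it advertises its capabilities.
TOKEN_TYPES = ["variable", "struct", "enum", "function"]
TOKEN_MODIFIERS = ["declaration", "mutable"]


@dataclass
class Token:
    line: int
    start: int
    length: int
    type: str
    modifiers: list


def decode_semantic_tokens(data):
    """Turn the relative 5-integer encoding into absolute positions."""
    tokens = []
    line = 0
    start = 0
    for i in range(0, len(data), 5):
        delta_line, delta_start, length, type_idx, mod_bits = data[i:i + 5]
        line += delta_line
        # The start delta is relative to the previous token only when both
        # tokens sit on the same line; otherwise it is an absolute column.
        start = start + delta_start if delta_line == 0 else delta_start
        modifiers = [
            name
            for bit, name in enumerate(TOKEN_MODIFIERS)
            if mod_bits & (1 << bit)
        ]
        tokens.append(Token(line, start, length, TOKEN_TYPES[type_idx], modifiers))
    return tokens


# Two one-character variable tokens; only the first carries the
# "declaration" and "mutable" modifiers (bits 0 and 1, so the value 3).
tokens = decode_semantic_tokens([0, 8, 1, 0, 3, 1, 4, 1, 0, 0])
```

An editor theme can then style any token whose modifiers include "mutable" differently, which is the effect described above.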

	▲	ashton314 19 minutes ago | parent [-]
		> you can highlight a binding if it’s mutable or style an enum/struct differently
		Wow! That is an incredibly good reason. Thank you very much for telling me something I didn’t know. :)

▲ KlayLay 2 hours ago | parent | prev | next [-]

Side note, but thanks for the note about not using AI to write your articles. I'm tired of looking for information online, finding an article that may answer it, and not being sure about the author's integrity (this is so rampant on Medium).

	▲	mediaman 19 minutes ago | parent | next [-]
		Yes - I've been thinking about why this is. I'm guessing part of it is that writing forces us to think. I often find when I write something that I haven't thought it out fully, and articulating it makes me see a logical failure in my thinking, and gives me the ability to work that out. So when we just have AI write it, it means we've avoided the thinking part, and so the written article will be much less useful to the reader because there's no actual distillation of thought. Using voice-to-article is a little better, and I do find that talking out a thought helps me see its problems, but writing it seems to do better. There's also the problem that while it's easy to detect AI writing, it's hard to tell the difference between someone who thought it out by talking and had AI write it versus someone who did little thinking and still had AI write it. So as soon as you smell the whiff of AI writing, the reasonable expectation is that there's less distillation of thought.

▲ williamcotton 5 minutes ago | parent | prev | next [-]

Tree-sitter does incremental parsing, which spares the language server from having to re-parse an entire file after every edit.
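
To make "incremental" concrete: before re-parsing, the parser is told which byte range changed so it can reuse the untouched parts of the old tree. The sketch below only computes such an edit descriptor from two versions of a buffer. The field names mirror the byte fields of tree-sitter's edit struct, but the function itself is an illustrative assumption, not part of any tree-sitter API, and the row/column fields are left out.

```python
def compute_edit(old: bytes, new: bytes) -> dict:
    """Describe the smallest single replaced span between two buffers."""
    limit = min(len(old), len(new))

    # Length of the common prefix.
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1

    # Length of the common suffix, never overlapping the prefix.
    suffix = 0
    while (
        suffix < limit - prefix
        and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]
    ):
        suffix += 1

    return {
        "start_byte": prefix,
        "old_end_byte": len(old) - suffix,
        "new_end_byte": len(new) - suffix,
    }


# Inserting "mut " after "let " touches 0 old bytes and adds 4 new ones.
edit = compute_edit(b"let x = 1;", b"let mut x = 1;")
```

In practice the editor already knows the edit range from the keystroke, so no diffing is needed; the point is just how little information the parser requires to skip the unchanged regions.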

▲ Fiveplus an hour ago | parent | prev | next [-]

> It is possible to use the language server for syntax highlighting. I am not aware of any particularly strong reasons why one would want to (or not want to) do this.

Hmm, the strong reason could be latency and layout stability. Tree-sitter parses on the main thread (or a close worker) typically in sub-ms timeframes, ensuring that syntax coloring is synchronous with keystrokes. LSP semantic tokens are asynchronous by design. If you rely solely on LSP for highlighting, you introduce a flash of unstyled content or color-shifting artifacts every time you type, because the round-trip to the server (even a local one) and the subsequent re-tokenization takes longer than the frame budget.

The ideal setup could be something like this: tree-sitter provides the high-speed lexical coloring (keywords, punctuation, basic structure) instantly, and LSP paints the semantic modifiers (interfaces vs classes, mutable vs const) asynchronously, roughly 200ms later. Relying on LSP for the base layer makes the editor feel sluggish.
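
A rough sketch of that layering, assuming highlights are plain (start, end, style) spans over one line of text. The span format and style names are made up for illustration and do not match any particular editor's API.

```python
def paint(length, base_spans, semantic_spans=()):
    """Return one style per character; semantic spans override base spans."""
    styles = [None] * length
    # Base layer first (the fast, syntactic pass), then the overlay.
    for layer in (base_spans, semantic_spans):
        for start, end, style in layer:
            for i in range(start, min(end, length)):
                styles[i] = style
    return styles


def to_runs(styles):
    """Compress per-character styles into (start, end, style) runs."""
    runs = []
    for i, style in enumerate(styles):
        if runs and runs[-1][2] == style:
            runs[-1] = (runs[-1][0], i + 1, style)
        else:
            runs.append((i, i + 1, style))
    return runs


text = "let mut x"
base = [(0, 3, "keyword"), (4, 7, "keyword"), (8, 9, "variable")]

# First frame: only the synchronous syntactic layer is available.
first_frame = to_runs(paint(len(text), base))

# Later: the asynchronous semantic tokens arrive and refine the result.
semantic = [(8, 9, "variable.mutable")]
later_frame = to_runs(paint(len(text), base, semantic))
```

Because the base layer is always painted, a late or missing semantic response only costs refinement and never leaves text unstyled.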

	▲	mickeyp an hour ago | parent [-]
		That's generally how it works in most editors that support both. Tree-sitter has okay error recovery, and that, along with its speed (as you mentioned) and its flexible query language, makes it a winner both for quickly iterating on a working parser and, obviously, for integrating into an actual editor. Oh, and some LSPs use tree-sitter to parse.

▲ tetris11 2 hours ago | parent | prev | next [-]

I love tree-sitter+eglot, but a few of the languages/schemes I work in simply don't have parsers:

    > pacman -Ssq tree-sitter
    tree-sitter
    tree-sitter-bash
    tree-sitter-c
    tree-sitter-cli
    tree-sitter-javascript
    tree-sitter-lua
    tree-sitter-markdown
    tree-sitter-python
    tree-sitter-query
    tree-sitter-rust
    tree-sitter-vim
    tree-sitter-vimdoc

Where are R, YAML, Golang, and several others?

▲ jasonjmcghee 2 hours ago | parent | next [-]

Most of them are in the language pack (https://github.com/Goldziher/tree-sitter-language-pack)

For others, this is a suboptimal answer, but I’ve played with generating grammars with the latest LLMs, and they are surprisingly good at doing this (in a few shots).

That being said, if you’re doing something more serious than syntax highlighting or shipping it in a product, you’ll want to spend more time on it.

▲ johanvts 2 hours ago | parent | prev | next [-]

Go is here: https://github.com/tree-sitter/tree-sitter-go Try Google; the others are probably out there as well.

▲ zokier 2 hours ago | parent | prev | next [-]

https://tree-sitter.github.io/tree-sitter/#parsers

https://github.com/tree-sitter/tree-sitter/wiki/List-of-pars...

▲ matthew-craig 2 hours ago | parent | prev | next [-]

In my emacs configuration, I have the following parsers installed:

awk bash bibtex blueprint c c-sharp clojure cmake commonlisp cpp css dart dockerfile elixir glsl gleam go gomod heex html janet java javascript json julia kotlin latex lua magik make markdown nix nu org perl proto python r ruby rust scala sql surface toml tsx typescript typst verilog vhdl vue wast wat wgsl yaml

▲ codethief 2 hours ago | parent | prev | next [-]

Uhh… The fact that there's no Arch Linux package for a given language doesn't imply there's no tree-sitter support (official or 3rd-party) for that language? See e.g. the very long list of languages on https://github.com/Goldziher/tree-sitter-language-pack , which does include R, YAML, Golang, and many more.

▲ woodruffw 2 hours ago | parent | prev | next [-]

tree-sitter-yaml definitely exists[1]. Presumably nobody has packaged it for Arch yet; that seems like a thing you could contribute.

[1]: https://github.com/tree-sitter-grammars/tree-sitter-yaml

	▲	_ache_ 2 hours ago | parent [-]
		It's in the AUR (aur/tree-sitter-yaml), a community-driven repository of Arch Linux packages. Not yet official. Since it comes from `tree-sitter-grammars/tree-sitter-yaml`, it may be quick to integrate into the official repo.

▲ taeric 2 hours ago | parent | prev [-]

Odd, yaml-ts-mode exists? Did they change how it gets its parser?

▲ FjordWarden 2 hours ago | parent | prev | next [-]

This is like the difference between an orange and fruit juice. You can squeeze an orange to extract its juices, but that is not the only thing you can do with it, nor is it the only way to make fruit juice.

I use tree-sitter for developing a custom programming language. You still need an extra step to get from CST to AST, but the overall DevEx is much quicker than hand-rolling the parser.

▲ danielvaughn 2 hours ago | parent | next [-]

Every time I get to sing Tree-sitter's praises, I take the opportunity to. I love it so much. I've tried a bunch of parser generators, and the TS approach is so simple and so good that I'll probably never use anything else. The iteration speed lets me get into a zen-like state where I just think about syntax design, and I don't sweat the technical bits.

▲ lioeters an hour ago | parent | prev | next [-]

> extra step to get from CST to AST

Could you elaborate on what this involves? I'm also looking at using tree-sitter as a parser for a new language, possibly to support multiple syntaxes. I'm thinking of converting its parse trees to a common schema, that's the target language.

I guess I don't quite get the difference between a concrete and abstract syntax tree. Is it just that the former includes information that's irrelevant to the semantics of the language, like whitespace?

	▲	FjordWarden an hour ago | parent | next [-]
		TS returns a tree with nodes, and you walk the nodes with a visitor pattern. I've experimented with using tree-sitter queries for this, but so far I have not found that to be easier. Every syntax will have its own CST, but it can target a general AST if you will. In the end they can both be represented as s-expressions, but you need rules to go from one flavour of syntax tree to the other. AST is just CST minus range info and with simplified/generalised lexical info (in most cases).
	▲	direwolf20 an hour ago | parent | prev [-]
		That's correct.
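
A minimal sketch of the CST-to-AST step described above, assuming a hand-rolled `CstNode` stand-in rather than real tree-sitter nodes. The node kinds and the punctuation set are hypothetical and depend entirely on the grammar.

```python
from dataclasses import dataclass, field


@dataclass
class CstNode:
    """Stand-in for a concrete syntax node: a kind, source text, a byte range."""
    kind: str
    text: str = ""
    start_byte: int = 0
    end_byte: int = 0
    children: list = field(default_factory=list)


# Hypothetical token kinds that carry no meaning once the tree shape is known.
PUNCTUATION = {"(", ")", ",", ";"}


def to_ast(node):
    """Visit the CST recursively, dropping range info and pure punctuation."""
    if node.kind in PUNCTUATION:
        return None
    if not node.children:
        return (node.kind, node.text)
    kids = [ast for ast in (to_ast(child) for child in node.children) if ast is not None]
    # Parentheses only affect grouping, which the tree shape already encodes.
    if node.kind == "parenthesized_expression" and len(kids) == 1:
        return kids[0]
    return (node.kind, *kids)


# CST for the source text "(1 + 2)".
cst = CstNode("parenthesized_expression", children=[
    CstNode("("),
    CstNode("binary_expression", children=[
        CstNode("number", "1"),
        CstNode("operator", "+"),
        CstNode("number", "2"),
    ]),
    CstNode(")"),
])
ast = to_ast(cst)
```

With several surface syntaxes, each would get its own set of rules like these while all of them emit the same tuple shapes.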

▲ mattnewport an hour ago | parent | prev | next [-]

Yeah, you can even use tree-sitter to implement a language server. I've done this for a custom scripting language we use at work.

▲ lowbloodsugar an hour ago | parent | prev [-]

N00b question: Language parsers give me concrete information, like “com.foo.bar.Baz is defined here”. Does tree-sitter do that, or does it say “this file has a symbol declaration for Baz” and, elsewhere for that file, “there is a package statement for ‘com.foo.bar’”, and then I have to figure that out?

	▲	FjordWarden an hour ago | parent [-]
		You have to figure this out for yourself in most cases. Tree-sitter does have a query language based on s-expressions, but it is more for questions like "give me all the nodes that are literals", and then you can, for example, render those with a single draw call. Tree-sitter has incremental parsing, and queries can be restricted to a certain byte range.
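
As a sketch of the "figure it out yourself" part: given the flat facts a syntax-only pass can report for one file, combining them into a qualified name is ordinary bookkeeping. The (kind, name, start_byte) tuples and the node kind strings below are assumptions for illustration, not the output format of any tree-sitter binding.

```python
def qualify(nodes):
    """Map fully qualified names to the byte offset of their declaration."""
    package = None
    definitions = {}
    for kind, name, start_byte in nodes:
        if kind == "package_declaration":
            # Remember the package so later declarations can be qualified.
            package = name
        elif kind == "class_declaration":
            qualified = f"{package}.{name}" if package else name
            definitions[qualified] = start_byte
    return definitions


# What a syntactic pass might report for a single source file.
file_nodes = [
    ("package_declaration", "com.foo.bar", 0),
    ("class_declaration", "Baz", 22),
]
index = qualify(file_nodes)
```

Resolving a reference to `Baz` from another file additionally needs imports and scoping rules, which is the semantic work a language server adds on top.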

▲ jbreckmckye 29 minutes ago | parent | prev | next [-]

I'm doing a project with tree-sitter right now.

Any tips for keeping the grammar sizes under control? I'm distributing a CLI tool that needs to support several languages, and I can see the grammars gradually bloating the binary size.

I could build some clever thing where language packs are opt-in and distributed as WASM, maybe. But that could be complex.

▲ mellery451 an hour ago | parent | prev | next [-]

One topic not mentioned is creating refactoring tools. My sense is that LSPs generally have the advantage here because they have the full parsed tree, but I suspect it would be possible to build simple syntactic refactorings in TS with the potential to be both faster and less sensitive to broken syntax.

▲ briaoeuidhtns 2 hours ago | parent | prev | next [-]

I think the big reason to put syntax highlighting in the language server is that you have more info. For example, you can highlight symbols imported from a different file in one color for integers and a different one for functions.

	▲	LoganDark an hour ago | parent [-]
		You can enrich highlighting using information from the language server, can't you? I think JetBrains does this.

▲ mickeyp 2 hours ago | parent | prev | next [-]

Tree-sitter is great. It powers Combobulate in Emacs. Structured editing and movement would not have been easily done without it.

▲ vivzkestrel an hour ago | parent | prev [-]

- As someone who is not at all familiar with how code editors work and has to make a browser-based code editor, what are the things that you think I should know?

- I got a hint of language servers and tree-sitter thanks to this wonderfully written post, but it is still missing a lot of details, like what the protocol actually looks like and what a standard language server or tree-sitter implementation looks like.

- What are the other building blocks?
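
On what the protocol looks like: LSP messages are JSON-RPC 2.0 payloads, each preceded by a `Content-Length` header and a blank line. Below is a minimal framing sketch; the file URI and cursor position are made-up example values, and a real client would also perform the `initialize` handshake before sending a request like this.

```python
import json


def encode_message(payload: dict) -> bytes:
    """Frame a JSON-RPC payload the way LSP expects it on the wire."""
    body = json.dumps(payload).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def decode_message(raw: bytes) -> dict:
    """Parse one framed message back into a dict."""
    header, _, body = raw.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(body[:length].decode("utf-8"))


# A go-to-definition request; the URI and position are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/definition",
    "params": {
        "textDocument": {"uri": "file:///tmp/example.rs"},
        "position": {"line": 3, "character": 8},
    },
}
wire = encode_message(request)
roundtrip = decode_message(wire)
```

For a browser-based editor the same JSON payloads are commonly carried over a WebSocket or a web worker channel instead of stdio, but the message shapes stay the same.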