gritzko 16 hours ago

At this point, the question is: why keep files as blobs in the first place. If a revision control system stores AST trees instead, all the work is AST-level. One can then run SQL-level queries to see what is changing where, for example:

  - do any concurrent branches touch this function?
  - what new uses did this function accrete recently?
  - did we create any actual merge conflicts?
Almost LSP-level querying, involving versions and branches. Beagle is a revision control system like that [1].
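A toy sketch of what such queries could look like if entity-level changes were rows in a database. The schema and names here are invented for illustration; they are not Beagle's actual model.

```python
import sqlite3

# Hypothetical schema: an AST-native VCS could expose entity-level
# change events as rows instead of line diffs.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE changes (branch TEXT, entity TEXT, kind TEXT);
    INSERT INTO changes VALUES
        ('feature/auth', 'parse_token', 'modified'),
        ('feature/perf', 'parse_token', 'modified'),
        ('feature/perf', 'cache_get',   'added');
""")

# "do any concurrent branches touch this function?"
rows = db.execute(
    "SELECT branch FROM changes WHERE entity = ?", ("parse_token",)
).fetchall()
print(sorted(b for (b,) in rows))  # both branches touch it
```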

It is at quite an early stage, but the surprising finding is that instead of being a repository of source code blobs, an SCM can be the hub of all activities. Beagle's architecture is extremely open, on the assumption that a lot of things can be built on top of it. Essentially, it is a key-value db: keys are URIs and values are BASON (binary mergeable JSON) [2]. Can't be more open than that.

[1]: https://github.com/gritzko/librdx/tree/master/be

[2]: https://github.com/gritzko/librdx/blob/master/be/STORE.md

rs545837 16 hours ago | parent | next [-]

This is the right question. Storing ASTs directly would make all of this native instead of layered on top.

The pragmatic reason weave works at the git layer: adoption. Getting people to switch merge drivers is hard enough; getting them to switch VCS is nearly impossible. So weave parses the three file versions on the fly during merge, extracts entities, resolves per-entity, and writes back a normal file that git stores as a blob. You get entity-level merging without anyone changing their workflow.
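The per-entity resolution step can be sketched roughly like this: each side wins where it alone changed an entity, and only entities changed differently on both sides conflict. This is a minimal sketch of the approach as described, not weave's actual code.

```python
def merge_entities(base, ours, theirs):
    """Three-way merge at entity granularity (name -> source text)."""
    merged, conflicts = {}, []
    for name in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:                 # same on both sides (or both deleted)
            if o is not None:
                merged[name] = o
        elif o == b:               # only theirs changed
            if t is not None:
                merged[name] = t
        elif t == b:               # only ours changed
            if o is not None:
                merged[name] = o
        else:                      # both changed, differently: conflict
            conflicts.append(name)
    return merged, conflicts

base   = {"f": "def f(): return 1", "g": "def g(): pass"}
ours   = {"f": "def f(): return 2", "g": "def g(): pass"}
theirs = {"f": "def f(): return 1", "g": "def g(): return 0"}
merged, conflicts = merge_entities(base, ours, theirs)
print(merged, conflicts)  # each side's lone change wins, no conflicts
```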

But you're pointing at the ceiling of that approach. A VCS that stores ASTs natively could answer "did any concurrent branches touch this function?" as a query, not as a computation. That's a fundamentally different capability. Beagle looks interesting, will dig into the BASON format.

We built something adjacent with sem (https://github.com/ataraxy-labs/sem), which extracts the entity dependency graph from git history. It can answer "what new uses did this function accrete" and "what's the blast radius of this change", but it's still a layer on top of git, not native storage.
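"Blast radius" here is essentially a reverse reachability query over the entity graph. A small sketch, with a made-up call graph (not sem's internal representation):

```python
from collections import deque

# Toy entity dependency graph: caller -> callees.
calls = {
    "handler":     ["validate", "save"],
    "validate":    ["parse_token"],
    "save":        ["parse_token"],
    "parse_token": [],
}

def blast_radius(entity):
    """All entities that transitively depend on `entity` (reverse BFS)."""
    reverse = {}
    for caller, callees in calls.items():
        for callee in callees:
            reverse.setdefault(callee, []).append(caller)
    seen, queue = set(), deque([entity])
    while queue:
        for dep in reverse.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(blast_radius("parse_token")))  # everything above it in the graph
```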

yvdriess 10 hours ago | parent | next [-]

Everything that was old will become new again. Content/structural version control used to be a research field. Pharo still uses one, afaik: https://scg.unibe.ch/archive/papers/Nier13bMonticello.pdf

rs545837 3 hours ago | parent [-]

Yeah, I am actually super excited for these new kinds of innovations.

These things are genuinely not letting me sleep these days.

evrimoztamur 10 hours ago | parent | prev | next [-]

I would love it if sem was hooked up into a PR review UI...

rs545837 10 hours ago | parent | next [-]

That's exactly what inspect does (https://github.com/ataraxy-labs/inspect). It uses sem's entity graph to triage PRs: classifies every change, scores risk based on blast radius and dependents, and groups related changes together.
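In spirit, the triage step reduces to weighting the kind of change and scaling by how many entities depend on it. A hypothetical scoring sketch, not inspect's real formula:

```python
# Illustrative weights: riskier kinds of change score higher.
KIND_WEIGHT = {"signature_change": 3.0, "body_change": 1.0, "comment_only": 0.1}

def risk_score(kind, dependents):
    """Weight the change kind, scale by blast radius (dependent count)."""
    return KIND_WEIGHT.get(kind, 1.0) * (1 + dependents)

changes = [
    ("parse_token", "signature_change", 12),
    ("render",      "body_change",       2),
    ("README",      "comment_only",      0),
]
ranked = sorted(changes, key=lambda c: -risk_score(c[1], c[2]))
print([name for name, _, _ in ranked])  # riskiest change first
```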

We are still evaluating it on diverse benchmarks and pursuing it as a research direction. Also working on a diff viewer in Rust.

Palanikannan 10 hours ago | parent | prev [-]

You mean, something like this?

https://ataraxy-labs.github.io/quiver/

https://x.com/Palanikannan_M/status/2022190215021126004

tomaskafka 11 hours ago | parent | prev [-]

This sounds like an AI generated self promo slop made for reddit, did you accidentally paste it into a wrong website?

rs545837 11 hours ago | parent | next [-]

No it was meant to be here, I don't like reddit as much as I love HN.

conartist6 11 hours ago | parent | prev [-]

If you don't understand what's being talked about, maybe best to stay quiet

zokier 10 hours ago | parent | prev | next [-]

> At this point, the question is: why keep files as blobs in the first place. If a revision control system stores AST trees instead, all the work is AST-level.

The problem is that disks (and storage in general) store only bytes so you inherently need to deal with bytes at some point. You could view source code files as the serialization of the AST (or other parse tree).

This is especially apparent with LISPs and their sexprs, but equally applies to other languages too.
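Python's stdlib makes the point concrete: the source file is one serialization of the AST, and parse/unparse are just the deserialize/serialize pair.

```python
import ast

# A source file is a serialization of the AST: parse deserializes it,
# unparse serializes the tree back to text.
src = "def square(x):\n    return x * x\n"
tree = ast.parse(src)        # text -> tree
out = ast.unparse(tree)      # tree -> text again
print(out)
```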

pfdietz 14 hours ago | parent | prev | next [-]

Well, if you're programming in C or C++, there may not be a parse tree. Tree-sitter makes a best-effort attempt to parse, but it can't in general due to the preprocessor.

rs545837 14 hours ago | parent [-]

Great point. C/C++ with macros and preprocessor directives is where tree-sitter's error recovery gets stretched. We support both C and C++ in sem-core (https://github.com/Ataraxy-Labs/sem), but the entity extraction is best-effort for heavily macro'd code. For most application-level C++ it works well, but something like the Linux kernel would be rough. Honestly, that's an argument for gritzko's AST-native storage approach, where the parser can be more tightly integrated.

pfdietz 9 hours ago | parent [-]

It's an argument against preprocessors for programming languages.

Tree-sitter's error handling is constrained by its intended use in editors, where incrementality and efficiency are important. For diffing/merging, a more elaborate parsing algorithm might be better: for example, one that uses an Earley/CYK-like algorithm but attempts to minimize some error term (which a dynamic programming algorithm could be naturally extended to).
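The error-minimizing idea can be sketched as a CYK variant where the chart stores the fewest token substitutions needed for each nonterminal to derive each span. A toy grammar in CNF, purely illustrative; a real implementation would need insertions/deletions and a practical grammar.

```python
def min_error_cyk(tokens, unary, binary, start):
    """chart[i][j][A] = fewest token substitutions so that A derives
    tokens[i:j]. Error-minimizing CYK sketch, not a production parser."""
    INF = float("inf")
    n = len(tokens)
    chart = [[dict() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        for A, term in unary:                  # rules A -> 'term'
            cost = 0 if term == tok else 1     # substitution error
            if cost < chart[i][i + 1].get(A, INF):
                chart[i][i + 1][A] = cost
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for A, (B, C) in binary:       # rules A -> B C
                    cost = chart[i][k].get(B, INF) + chart[k][j].get(C, INF)
                    if cost < chart[i][j].get(A, INF):
                        chart[i][j][A] = cost
    return chart[0][n].get(start)

# Toy grammar in CNF: S -> A B, A -> 'a', B -> 'b'
unary = [("A", "a"), ("B", "b")]
binary = [("S", ("A", "B"))]
print(min_error_cyk(list("ab"), unary, binary, "S"))  # parses cleanly: 0
print(min_error_cyk(list("xb"), unary, binary, "S"))  # one bad token: 1
```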

rs545837 4 hours ago | parent [-]

Interesting idea. Tree-sitter's trade-off (speed and incrementality over completeness) makes sense for editors, but you're right that for merge/diff a more thorough parser could be worth the cost, since it's a cold path, not real-time. We only parse three file versions at merge time, so spending an extra 50ms on a better parse would be fine. Worth exploring, thanks for the pointer.

samuelstros 13 hours ago | parent | prev | next [-]

How do you get blob file writes fast?

I built lix [0], which stores ASTs instead of blobs.

Direct AST writing works for apps that are “AST-aware”. And I can confirm, it works great.

But all software just writes bytes atm.

The bytes -> parse -> diff pipeline is too slow.

The parse and diff steps need to get out of the hot path. That semi-defeats the idea of a VCS that stores ASTs, though.
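One common way to pull parsing out of the hot path is content-addressed memoization: since most blobs are unchanged between versions, identical content parses only once. A generic sketch (the `parse` stand-in is made up, not lix's API):

```python
import hashlib

_parse_cache = {}

def parse(blob: bytes):
    """Stand-in for an expensive parse step."""
    return blob.split()

def parse_cached(blob: bytes):
    # Key by content hash: in a VCS, most files are byte-identical
    # across versions, so repeat parses become cache hits.
    key = hashlib.sha1(blob).hexdigest()
    if key not in _parse_cache:
        _parse_cache[key] = parse(blob)
    return _parse_cache[key]

a = parse_cached(b"def f(): pass")
b = parse_cached(b"def f(): pass")   # cache hit: same object returned
print(a is b)
```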

[0] https://github.com/opral/lix

gritzko 12 hours ago | parent | next [-]

I only diff the changed files. Producing a blob out of a BASON AST is trivial (one scan). Things may get slow for larger files; e.g. the tree-sitter C++ parser is a 25MB C file, 750KLoC. It takes a couple of seconds to import. But it never changes, so no biggie.

There is room for improvement, but that is not a show-stopper so far. I plan to round-trip the Linux kernel with full history; that should expose all the bottlenecks.
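The "one scan" claim is just a single pre-order traversal that concatenates leaf tokens back into source. A minimal sketch with an invented node shape (not BASON's actual format):

```python
# Hypothetical node shape: (kind, value, children). Leaves carry the
# original token text, so one pre-order pass reproduces the blob.
def emit(node, out):
    kind, value, children = node
    if kind == "token":
        out.append(value)
    else:
        for child in children:
            emit(child, out)
    return out

tree = ("call", None, [
    ("token", "print", []),
    ("token", "(", []),
    ("token", "'hi'", []),
    ("token", ")", []),
])
print("".join(emit(tree, [])))
```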

P.S. I checked lix. It uses a SQL database. That solves some things, but also creates an impedance mismatch. Must be a 10x slowdown at least. I use key-value and a custom binary format, so it works nicely. I can go one level deeper still and use a custom storage engine; it would be even faster. Git is all custom.

rs545837 2 hours ago | parent [-]

Good framing. Source code is already a serialization of an AST; we just forgot that and started treating it as text. The practical problem is adoption: every tool in the ecosystem reads bytes.

rs545837 12 hours ago | parent | prev [-]

This is exactly why weave stays on top of git instead of replacing storage. Parsing three file versions at merge time is fine (about 5-67ms in our tests). Parsing on every read/write would be a different story. I know about Lix, but will check it out again.

orbisvicis 10 hours ago | parent | prev | next [-]

That's a really good point. I'm not familiar with Unison, but I think that's the idea behind the language?

https://www.unison-lang.org/

rs545837 10 hours ago | parent [-]

This is actually cool, gonna check it out.

philipallstar 8 hours ago | parent | prev | next [-]

You might need a bit more than ASTs, as you need code to be human-readable as well as machine-readable. Maybe CSTs?

rs545837 3 hours ago | parent [-]

CSTs are the right call for round-tripping, but isn't that essentially what tree-sitter gives you: a concrete syntax tree that preserves whitespace, comments, and formatting?
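The AST-vs-CST gap is easy to demonstrate with Python's stdlib: an abstract tree silently drops comments on round-trip, which is exactly what a lossless CST has to avoid.

```python
import ast

# Python's `ast` is abstract: comments and formatting vanish on
# round-trip, which is why round-tripping tools want a CST instead.
src = "x = 1  # important note\n"
out = ast.unparse(ast.parse(src))
print(out)            # the comment is gone
```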

jerf 5 hours ago | parent | prev | next [-]

Everything on a disk ends up as a linear sequence of bytes. This is the source of the term "serialization", which I think is easy to hear as a magic word without realizing that it is actually telling you something important in its etymology: It is the process of taking an arbitrary data structure and turning it into something that can be sent or stored serially, that is, in an order, one bit at a time if you really get down to it. To turn something into a file, to send something over a socket, to read something off a sheet of paper to someone else, it has to be serialized.

The process of taking such a linear stream and reconstructing the arbitrary data structure used to generate it (or, in more sophisticated cases, something related to it if not identical), is deserialization. You can't send anyone a cyclic graph directly but you can send them something they can deserialize into a cyclic graph if you arrange the serialization/deserialization protocol correctly. They may deserialize it into a raw string in some programming language so they can run regexes over it. They may deserialize it into a stream of tokens. This all happens from the same source of serialized data.

So let's say we have an AST in memory. As complicated as your language likes, however recursive, however cross-"module", however bizarre it may be. But you want to store it on a disk or send it somewhere else. In that case it must be serialized and then deserialized.

What determines what the final user ends up with is not the serialization protocol. What determines what the final user ends up with is the deserialization procedure they use. They may, for instance, drop everything except some declaration of what a "package" is if they're just doing some initial scan. They may deserialize it into a compiler's AST. They may deserialize it into tree sitter's AST. They may deserialize it into some other proprietary AST used by a proprietary static code analyzer with objects designed to not just represent the code but also be immediately useful in complicated flow analyses that no other user of the data is interested in using.
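A tiny Python illustration of that point: one serialized stream, two different deserializers, each producing the structure its consumer wants.

```python
import ast
import re

# The same serialized stream (source text) consumed two ways:
src = "def add(a, b):\n    return a + b\n"

# consumer 1: raw-string view, regex over the text
as_text = re.findall(r"def (\w+)", src)

# consumer 2: AST view of the very same bytes
as_tree = [n.name for n in ast.walk(ast.parse(src))
           if isinstance(n, ast.FunctionDef)]

print(as_text, as_tree)  # both recover the function name
```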

The point of this seemingly rambling description of what serialization is is that

"why keep files as blobs in the first place. If a revision control system stores AST trees instead"

doesn't correspond to anything actionable or real. Structured text files are already your programming language's code stored as ASTs. The corresponding deserialization format involves "parsing" them, which is a perfectly sensible and very, very common deserialization method. For example, the HTML you are reading was deserialized into the browser's data structures, which are substantially richer than "just" an AST of HTML due to all the stuff a browser does with the HTML, with a very complicated parsing algorithm defined by the HTML standard. The textual representation may be slightly suboptimal for some purposes but they're pretty good at others (e.g., lots of regexes run against code over the years). If you want some other data structure in the consumer, the change has to happen in the code that consumes the serialized stream. There is no way to change the code as it is stored on disk to make it "more" or "less" AST-ish than it already is, and always has been.

You can see that in the article under discussion. You don't have to change the source code, which is to say, the serialized representation of code on the disk, to get this new feature. You just have to change the deserializer, in this case, to use tree sitter to parse instead of deserializing into "an array of lines which are themselves just strings except maybe we ignore whitespace for some purposes".

Once you see the source code as already being an AST, it is easy to see that there are multiple ways you could store it that could conceivably be optimized for other uses... but nothing you do to the serialization format is going to change what is possible at all, only adjust the speed at which it can be done. There is no "more AST-ish" representation that will make this tree sitter code any easier to write. What is on the disk is already maximally "AST-ish" as it is today. There isn't any "AST-ish"-ness being left on the table. The problem was always the consumers, not the representation.

And as far as I can tell, it isn't generally the raw deserialization speed nowadays that is the problem with source code. Optimizing the format for any other purpose would break the simple ability to read it as source code, which is valuable in its own right. But then, nothing stops you from representing source code in some other way right now if you want... but that doesn't open up possibilities that were previously impossible, it just tweaks how quickly some things will run.

handfuloflight 14 hours ago | parent | prev [-]

Well, I'll be diving in. Thank you for sharing. Same for Weave.

rs545837 14 hours ago | parent [-]

Awesome, let me know how it goes. Happy to help if you hit any rough edges.