| ▲ | gritzko 16 hours ago | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
At this point, the question is: why keep files as blobs in the first place. If a revision control system stores AST trees instead, all the work is AST-level. One can run SQL-level queries then to see what is changing where. Like
Almost LSP-level querying, involving versions and branches.
Beagle is a revision control system like that [1]It is quite early stage, but the surprising finding is: instead of being a depository of source code blobs, an SCM can be the hub of all activities. Beagle's architecture is extremely open in the assumption that a lot of things can be built on top of it. Essentially, it is a key-value db, keys are URIs and values are BASON (binary mergeable JSON) [2] Can't be more open than that. [1]: https://github.com/gritzko/librdx/tree/master/be [2]: https://github.com/gritzko/librdx/blob/master/be/STORE.md | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | rs545837 16 hours ago | parent | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
This is the right question. Storing ASTs directly would make all of this native instead of layered on top. The pragmatic reason weave works at the git layer: adoption. Getting people to switch merge drivers is hard enough, getting them to switch VCS is nearly impossible. So weave parses the three file versions on the fly during merge, extracts entities, resolves per-entity, and writes back a normal file that git stores as a blob. You get entity-level merging without anyone changing their workflow. But you're pointing at the ceiling of that approach. A VCS that stores ASTs natively could answer "did any concurrent branches touch this function?" as a query, not as a computation. That's a fundamentally different capability. Beagle looks interesting, will dig into the BASON format. We built something adjacent with sem (https://github.com/ataraxy-labs/sem) which extracts the entity dependency graph from git history. It can answer "what new uses did this function accrete" and "what's the blast radius of this change" but it's still a layer on top of git, not native storage. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | zokier 10 hours ago | parent | prev | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
> At this point, the question is: why keep files as blobs in the first place. If a revision control system stores AST trees instead, all the work is AST-level. The problem is that disks (and storage in general) store only bytes so you inherently need to deal with bytes at some point. You could view source code files as the serialization of the AST (or other parse tree). This is especially apparent with LISPs and their sexprs, but equally applies to other languages too. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | pfdietz 14 hours ago | parent | prev | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Well, if you're programming in C or C++, there may not be a parse tree. Tree-sitter makes a best effort attempt to parse but it can't in general due to the preprocessor. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | samuelstros 13 hours ago | parent | prev | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
How do you get blob file writes fast? I built lix [0] which stores AST’s instead of blobs. Direct AST writing works for apps that are “ast aware”. And I can confirm, it works great. But, the all software just writes bytes atm. The binary -> parse -> diff is too slow. The parse and diff step need to get out of the hot path. That semi defeats the idea of a VCS that stores ASTs though. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | orbisvicis 10 hours ago | parent | prev | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
That's a really good point. I'm not familiar with Unison, but I think that's the idea behind the language? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | philipallstar 8 hours ago | parent | prev | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
You might need a bit more than ASTs, as you need code to be human-readable as well as machine-readable. Maybe CSTs? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | jerf 5 hours ago | parent | prev | next [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Everything on a disk ends up as a linear sequence of bytes. This is the source of the term "serialization", which I think is easy to hear as a magic word without realizing that it is actually telling you something important in its etymology: It is the process of taking an arbitrary data structure and turning it into something that can be sent or stored serially, that is, in an order, one bit at a time if you really get down to it. To turn something into a file, to send something over a socket, to read something off a sheet of paper to someone else, it has to be serialized. The process of taking such a linear stream and reconstructing the arbitrary data structure used to generate it (or, in more sophisticated cases, something related to it if not identical), is deserialization. You can't send anyone a cyclic graph directly but you can send them something they can deserialize into a cyclic graph if you arrange the serialization/deserialization protocol correctly. They may deserialize it into a raw string in some programming language so they can run regexes over it. They may deserialize it into a stream of tokens. This all happens from the same source of serialized data. So let's say we have an AST in memory. As complicated as your language likes, however recursive, however cross-"module", however bizarre it may be. But you want to store it on a disk or send it somewhere else. In that case it must be serialized and then deserialized. What determines what the final user ends up with is not the serialization protocol. What determines what the final user ends up with is the deserialization procedure they use. They may, for instance, drop everything except some declaration of what a "package" is if they're just doing some initial scan. They may deserialize it into a compiler's AST. They may deserialize it into tree sitter's AST. They may deserialize it into some other proprietary AST used by a proprietary static code analyzer with objects designed to not just represent the code but also be immediately useful in complicated flow analyses that no other user of the data is interested in using. The point of this seemingly rambling description of what serialization is is that "why keep files as blobs in the first place. If a revision control system stores AST trees instead" doesn't correspond to anything actionable or real. Structured text files are already your programming language's code stored as ASTs. The corresponding deserialization format involves "parsing" them, which is a perfectly sensible and very, very common deserialization method. For example, the HTML you are reading was deserialized into the browser's data structures, which are substantially richer than "just" an AST of HTML due to all the stuff a browser does with the HTML, with a very complicated parsing algorithm defined by the HTML standard. The textual representation may be slightly suboptimal for some purposes but they're pretty good at others (e.g., lots of regexes run against code over the years). If you want some other data structure in the consumer, the change has to happen in the code that consumes the serialized stream. There is no way to change the code as it is stored on disk to make it "more" or "less" AST-ish than it already is, and always has been. You can see that in the article under discussion. You don't have to change the source code, which is to say, the serialized representation of code on the disk, to get this new feature. You just have to change the deserializer, in this case, to use tree sitter to parse instead of deserializing into "an array of lines which are themselves just strings except maybe we ignore whitespace for some purposes". Once you see the source code as already being an AST, it is easy to see that there are multiple ways you could store it that could conceivably be optimized for other uses... but nothing you do to the serialization format is going to change what is possible at all, only adjust the speed at which it can be done. There is no "more AST-ish" representation that will make this tree sitter code any easier to write. What is on the disk is already maximally "AST-ish" as it is today. There isn't any "AST-ish"-ness being left on the table. The problem was always the consumers, not the representation. And as far as I can tell, it isn't generally the raw deserialization speed nowadays that is the problem with source code. Optimizing the format for any other purpose would break the simple ability to read it is as source code, which is valuable in its own right. But then, nothing stops you from representing source code in some other way right now if you want... but that doesn't open up possibilities that were previously impossible, it just tweak how quickly some things will run. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| ▲ | handfuloflight 14 hours ago | parent | prev [-] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Well, I'll be diving in. Thank you for sharing. Same for Weave. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||