jerf 3 hours ago
Everything on a disk ends up as a linear sequence of bytes. This is the source of the term "serialization", which I think is easy to hear as a magic word without realizing that its etymology is actually telling you something important: it is the process of taking an arbitrary data structure and turning it into something that can be sent or stored serially, that is, in an order, one bit at a time if you really get down to it. To turn something into a file, to send something over a socket, to read something off a sheet of paper to someone else, it has to be serialized.

The process of taking such a linear stream and reconstructing the arbitrary data structure used to generate it (or, in more sophisticated cases, something related to it if not identical) is deserialization. You can't send anyone a cyclic graph directly, but you can send them something they can deserialize into a cyclic graph if you arrange the serialization/deserialization protocol correctly. They may deserialize it into a raw string in some programming language so they can run regexes over it. They may deserialize it into a stream of tokens. This all happens from the same serialized data.

So let's say we have an AST in memory. As complicated as you like, however recursive, however cross-"module", however bizarre it may be. But you want to store it on a disk or send it somewhere else. In that case it must be serialized and then deserialized. What the final user ends up with is determined not by the serialization protocol but by the deserialization procedure they use. They may, for instance, drop everything except some declaration of what a "package" is if they're just doing an initial scan. They may deserialize it into a compiler's AST. They may deserialize it into tree sitter's AST.
They may deserialize it into some other proprietary AST used by a proprietary static code analyzer, with objects designed not just to represent the code but also to be immediately useful in complicated flow analyses that no other consumer of the data is interested in.

The point of this seemingly rambling description of serialization is that "why keep files as blobs in the first place. If a revision control system stores AST trees instead" doesn't correspond to anything actionable or real. Structured text files already are your programming language's code stored as ASTs. The corresponding deserialization procedure is "parsing", which is a perfectly sensible and very, very common deserialization method. For example, the HTML you are reading was deserialized into the browser's data structures, which are substantially richer than "just" an AST of HTML due to all the stuff a browser does with HTML, via a very complicated parsing algorithm defined by the HTML standard.

The textual representation may be slightly suboptimal for some purposes, but it's pretty good at others (e.g., lots of regexes have been run against code over the years). If you want some other data structure in the consumer, the change has to happen in the code that consumes the serialized stream. There is no way to change the code as it is stored on disk to make it "more" or "less" AST-ish than it already is, and always has been.

You can see that in the article under discussion. You don't have to change the source code, which is to say, the serialized representation of code on the disk, to get this new feature. You just have to change the deserializer, in this case, to use tree sitter to parse instead of deserializing into "an array of lines which are themselves just strings except maybe we ignore whitespace for some purposes".
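The "same bytes, different consumers" point can be made concrete. Here is a small sketch in Python (the original comment names no language; this is purely illustrative): one serialized artifact, a source string, deserialized three different ways by three different consumers, using only the standard library.

```python
# One serialized artifact (source text on disk), three deserializations.
# Only the consumer's procedure differs; the bytes are identical.
import ast
import io
import re
import tokenize

source = "def add(a, b):\n    return a + b\n"

# 1. Deserialize as a raw string and run a regex over it.
names = re.findall(r"def (\w+)", source)  # -> ['add']

# 2. Deserialize into a stream of tokens.
tokens = [
    t.string
    for t in tokenize.generate_tokens(io.StringIO(source).readline)
    if t.type == tokenize.NAME
]

# 3. Deserialize into a full AST and walk it.
tree = ast.parse(source)
func_names = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
```

All three consumers recover "add", but what each one ends up holding (a match list, a token list, an AST node) is decided entirely by the deserializer, not by the file.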
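The earlier cyclic-graph point can also be sketched. This is a minimal illustration under assumed names (`Node`, `to_wire`, `from_wire` are invented here, not any real API): a cycle can't be written out by chasing references forever, but a protocol that replaces references with indices lets the receiver rebuild the cycle.

```python
# Hypothetical sketch: serialize a cyclic graph by replacing direct
# references with node indices, so the receiver can re-link the cycle.
import json

class Node:
    def __init__(self, name):
        self.name = name
        self.edges = []  # direct references; may form cycles

def to_wire(nodes):
    # Serialize: refer to neighbors by position, not by reference.
    index = {id(n): i for i, n in enumerate(nodes)}
    return json.dumps([
        {"name": n.name, "edges": [index[id(e)] for e in n.edges]}
        for n in nodes
    ])

def from_wire(data):
    # Deserialize: build all nodes first, then restore the references.
    records = json.loads(data)
    nodes = [Node(r["name"]) for r in records]
    for node, r in zip(nodes, records):
        node.edges = [nodes[i] for i in r["edges"]]
    return nodes

a, b = Node("a"), Node("b")
a.edges.append(b)
b.edges.append(a)  # cycle: a -> b -> a
restored = from_wire(to_wire([a, b]))
assert restored[0].edges[0].edges[0] is restored[0]  # cycle survives
```

The wire format is a flat, acyclic JSON string; the cycle exists only before serialization and after deserialization, which is the whole point.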
Once you see the source code as already being an AST, it is easy to see that there are multiple ways you could store it that could conceivably be optimized for other uses... but nothing you do to the serialization format is going to change what is possible at all, only adjust the speed at which it can be done. There is no "more AST-ish" representation that will make this tree sitter code any easier to write. What is on the disk is already maximally "AST-ish" as it is today. There isn't any "AST-ish"-ness being left on the table. The problem was always the consumers, not the representation.

And as far as I can tell, raw deserialization speed isn't generally the problem with source code nowadays. Optimizing the format for any other purpose would break the simple ability to read it as source code, which is valuable in its own right. But then, nothing stops you from representing source code in some other way right now if you want... it just doesn't open up possibilities that were previously impossible, it only tweaks how quickly some things will run.