| ▲ | kelseyfrog 4 days ago |
| The tradeoff here is not being able to use a universal set of tooling to interact with source files. Anything but text makes grep, diff, sed, and version control less effective. You end up locked into specialized tools, formats, or IDE extensions, while the Unix philosophy thrives on composability with plain text. There's a scissor that cuts through the formatting debate: If initial space width was configurable in their editor of choice, would those who prefer tabs have any other arguments? |
|
| ▲ | gr__or 4 days ago | parent | next [-] |
| Text surely is a hill, but I believe it's a local one that we got stuck on due to our short-sighted inability to go down into a valley for a few miles until we find the (projectional) mountain. All of your examples work better for code with structural knowledge:
- grep: symbol search (I use it about 100x as often as a text grep) or https://github.com/ast-grep/ast-grep
- diff: https://semanticdiff.com (and others), i.e. hide noisy syntax-only changes and attempt to capture moved code. I say attempt, because with projectional programming we could have a more expressive notion of code being moved
- sed: https://npmjs.com/package/@codemod/cli
- version control: I'd look towards languages like Unison to see what funky things we could do here, especially for libraries. A general example: no conflicts due to non-semantic changes (re-orderings, irrelevant whitespace, etc.) |
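The symbol-search point above can be sketched with Python's stdlib `ast` module (toy code; the function names are made up). A plain text grep for "fetch_user" hits the definition, the call site, and the unrelated "fetch_users" prefix alike, while a structural search can tell definitions and direct calls apart exactly:

```python
import ast

source = """
def fetch_user(uid):
    return db.get(uid)

def fetch_users():
    return [fetch_user(u) for u in db.all()]
"""

tree = ast.parse(source)

# Names that are *defined* as functions.
defs = [n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]

# Names that are *called* directly (skipping attribute calls like db.get).
calls = [n.func.id for n in ast.walk(tree)
         if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)]

print(defs)   # both definitions
print(calls)  # only the one direct call site
```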
| |
| ▲ | zokier 4 days ago | parent | next [-] | | But as the tools you link demonstrate, having "text" as the on-disk format does not preclude AST-based (or even smarter) tools. So there is little benefit in having a non-text format. Ultimately it's all just bytes on disk. | | |
| ▲ | gr__or 3 days ago | parent | next [-] | | Even that is not without its cost. Most of these tools are written in different languages, which all have to maintain their own parsers, which have to keep up with language changes. And there are abilities we lose completely by making text the source of truth, like a reliable version control for "this function moved to a new file". | | |
| ▲ | theamk 3 days ago | parent [-] | | At least the parsers are optional now - you can still grep, diff, etc. even if your tools have no idea about the language's semantics. But if you store ASTs, you _have_ to have support for each language in each of the tools (because each language has its own AST). This basically means a major chicken-and-egg problem - a new language won't be compatible with any of the tools, so adoption will be very low until the editor, diff, sed etc. are all updated... and those tools won't be updated until the language is popular. And you still don't get any advantages over text! For example, if you really cared about "this function moved to a new file" functionality, you could have a unique id after each function ("def myfunc{f8fa2bdd}...") and insert/hide them in your editor. This way the IDE can show a nice definition, while grep/git etc. still work, just with extra noise. In fact, I bet that any technology that people claim requires non-readable AST files can be implemented as text, for many extra upsides and no major downsides (with the obvious exception of truly graphical things - naive diffs on auto-generated images, graphs or schematic files are not going to be very useful, no matter what kind of text format is used).
Want each person to see their own formatting style? Reformat to the person's style on load and format back to the project style on save. Modern formatters are so fast, people won't even notice. Want fast semantic search? Maintain binary cache files, but use text as the source of truth. Want better diff output? Same deal, parse and cache. Want to have no files, but instead a function list where you edit each one directly, a la Smalltalk? Maintain the files transparently with text code - maybe one file per function, or one file per class, or one per project... The reason people keep source code as text is that it's really a global maximum.
The non-text format gives you a modest speedup, but at the expense of imposing incredible version-compatibility pain. | | |
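The unique-id idea above ("def myfunc{f8fa2bdd}...") can be sketched in a few lines of Python (the ids and function names here are made up): stable tags live in the stored text, and an editor strips them on display, while grep and git keep working on the tagged file.

```python
import re

# Text as stored on disk: each function carries a stable 8-hex-digit id.
tagged = """\
def parse_header{f8fa2bdd}(data):
    return data.split(':', 1)

def parse_body{09c1d4ee}(data):
    return data.strip()
"""

TAG = re.compile(r"\{[0-9a-f]{8}\}")

displayed = TAG.sub("", tagged)   # what the editor shows the user
ids = TAG.findall(tagged)         # stable identities tools can track
                                  # (a move to another file keeps the id)
print(displayed)
print(ids)
```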
| ▲ | gr__or 3 days ago | parent [-] | | The complexity of a parser is orders of magnitude higher than that of an AST schema. I'm also not saying we can have all these good things, but they are not free, and the costs are more spread out and thus less obviously noticeable than the ones projectional code imposes. | | |
| ▲ | theamk 3 days ago | parent [-] | | Are you talking about runtime complexity or programming-time complexity? If runtime, then I bet almost no one will notice, especially if appropriate caching is used. If programming-time - sure, but it's not like you can avoid parsers altogether. If the parsers are not in the tools, they must be in the IDE.
Factor out that parsing logic and make it a library all the tools can use (or a one-shot LSP server if you are in a language with hard-to-use bindings). Note that even with the AST-in-file approach, you _still_ need a library to read and write that AST; it's not like you can have a shared AST schema for multiple languages. So either way, tools like diff will need a wide variety of libraries linked in, one for each language they support. And at that point, there is not much difference between an AST reader and a code parser. | | |
| ▲ | gr__or 3 days ago | parent [-] | | I meant programming-time, but runtime is also a good point. Cross-language libraries don't seem to be super common for this: the recovering-sense-from-text tools I named all use different parsers in their respective languages. Again, reading (and yes, technically that's also parsing) an AST from a data-exchange-formatted file is orders of magnitude simpler. And for parsing those schemas there are battle-tested cross-language solutions, e.g. protobuf. |
|
|
|
| |
| ▲ | rafaelmn 3 days ago | parent | prev [-] | | Why even have a database - let's just keep the data in CSVs, we can grep it easily, it's all bytes on a disk. |
| |
| ▲ | gorgoiler 3 days ago | parent | prev | next [-] | | I feel it's important to stick up for the difference between text and code. The two overlap a lot, but not all text is code, even if most code is text. It's a really subtle difference, but I can't quite put my finger on why it is important. I think of all the little text files I've made over the decades that record information in various different ways, where the only real syntax they share is that they use short lines (80 columns) and use line orientation for semantics (a lah-dee-dah way of saying lots of lists!). I have a lot of experience of being firmly ensconced in software engineering environments where the only resources being authored and edited were source code files. But I've also had a lot of experience of the kind of admin / project / clerical work where you make up files as you go along. Teaching in a high school was a great place to practice that kind of thing. | |
| ▲ | kelseyfrog 3 days ago | parent | prev | next [-] | | Thank you for your response. Conveniently, we can use an existing example - Clang's pch files. Could you walk me through using grep, diff, sed, and git on pch? I'd really appreciate it. | |
| ▲ | jrochkind1 3 days ago | parent | prev | next [-] | | So there was an era, as the OP says, where your arguments were popular and believed, and it was understood that things would move in this direction. And yet it didn't happen; it reversed. I think the fact that "plain text for all source files" actually won in the actual ecosystem wasn't just because too many developers had the wrong idea or were short-sighted -- in fact most influential people wanted and believed in what you say. It's because there are real factors that make the level of investment required for the other paths unsustainable, at least compared to the text source path. It's definitely related to the "victory" of unix and unix-style OSs, which is often understood as the victory of a philosophy of doing it cheaper, easier, simpler, faster, "good enough". It's also got to do with how often languages and platforms change -- both change within a language/platform and languages/platforms rising and falling. Sometimes I wish this were less quick; I'm definitely a guy who wants to develop real expertise with a system by using it over a long time, and I think you can work much more effectively and productively once you have. But the actual speed of change of platforms and languages we see depends on reduced cost of tooling. | | |
| ▲ | gr__or 3 days ago | parent [-] | | For me, that's what "short-sighted inability" means. The business ecosystem we have does not have the attention span for this kind of project. What we need is individuals grouping together against the gradient of incentives (which is hard indeed). |
| |
| ▲ | Tooster 4 days ago | parent | prev [-] | | I'd also add:
* [Difftastic](https://difftastic.wilfred.me.uk/) — my go-to diff tool for years
* [Nu shell](https://www.nushell.sh/) — a promising idea, but still lacking in design/implementation maturity

What I'd really like to see is a *viable projectional editor* and a broader shift from text-centric to data-centric tools. The issue is that nearly everything we use today (editors, IDEs, coreutils) is built around text, and there's no agreed-upon data interchange format. There have been attempts (Unison, JetBrains MPS, Nu shell), but none have gained real traction. Rare "miracles" like the C++ → Rust migration show paradigm shifts can happen. But a text → projectional transition would be even bigger. For that to succeed, someone influential would need to offer a *clear, opt-in migration path* where:
* some people stick with text-based tools,
* others move to semantic model editing,
* and both can interoperate in the same codebase.

What would be needed:
* Robust, data-native alternatives to [coreutils](https://wiki.archlinux.org/title/Core_utilities) operating directly on structured data (avoiding serialize ↔ parse boundaries). Learn from Nushell's mistakes, and aim for future-compatible, stable, battle-tested tools.
* A more declarative-first mindset.
* Strong theoretical foundations for the new paradigm.
* Seamless conversion between text-based and semantic models.
* New tools that work with mainstream languages (not niche reinventions) and enforce correctness at construction time (no invalid programs).
* Integration of the semantic model with existing version control systems.
* Shared standards for semantic models across languages/tools (something on the scale of MCP or LSP — JetBrains' protocols are better, but LSP won thanks to Microsoft's push).
* Dual compatibility in existing editors/IDEs (e.g. VSCode supporting both text files and semantic models).
* Integration of knowledge across many different projects to distill the best way forward → for example, learn from Roslyn's semantic vs. syntax model, look into tree-sitter, check how Difftastic does tree diffing, find tree regex engines, learn from S-expressions and Lisp-like languages, check Unison, adopt the Helix/Vim editing model, see how it can be integrated with LSP and MCP, etc.

This isn't something you can brute-force — it needs careful planning and design before implementation. The train started on text rails and won't stop, so the only way forward is to *build an alternative track* and make switching both gradual and worthwhile. Unfortunately that is pretty much impossible for an entity without enough influence. | | |
|
|
| ▲ | jsharpe 4 days ago | parent | prev | next [-] |
| Exactly. This idea comes up time and time again, but the cost/benefit just doesn't make sense at all. You're adding an unbelievable amount of complex tooling just to avoid running a simple formatter. The goal of having every developer viewing the code with their own preferences just isn't that important. On every team I've been on, we just use a standard style guide, enforced by formatter, and while not everyone agrees with every rule, it just doesn't matter. You get used to it. Arguing and obsessing about code formatting is simply useless bikeshedding. |
| |
| ▲ | scubbo 4 days ago | parent | next [-] | | I disagree with almost every choice made by the Go language designers, but `Gofmt's style is no one's favorite, yet gofmt is everyone's favorite` is solid. Pick a not-unreasonable standard, enforce it, and move on to more important things. | | |
| ▲ | spyspy 4 days ago | parent [-] | | My only complaint about gofmt is that it’s not even stricter about some things. | | |
| |
| ▲ | rbits 4 days ago | parent | prev | next [-] | | Yeah it would probably be a waste of time. It's a nice idea to dream about though. It would be nice to be able to look at some C# code and not have opening curly brackets on a separate line. | | | |
| ▲ | Buttons840 4 days ago | parent | prev | next [-] | | > Arguing and obsessing about code formatting is simply useless bikeshedding. Unless it's an accessibility issue, and it is an accessibility issue sometimes. | | | |
| ▲ | raspasov 4 days ago | parent | prev [-] | | >> The goal of having every developer viewing the code with their own preferences just isn't that important. Bah! So what is more important? The average convenience of the herd? The average of convenience, if there ever was such a thing. What if you really liked reading books in paper format, but were forced to read them on displays for... reasons? |
|
|
| ▲ | accelbred 4 days ago | parent | prev | next [-] |
| What if the common intermediate encoding is text, not binary?
Then grep/diff/sed all still work. If we had a formatting tool that operated solely on the AST, checked-in code could be in a canonical form for a given AST. Editors could then parse the AST and display the source with a different formatting of the user's choice, and convert back to canonical form when writing the file to disk. |
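A minimal sketch of this canonical-form idea for Python source, using the stdlib `ast` module as the "formatting tool that operates solely on the AST" (the snippets are toy code): two different formattings of the same program collapse to one identical canonical text.

```python
import ast

# Two formattings of the same program...
a = "x=[1,2,3]\nif x :\n    print( x )\n"
b = "x = [ 1, 2, 3 ]\nif x:\n  print(x)\n"

def canonical(src: str) -> str:
    # Parse to an AST, then pretty-print it in one fixed style.
    return ast.unparse(ast.parse(src)) + "\n"

# ...collapse to an identical canonical text, so version control
# only ever sees structural changes.
assert canonical(a) == canonical(b)
print(canonical(a))
```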
| |
| ▲ | sublinear 4 days ago | parent | next [-] | | Nobody wants to have to run their own formatter rules in reverse in their head just to know what to grep for. That defeats the point of formatting at all. | | |
| ▲ | pwdisswordfishz 4 days ago | parent | next [-] | | That's why you grep for a syntactic structure, not undifferentiated text. | | |
| ▲ | michaelmrose 4 days ago | parent [-] | | Which grep doesn't do, so you'd need either a new, different tool or more likely several, for little real benefit | |
| ▲ | hnlmorg 4 days ago | parent | next [-] | | grep is half a century old now. If we can’t progress our ecosystem because we are reliant on one very specific 50+ year old line parser, then that says more about the inflexibility of the industry to move forward than it does about the “new” ideas being presented. | | |
| ▲ | account42 4 days ago | parent | next [-] | | We still use grep because its useful. And it's useful precisely because it doesn't depend on syntax so will work on anything text based. | | |
| ▲ | hnlmorg 3 days ago | parent [-] | | grep is great. My point isn’t that we shouldn’t use it. My point is that we shouldn’t be held back by it. |
| |
| ▲ | komali2 4 days ago | parent | prev [-] | | The things being described are all way beyond non-trivial to solve, and they'd need to be solved for every language. Grep works great. | |
| ▲ | hnlmorg 3 days ago | parent [-] | | > The things all being described are way beyond non trivial to solve, and they'd need to be solved for every language. Except it already is a solved problem. If languages compile to a common byte code then you just need one tool. You already see examples of this with things like the IR assembly produced by LLVM, various Microsoft languages that compile to CLR, and the different languages that target JVM. There are also already common ways to create reusable parsing rules like LSP for IDEs and treesitter. In fact there are already grep-like utilities that are based on treesitter. So it’s not only very possible to create language agnostic, reusable, tools; but these tools already exist and being used by a great many developers. The problem raised in the article is that we just don’t push these concepts hard enough these days. Instead relying on outdated concepts of what source code should look like. > Grep works great For LF-separated lists it does. But if it worked great for structured content then we wouldn’t be having this conversation to begin with. |
|
| |
| ▲ | jitl 4 days ago | parent | prev [-] | | comby is fantastic, give it a shot. It’s saved me huge amounts of time. |
|
| |
| ▲ | theamk 3 days ago | parent | prev [-] | | You'd need all-new tools for the non-text world as well. So the real choice is either:
- new tool: grep with a caching reverse-formatter filter, or
- new tool: ast-grep with an understanding of the AST serialization format for your specific language.
At least in the first case, you still have a fallback. |
| |
| ▲ | pmontra 4 days ago | parent | prev [-] | | All mainstream editors agreeing to work on a standard AST for any given language would be nice. I'm not expecting that to happen at any time in the future. As for grep and diff working on a textual representation of the AST: it would be like grepping JavaScript source code when the actual source code is TypeScript or some more distant language that compiles to JavaScript (does anybody remember CoffeeScript?). We want to see only the source code we typed in. By the way, add git diff to the list of tools that should work on the AST but show us the real source code. |
|
|
| ▲ | rendaw 4 days ago | parent | prev | next [-] |
Grep, diff, sed, and line-based non-semantic merge are all terrible tools for manipulating code... rather than dig ourselves in even further with those, maybe a reason to come up with something better would be good. |
|
| ▲ | Avshalom 4 days ago | parent | prev | next [-] |
The entire OS was built around these source files. The unix philosophy, on the other hand, only "thrives" if every other tool is designed around (and contains code to parse) "plain text". |
| |
| ▲ | lmm 4 days ago | parent [-] | | > The entire OS was built around these source files. And how did that work out for them? This seems like one of the many cases where unix won out by being a lowest common denominator. Every platform can handle plain text. | | |
| ▲ | account42 4 days ago | parent | next [-] | | Not all platforms come with powerful text handling tools out of the box - or at least they didn't used to until Unix-based systems forced them to catch up. | |
| ▲ | aleph_minus_one 4 days ago | parent | prev [-] | | > This seems like one of the many cases where unix won out by being a lowest common denominator. The lowest common denominator rather is binary blobs. :-) | | |
| ▲ | thfuran 3 days ago | parent [-] | | The conversion of which to text and back has historically proven rather fraught. |
|
|
|
|
| ▲ | MyOutfitIsVague 4 days ago | parent | prev | next [-] |
| The way I envision this working is with something like git filters. Checking out from version control converts it all into text in your preferred formatting, which you then work with as expected. Staging it converts it into the stored representation. In git, this would be done with smudge and clean filters, like how git LFS works. You'd also have viewers for forges and the like that are built to interpret all the stored representations as needed. You still work with text, the text just isn't the canonical stored representation. You get diffs to resolve only when structure is changed. You get most of the same benefit with a pre-commit linter hook, though. |
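A rough sketch of what the "clean" half of such a filter could look like for Python files, reusing the stdlib `ast` module as the canonicalizer (the filter name and git wiring in the comment are hypothetical, and a real setup would want a mature formatter rather than `ast.unparse`):

```python
import ast

# Hypothetical wiring, shown for context only:
#   git config filter.canon.clean 'python3 canon_clean.py'
#   echo '*.py filter=canon' >> .gitattributes
# A matching "smudge" filter would reformat to the developer's
# preferred style on checkout.

def clean(src: str) -> str:
    """Reduce any formatting of valid Python to one canonical text."""
    try:
        return ast.unparse(ast.parse(src)) + "\n"
    except SyntaxError:
        # Let partial / broken code pass through unchanged, so work
        # in progress can still be staged.
        return src

print(clean("x   =  {'a':1}\n"))
```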
| |
| ▲ | zokier 4 days ago | parent | next [-] | | The problem is that there is little benefit in not having the canonical stored representation be text. The crucial thing is to have some canonical representation but it might as well be human readable. | |
| ▲ | bapak 4 days ago | parent | prev | next [-] | | This is it; unfortunately git is "too dumb" for this. In order to merge code, it would have to understand the AST. What happens when you stage the line `} else return {`? git doesn't allow you to stage specific AST nodes. It would also mean that you can't stage partial code (that produces syntax errors). | | |
| ▲ | zokier 4 days ago | parent | next [-] | | Git can use arbitrary merge (and diff) tools. Something like https://mergiraf.org/introduction.html works with git and gets you AST-aware merging. Do not underestimate git's flexibility. | |
| ▲ | Hendrikto 4 days ago | parent | prev [-] | | Smudge and clean filters work on text, git would not need to change at all. You would still store text, and still check out text, just transformed text. You could still check in anything you want, including partial code, syntax errors, or any other arbitrary text. Diffs would work the same way they do now. |
| |
| ▲ | account42 4 days ago | parent | prev [-] | | Please no, git trying to automatically "correct" \n vs \r\n line endings is already horrible enough. At least you can turn that off. |
|
|
| ▲ | danielheath 4 days ago | parent | prev | next [-] |
If you're going to store the source in a canonical format and unpack that to suit each developer… why shouldn't the canonical format just be regular source code? All the same tools can exist with a text backend, and you get grep/sed support for free too! |
| |
| ▲ | psychoslave 4 days ago | parent | next [-] | | That seems like a genius remark, actually. If you store the abstract objects and have the mechanism to transform them into whatever the desired output form is, it's almost trivial to expose a version as files and text rendering for tools that are thus oriented, isn't it? | |
| ▲ | danielheath 3 days ago | parent [-] | | Originally my father's idea, from back in the 90s: create a language with a whole suite of syntactic representations to suit your preferences. Want it to look like C? Lisp? Pascal? Why not! | |
| |
| ▲ | giveita 4 days ago | parent | prev [-] | | My grep may not work on your settings for the same code. This becomes an issue with say CI where maybe I add a gate to check something with grep. But whose format do I assume? My local (that I used to test it locally) or the canonical (which means I need to switch local format to test it)? | | |
| ▲ | brabel 4 days ago | parent | next [-] | | You really rely on grep in CI? How fragile is that?! This is a good argument for storing non-text. Grepping code is laughably unreliable. The only way to write things like that reliably is to actually parse the code and work on its AST. Working on text is like writing code in a completely untyped language. It can be done, but it's beyond stupid for anything where accuracy matters. |
| ▲ | treadmill 4 days ago | parent | prev [-] | | You're misunderstanding the idea I think. You would use the format on disk for the grep. "Your format" only exists displayed in your editor. | | |
|
|
|
| ▲ | eviks 4 days ago | parent | prev | next [-] |
> If initial space width was configurable in their editor of choice, would those who prefer tabs have any other arguments? Yes, of course, because tab width is *dynamically* flexible, so initial space width isn't enough |
| |
| ▲ | pasc1878 4 days ago | parent [-] | | Yes, because if you want to deindent with tabs it is just deleting one character, whilst spaces require you to delete x characters, where x is the number of spaces you indent by. | |
| ▲ | eviks 4 days ago | parent [-] | | For a "clean" fixed-width unambiguous indent (e.g., at the beginning of lines) you can make delete also remove X=indent_width spaces. But for "dirty-width" indents, e.g., after some text that can vary in size (proportional fonts, or some special chars even in fixed fonts), you can't align with spaces, while a tab's width can be auto-adjusted to match the other line |
|
|
|
| ▲ | charcircuit 4 days ago | parent | prev | next [-] |
| In practice how many tools do you really need to handle the custom format? Probably single digits and they could all use a common library to handle the formatting aspect of things. |
|
| ▲ | aleph_minus_one 4 days ago | parent | prev | next [-] |
| > Anything but text makes grep, diff, sed, and version control less effective. Perhaps this is rather a design mistake in how UNIX handles things and is so focused on text. |
|
| ▲ | bee_rider 4 days ago | parent | prev | next [-] |
Is it possible to convert from the DIANA IR back to something that looks like source code? Then the result of the backward conversion could be grepped, etc… |
| |
| ▲ | teo_zero 4 days ago | parent [-] | | From TFA: > Everyone had their own pretty-printing settings for viewing [DIANA] however they wanted. | | |
| ▲ | bee_rider 4 days ago | parent [-] | | > Back when he was working on Ada, they didn't store text sources at all — they used an IR called DIANA. Everyone had their own pretty-printing settings for viewing it however they wanted. I'm still confused, because they specifically call the IR DIANA, and they talk about viewing the IR. It isn't clear to me whether the IR is more like a bytecode, or more like the original source code with a little processing done to it. They also have a quote: > Grady Booch summarizes it well: R1000 was effectively a DIANA machine. We didn't store source code: source code was simply a pretty-printing of the DIANA tree. So maybe the other visualizations they could do by transforming the IR were so nice that nobody even cared to look at the original Ada they'd written to generate it? | |
| ▲ | brabel 4 days ago | parent [-] | | I imagine it's like storing JVM bytecode, i.e. class files instead of Java files. So when you open it up, the editor decompiles it, like IntelliJ does if you try to open a class file, but then it also applies your own style, like from .editorconfig, to the code it shows. It's a really good idea and I can't believe people here are complaining that it's bad because they can't use grep! But that's a good thing!! Who the hell is grepping code as if code had no structure and that's the best you can do? Do you also grep JSON instead of using jq? Just don't! |
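The grep-vs-jq point above can be shown in a few lines of Python, with the stdlib `json` module standing in for jq (toy data): substring search can't tell a key from a value from an array element, while a structural query addresses exactly one node.

```python
import json

doc = '{"grep": {"name": "grep", "tags": ["grep", "text"]}}'

# Text search: key, value, and array element all match alike.
text_hits = doc.count("grep")

# Structural query (the jq analogy): address the exact field.
name = json.loads(doc)["grep"]["name"]

print(text_hits, name)
```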
|
|
|
|
| ▲ | cowsandmilk 4 days ago | parent | prev | next [-] |
How is diff less effective? I see the diff in the formatting I prefer. With sed, I can project the source into the formatting most convenient for whatever I'm trying to do with sed. And I have no idea what you're on about with version control. It ruins sending around patch files that rely on line numbers, but most places don't do that any more. What I would be curious about is tracing from errors back to the source code. Nearly every language I've used prints the line number and the offset within the line for an error. How that worked in the DIANA world would be interesting to learn. |
| |
| ▲ | sublinear 4 days ago | parent [-] | | You'd have to run diff and sed before the formatter which is harder for everyone. | | |
|
|
| ▲ | Ygg2 3 days ago | parent | prev | next [-] |
> would those who prefer tabs have any other arguments? Yes. Because YAML exists. And mixing tabs and spaces in it is horrible. And the rules are very finicky. Optimal tab usage is to emit 2-4 spaces. |
|
| ▲ | froh 4 days ago | parent | prev [-] |
yes, contemporary editors and tools like tree-sitter have decided this debate in favor of the plain text file representation, exactly for the reasons you give: universal accessibility by general-purpose tools. xslt was a DIANA-like pre-parsed representation of dsssl. oh how I miss dsssl (a scheme-based sgml transformation language), but no. dsssl was a lisp! with hygienic macros! "yikes", they went and invented XSLT. the "logic" escapes me to this day. no. plain text it is. human-readable. and grep/sed/diff-able. |