| ▲ | LLMs as Compilers (resync-games.com) |
| 34 points by kadhirvelm a day ago | 57 comments |
| |
|
| ▲ | daxfohl a day ago | parent | next [-] |
| Nah, I think that's the opposite of what to do. That requires you to specify all requirements up front, then press go and pray. Even if it worked perfectly, it takes us back to the stone ages of waterfall design. With LLMs, missing one requirement that would be obvious to a human (don't randomly delete accounts) often leads to a fun shortcut from the LLM perspective (hey if there's a race condition then I can fix it by deleting the account)! The real value of LLMs is their conversational ability. Try something, iterate, try something else, iterate again, have it patch a bug you see, ask if it has recommendations based on where you are headed, flesh things out and fine tune them real time. Understand its misunderstandings and help it grasp the bigger picture. Then at the end of the session, you'll have working code AND a detailed requirements document as an output. The doc will discuss the alternatives you tried along the way, and why you ended up where you did. It's much like this in graphics too. Yeah you could spend a ton of time coming up with the single one-shot prompt that gives you something reasonably close to what you need, which is how it worked in the past. But now that approach is silly. It's much easier to work iteratively, change one thing, change another, until you have exactly what you need, in a much faster and more creative session. So yeah you could use LLMs as a compiler, but it's so much more engaging not to. |
| |
| ▲ | tamnd a day ago | parent [-] | | Totally agree. It is why we're building Mochi (https://github.com/mochilang/mochi), a small language that treats AI, datasets, and graph queries as first-class citizens, not just targets for code generation. It's inspired by the evolution you mentioned: early compilers generating Assembly, now AI tools generating Python or SQL. Mochi leans into that by embedding declarative data queries, AI generation, and streaming logic directly into the language. Here is how it looks:
type Person {
name: string
age: int
email: string
}
let p = generate Person {
prompt: "Generate a fictional software engineer"
}
print(p.name)
let vec = generate embedding {
text: "hello world"
normalize: true
}
print(len(vec))
We see this as the natural next step after traditional compilers, more like intent compilers. The old "compiler to Assembly" phase now maps to "LLM prompt scaffolding", and prompt engineering is quickly becoming the new backend pass. Would love feedback if this resonates with others building around AI + structured languages. | | |
| ▲ | daxfohl 19 hours ago | parent [-] | | Sounds like a fun project, but I have a hard time imagining it ever really catching on. I'd compare it to workflow managers. Lots of people created lots of DSLs around that, but nothing ever really caught on until Temporal, because nobody wants the cognitive overhead of needing to maintain a separate language just for workflows, especially if you don't know whether that language is going to still be around in five years. With Temporal, you write the workflow logic in whatever language you normally use, as an ordinary async function, and if you follow the rules, it just works. Even though it looks like an ordinary procedural function, it survives server reboots, can sleep for months, etc. I'd recommend dropping the new language approach ASAP, and shifting toward more of a Temporal-like approach. That said, Temporal does a lot under the hood and on the Temporal server side to make it worth the money. Here, I have a hard time seeing what this would provide beyond a "TReturnType LLMProxy.callAI<TReturnType>(string prompt)" function that sends the prompt and expected return type, and parses the response to the desired type. There's not even a need for a separate server tool, it's just a function. So IDK if there's a product there or not. Seems like you'd need to figure out a way to add more intrinsic value than just a library function. But I think the new language idea, while fun to work on, is probably not going to get very far in the real world. |
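For concreteness, a rough Python sketch of that single proxy-function idea; call_ai, llm_complete, and the JSON-schema convention are hypothetical illustrations, not an existing library:

    import json
    from dataclasses import dataclass, fields, is_dataclass
    from typing import Type, TypeVar

    T = TypeVar("T")

    def llm_complete(prompt: str) -> str:
        """Placeholder for whatever chat-completion client is already in use."""
        raise NotImplementedError

    def call_ai(prompt: str, return_type: Type[T]) -> T:
        """Send the prompt plus the expected shape, then parse the reply into that type."""
        if not is_dataclass(return_type):
            raise TypeError("this sketch only handles dataclass return types")
        schema = {f.name: str(f.type) for f in fields(return_type)}
        raw = llm_complete(f"{prompt}\n\nReply with only a JSON object with these fields: {json.dumps(schema)}")
        data = json.loads(raw)  # production code would validate and retry on malformed output
        return return_type(**data)

    # Usage sketch:
    # @dataclass
    # class Person:
    #     name: str
    #     age: int
    # p = call_ai("Generate a fictional software engineer", Person)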
|
|
|
| ▲ | cschep a day ago | parent | prev | next [-] |
| Since LLMs aren't deterministic, isn't it impossible? What would keep it from iterating back and forth between two failing states forever? Is this the halting problem? |
| |
| ▲ | daxfohl a day ago | parent | next [-] | | I'd suggest the problem isn't that LLMs are nondeterministic. It's that English is. With a coding language, once you know the rules, there's no two ways to understand the instructions. It does what it says. With English, good luck getting everyone and the LLM to agree on what every word means. Going with LLM as a compiler, I expect by the time you get the English to be precise enough to be "compiled", the document will be many times larger than the resulting code, no longer be a reasonable requirements doc because it reads like code, but also inscrutable to engineers because it's so verbose. | | |
| ▲ | dworks a day ago | parent | next [-] | | Sure, we cannot agree on the correct interpretation of the instructions. But we also cannot define what is correct output. First, the term "accuracy" is somewhat meaningless when it comes to LLMs. Anything that an LLM outputs is by definition "accurate" or "correct" from a technical point of view because it was produced by the model. The term accuracy then is not a technical or perhaps even factual term, but a sociological and cultural term, where what is right or wrong is determined by society, and even we sometimes have a hard time determining what is true or not (see: philosophy). | |
| ▲ | miningape a day ago | parent | next [-] | | What? What does philosophy have to do with anything? If you cannot agree on the correct interpretation, nor output, what stops an LLM from solving the wrong problem? what stops an LLM from "compiling" the incorrect source code? What even makes it possible for us to solve a problem? If I ask an LLM to add a column to a table and it drops the table it's a critical failure - not something to be reinterpreted as a "new truth". Philosophical arguments are fine when it comes to loose concepts like human language (interpretive domains). On the other hand computer languages are precise and not open to interpretation (formal domains) - so philosophical arguments cannot be applied to them (only applied to the human interpretation of code). It's like how mathematical "language" (again a formal domain) describes precise rulesets (axioms) and every "fact" (theorem) is derived from them. You cannot philosophise your way out of the axioms being the base units of expression, you cannot philosophise a theorem into falsehood (instead you must show through precise mathematical language why a theorem breaks the axioms). This is exactly why programming, like mathematics, is a domain where correctness is objective and not something that can be waved away with philosophical reinterpretation. (This is also why the philosophy department is kept far away from the mathematics department) | | |
| ▲ | dworks a day ago | parent [-] | | Looks like you misunderstood my comment. My point is that both input and output are too fuzzy for an LLM to be reliable in an automated system. "Truth is one of the central subjects in philosophy." - https://plato.stanford.edu/entries/truth/ | |
| ▲ | miningape a day ago | parent [-] | | Ah yes, that makes a lot more sense - I understood your comment as something like "the LLMs are always correct, we just need to redefine how programming languages work" I think I made it halfway to your _actual_ point and then just missed it entirely. > If you cannot agree on the correct interpretation, nor output, what stops an LLM from solving the wrong problem? | | |
| ▲ | dworks 21 hours ago | parent [-] | | Yep. I'm saying the problem is not just about interpreting and validating the output. You need to also interpret the question, since it's in natural language rather than code, so it's not just twice as hard but strictly impossible to reach 100% accuracy with an LLM, because you can't define what is correct in every case. |
|
|
| |
| ▲ | codingdave a day ago | parent | prev [-] | | It seems to me that we already have enough people using the "truth is subjective" arguments to defend misinformation campaigns. Maybe we don't need to expand it into even more areas. Those philosophical discussions are interesting in a classroom setting, but far less interesting when talking about real-world impact on people and society. Or perhaps "less interesting" is unfair, but when LLMs straight up get facts wrong, that is not the time for philosophical pontification about the nature of accuracy. They are just wrong. | | |
| ▲ | dworks a day ago | parent [-] | | I'm not making excuses for LLMs. I'm saying that when you have a non-deterministic system for which you have to evaluate all the output for correctness due to its unpredictability, it is a practically impossible task. |
|
| |
| ▲ | rickydroll a day ago | parent | prev [-] | | Yes, in general, English is non-deterministic, e.g., reading a sentence with the absence or presence of an Oxford comma. When I programmed for a living, I found coding quite tedious and preferred to start with a mix of English and mathematics, describing what I wanted to do, and then translate that text into code. When I discovered Literate Programming, it was significantly closer to my way of thinking. Literate programming was not without its shortcomings and lacked many aspects of programming languages we have come to rely on today. Today, when I write small to medium-sized programs, it reads mostly like a specification, and it's not much bigger than the code itself. There are instances where I need to write a sentence or brief paragraph to prompt the LLM to generate correct code, but this doesn't significantly disrupt the flow of the document. However, if this is going to be a practical approach, we will need a deterministic system that can use English and predicate calculus to generate reproducible software. | | |
| ▲ | daxfohl 20 hours ago | parent [-] | | Interesting, I'm the opposite! I far prefer to start off with a bit of code to help explore gotchas I might not have thought about and to help solidify my thoughts and approach. It doesn't have to be complete, or even compile. Just enough to identify the tradeoffs of whatever I'm doing. Once I have that, it's usually far easier to flesh out the details in the detailed design doc, or go back to the Product team and discuss conflicting or vague requirements, or opportunities for tweaks that could lead to more flexibility or whatever else. Then from there it's usually easier to get the rest of the team on the same page, as I feel I'll understand more concretely the tradeoffs that were made in the design and why. (Not saying one approach is better than the other. I just find the difference interesting). |
|
| |
| ▲ | gloxkiqcza a day ago | parent | prev | next [-] | | Correct me if I’m wrong but LLMs are deterministic, the randomness is added intentionally in the pipeline. | | |
| ▲ | mzl a day ago | parent | next [-] | | LLMs can be run in a mostly deterministic mode (see https://docs.pytorch.org/docs/stable/notes/randomness.html for some info on running PyTorch programs). Varying the deployment type (chip model, number of chips, batch size, ...) can also change the output due to rounding errors. See https://arxiv.org/abs/2506.09501 for some details on that. | |
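For reference, a minimal sketch of the kind of settings that PyTorch page covers; exact requirements vary by version and backend (CUDA kernels may additionally need an environment variable), so treat this as illustrative:

    import torch

    # Seed the PRNGs so repeated runs draw the same "random" numbers
    torch.manual_seed(0)

    # Prefer deterministic kernels; PyTorch raises an error where none is available
    torch.use_deterministic_algorithms(True)

    # cuDNN-specific switches for NVIDIA GPUs
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False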
| ▲ | a day ago | parent | prev | next [-] | | [deleted] | |
| ▲ | zekica a day ago | parent | prev [-] | | The two parts of your statement don't go together. A list of potential output tokens and their probabilities are generated deterministically but the actual token returned is then chosen at random (weighted based on the "temperature" parameter and the probability value). | | |
| ▲ | galaxyLogic a day ago | parent | next [-] | | I assume they use software-based pseudo-random-number generators. Those can typically be given a seed-value which determines (deterministically) the sequence of random numbers that will be generated. So if an LLM uses a seedable pseudo-random-number-generator for its random numbers, then it can be fully deterministic. | | |
| ▲ | lou1306 a day ago | parent [-] | | There are subtle sources of nondeterminism in concurrent floating point operations, especially on GPU. So even with a fixed seed, if an LLM encounters two tokens with very close likelihoods, it may pick one or the other across different runs. This has been observed even with temperature=0, which in principle does not involve _any_ randomness (see arXiv paper cited earlier in this thread). |
| |
| ▲ | mzl a day ago | parent | prev [-] | | That depends on the sampling strategy. Greedy sampling takes the max token at each step. |
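Putting the last few comments together, a small self-contained Python sketch of that sampling step, with toy logits rather than a real model; greedy decoding falls out as the temperature-zero case, and a fixed seed makes the weighted choice reproducible:

    import math
    import random

    def sample_token(logits, temperature=1.0, seed=None):
        """Pick one token index from raw logits (toy version of an LLM's decoding step)."""
        if temperature == 0:
            # Greedy decoding: always take the most likely token
            return max(range(len(logits)), key=lambda i: logits[i])
        rng = random.Random(seed)  # a fixed seed makes the draw reproducible
        scaled = [l / temperature for l in logits]
        m = max(scaled)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scaled]
        return rng.choices(range(len(logits)), weights=weights, k=1)[0]

    logits = [2.0, 1.9, 0.1]                               # deterministic model output
    print(sample_token(logits, temperature=0))             # always index 0
    print(sample_token(logits, temperature=1.0, seed=42))  # random, but repeatable given the seed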
|
| |
| ▲ | pjmlp a day ago | parent | prev | next [-] | | As much as many devs that haven't read the respective ISO standards, the compiler manual back to back, and then get surprised with UB based optimizations. | |
| ▲ | mzl a day ago | parent | prev [-] | | Many compilers are not deterministic (it is why repeatable builds are not a solved problem), and many LLMs can be run in a mostly deterministic way. | |
| ▲ | miningape a day ago | parent [-] | | Repeatable builds are not a requirement for determinism. Since the outputs can be determined based on the exact system running the code, it is deterministic - even though the output can vary based on the system running the code. This is to say every output can be understood by understanding the systems that produced it. There are no dice rolls required. I.e. if it builds wrongly every other Tuesday, the reason for that can be determined (there's a line of code describing this logic). | | |
| ▲ | rthnbgrredf 19 hours ago | parent [-] | | While I don't disagree with your comment, I would say that a large language model, and a Docker build with a complex Dockerfile, where not every version is exactly pinned down, are quite similar. You might have updates from the base image, you might have updates from one of the thousands of dependencies. And each day you rebuild the image, you will get a different checksum. Similar to how you get different answers from the LLM. And just like you can get wrong answers from the LLM, you can also get Docker builds that start to behave differently over time. So this is how it often is in practice. Then there is the possibility to pin down every version, and also some large language models support temperature 0. This is more in the realm of determinism. |
|
|
|
|
| ▲ | fedeb95 a day ago | parent | prev | next [-] |
| > Democratize access to engineering
> You don't need as specialized skillsets to build complex apps, you just need to know how to put context together and iterate
I feel it is exactly the opposite. AI helps specialists iterate faster, knowing what they are doing. Those who don't know the details will stumble upon problems unsolvable by AI iteration. Those who know the details can step in where AIs fail. |
| |
| ▲ | klntsky a day ago | parent [-] | | I don't think there are problems unsolvable in principle. Given a good enough specialist and some amount of time, it's possible to guide an LLM to the solution eventually. The problem is that people often can't recognize whether they are getting closer to the solution or not, so iteration breaks. | | |
| ▲ | fedeb95 5 hours ago | parent | next [-] | | about unsolvability, it depends. In some contexts there are undecidable problems, like the halting problem, but I get your point. | |
| ▲ | fedeb95 5 hours ago | parent | prev [-] | | Yes, that's my point. The article seems to suggest that the opposite is true: that specialist problems can be solved without a specialist, by an untrained user and an LLM. |
|
|
|
| ▲ | baalimago a day ago | parent | prev | next [-] |
| I've had the same exact thought! The reason why we've moved from higher to higher level of programming language is to make it easier for humans to describe to the machine what we want it to do. That's why languages are semantically easier and easier to read js > cpp > c > assembly > machine code (subjectively, yes yes, you get the point). It makes perfect sense to believe that natural language interpreted by an LLM is the next step in this evolution. My prediction: in 10 years we'll see LLMs generate machine code directly, just like a normal compiler. The programming language will be the context provided by the context engineer. |
| |
| ▲ | c048 a day ago | parent | next [-] | | I've thought of this too. But I always end up in a scenario where, in order to make the LLM spit out consistent and as precise as possible code, we end up with a very simple and tight syntax. For example we'll be using less and less complete human sentences, because they leave too much open to interpretation, and end up with keywords like "if", "else" and "foreach". When we eventually do end up at that utopia, the first person to present this at a conference will be hailed as a revolutionist. Only for the LLM to have a resolve clash and, while 'hallucinating', flip a boolean check. | |
| ▲ | vbezhenar a day ago | parent | prev | next [-] | | A normal compiler does not generate machine code directly. The compiler generates LLVM IR code, LLVM generates assembly listings, and the assembler generates machine code. You can write a compiler which outputs machine code directly, but this multi-level translation exists for a reason. IMO, an LLM might be utilised to generate some Python code in the far, far away future, if the issue with deterministic generation is solved. But generating machine code does not make much sense. Today an LLM uses external tools to compute the sum of numbers, because they are so bad at deterministic calculations. The core issue is that you need to be able to iterate on different parts of the application, either without altering unaffected parts or with deterministic translation. Otherwise, this AI application will be full of new bugs with every change. | |
| ▲ | baalimago a day ago | parent | next [-] | | > if the issue with deterministic generation would be solved
This can be achieved by utilizing tests. So the SWE agent will write up a set of tests as it understands the task. These are the functional requirements, which should/could be easily inspected by the BI (biological intelligence). Once the functional requirements have been set, the SWE agent can iterate over and over again until the tests pass. At this point it doesn't really matter what the solution code looks like or how it's written, only that the functional requirements as defined via the tests are upheld. New requirements? Additional tests. | |
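A minimal sketch of that generate-until-green loop; generate_code is a hypothetical LLM call and pytest is assumed as the test runner, so this is an outline rather than a working agent:

    import subprocess
    from typing import Optional

    def generate_code(spec: str, feedback: Optional[str] = None) -> str:
        """Placeholder: ask the LLM for an implementation, optionally with failing-test output."""
        raise NotImplementedError

    def iterate_until_green(spec: str, max_attempts: int = 10) -> str:
        feedback = None
        for _ in range(max_attempts):
            candidate = generate_code(spec, feedback)
            with open("candidate.py", "w") as f:
                f.write(candidate)
            # The human-reviewed tests are the fixed functional requirements
            result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
            if result.returncode == 0:
                return candidate  # requirements met; how the code looks is secondary
            feedback = result.stdout + result.stderr  # feed the failures back to the model
        raise RuntimeError("no candidate passed the tests within the attempt budget")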
| ▲ | kadhirvelm a day ago | parent [-] | | Totally agree - I'd bet there will be a bigger emphasis on functional testing to prevent degradation of previously added features. And I'd bet the scope of tests we'll need to write will also go up. For example, I'd bet we'll need to add latency-based unit tests to make sure that, as the LLM compiler is iterating, it doesn't make user-perceived performance worse. |
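A latency-based test of the kind suggested here can be an ordinary unit test with a timing assertion; the 200 ms budget and the search function are made-up examples:

    import time

    def test_search_stays_within_latency_budget():
        from app import search  # hypothetical function under test
        start = time.perf_counter()
        search("example query")
        elapsed = time.perf_counter() - start
        # Regenerated implementations must not regress user-perceived performance
        assert elapsed < 0.2, f"search took {elapsed:.3f}s, budget is 200 ms"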
| |
| ▲ | pjmlp a day ago | parent | prev [-] | | C and C++ UB enter the room,.... |
| |
| ▲ | gloxkiqcza a day ago | parent | prev | next [-] | | I agree that the level of abstraction will grow and LLMs will be the primary tool to write code *but* I think they will still generate code in a formal language. That formal language might be very close to a natural language, pseudo code if you will, but it will still be a formal language. That will make it much much easier to work on, collaborate on and maintain the codebase. It’s just my prediction though, I might be proven wrong shortly. | | |
| ▲ | lloeki a day ago | parent [-] | | You seem to have missed this part of TFA: > That means we no longer examine the code. Our time as engineers will be spent handling context, testing features, and iterating on them IOW there would be no human to "work on, collaborate on and maintain the codebase" and so the premise of the article is that it might just as well emit machine code from the "source prompt", hence "LLM as compiler". Or maybe you mean that this formal language is not for humans to handle but entirely dedicated to LLMs, for the sake of LLMs not having to reverse engineer assembly? I think that's where the premises differ: the author seems to suggest that the assembly would be generated each time from the "source prompt" I don't know, these all read like thought experiments built on hypothetical properties that these AI tools would somehow be bestowed upon in some future and not something grounded in any reality. IOW science fiction. |
| |
| ▲ | normalisticate a day ago | parent | prev | next [-] | | > My prediction: in 10 years we'll see LLMs generate machine code directly, just like a normal compiler. The programming language will be the context provided by the context engineer. Sure and if you run into a crash the system is just bricked. This sort of wishful thinking glosses over decades of hard earned deterministic behavior in computers. | |
| ▲ | arkh a day ago | parent | prev | next [-] | | > It makes perfect sense to believe that natural language interpreted by an LLM is the next step in this evolution. Which one? Most languages are full of imprecision and change over time. So which one would be best for giving instructions to the machines? | | |
| ▲ | galaxyLogic a day ago | parent [-] | | In the scheme described in the article the main input for AI would be the tests. If we are testing code outputs (and why not) the input then must be in a programming language. Specifications need to be unambiguous but Natural Language is often ambiguous. |
| |
| ▲ | sjrd a day ago | parent | prev | next [-] | | The level of abstraction of programming languages has been growing, yes. However, new languages have preserved precision and predictability. I would even argue that as we went up the abstraction ladder, we have increasingly improved the precision of the semantics of our languages. LLMs don't do that at all. They completely destroy determinism as a core design. Because of that, I really don't think LLMs will be the future of programming languages. | |
| ▲ | kadhirvelm a day ago | parent | prev | next [-] | | Interesting, I'd hypothesize something slightly different, that we'll see a much more efficient language come out. Something humans don't need to read that can then get compiled to machine code super efficiently. Basically optimizing the output tokens to machine work done as much as possible | |
| ▲ | alaaalawi a day ago | parent | prev | next [-] | | I concur: not intermediate code but machine code directly, and even no tests.
It will take human specs, internally understand them (maybe with formal methods of reasoning), and keep chatting with the user about any gaps ("you mentioned A and C, what about B?") or ask for clarification on inconsistencies ("in point 16 you mentioned that, and in point 50 you mentioned this; to my limited understanding, doesn't this contradict? If we take that from point 16 and that from point 50, how do you resolve it?"). In short, it will act as a business analyst in the middle, with no (imagined) ego or annoyance, for the user. From talk to walk. | |
| ▲ | azaras a day ago | parent | prev [-] | | But this is a waste of resources. The LLM's output should be generated in a higher-level language and then compiled. | |
| ▲ | rvz a day ago | parent [-] | | There you go. Then an actual compiler compiles the code into the correct low-level assembly for the actual linker to create an executable. Congratulations. An LLM is not a 'compiler'. |
|
|
|
| ▲ | careful_ai a day ago | parent | prev | next [-] |
| Love this framing—treating LLMs like compilers captures how engineers mentally iterate: code, check, refine. It’s not about one-shot prompts. It’s a loop of design, compile, analyze, debug. That mindset shift—seeing LLMs as thought compilers—might be the missing link for real developer adoption. |
|
| ▲ | fxj a day ago | parent | prev | next [-] |
| That works not only for compilation but also for more general code transformations, like code parallelization with OpenMP in C and Fortran, array-of-lists to list-of-arrays conversions, or transforming Python code to parallel C code and making a Python module out of it. I have created some pipelines this way where the LLM generates input files for a molecular dynamics code and writes a Python script for execution on an HPC system. |
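As a toy illustration of one such mechanical layout transform, sketched in plain Python as a records-to-columns change (in the spirit of array-of-structs to struct-of-arrays; the particle fields are invented):

    # "Array of structs" layout: one dict per particle, convenient to write
    particles_aos = [
        {"x": 0.0, "y": 1.0, "mass": 2.5},
        {"x": 3.0, "y": 4.0, "mass": 1.0},
    ]

    # "Struct of arrays" layout: one list per field, friendlier for vectorized/parallel code
    particles_soa = {key: [p[key] for p in particles_aos] for key in particles_aos[0]}

    print(particles_soa["x"])  # [0.0, 3.0]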
|
| ▲ | iamgvj a day ago | parent | prev | next [-] |
| There was a literal LLM compiler model that was released last year by Meta. https://arxiv.org/abs/2407.02524 |
|
| ▲ | ryanobjc a day ago | parent | prev | next [-] |
| So this has already been conceived of many decades ago, and there are some substantial issues with it; the illustrious Dijkstra covers it:
https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667... Now this isn’t to say the current programming languages are good, they are generally not. They don’t offer good abstraction powers, typically. You pay for this in extra lines of code. But having to restate everything in English, then hoping that the LLM will fill in enough details, then iterating until you can close the gaps, well it doesn’t seem super efficient to me. You either cede control to what the LLM guesses or you spend a lot of natural language. Certainly in a language with great abstractions you’d be fine already. |
| |
| ▲ | quantumgarbage a day ago | parent | next [-] | | Ah so I was right to scroll down to find a sane take | |
| ▲ | nojito a day ago | parent | prev [-] | | It's no different from translating business requirements into code. Dijkstra was talking about something completely different. | |
| ▲ | sponnath a day ago | parent | next [-] | | True but this only works well if the natural language "processor" was reliable enough to properly translate business requirements into code. LLMs aren't there yet. | |
| ▲ | lou1306 a day ago | parent | prev [-] | | Exactly, and translating business requirements into code is so frustrating and error-prone that entire philosophies (and consulting firms) have been built around it. LLMs are no silver bullet, they are just faster to come up with _something_. |
|
|
|
| ▲ | a day ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | pjmlp a day ago | parent | prev | next [-] |
| Not necessarily this, however I am quite convinced that AI-based tooling will be the evolution of compilers. The current approach of generating source code in existing programming languages is only a transition step, akin to how early compilers always generated Assembly that was further processed by an existing Assembler. Nowadays most developers don't even know the magic incantations to make their compilers spew Assembly, including JITs; it has become a dark art for compiler engineers, game developers and crypto folks. People that used to joke about COBOL would be surprised how much effort is being spent on prompt engineering. |
|
| ▲ | satisfice 12 hours ago | parent | prev | next [-] |
| 40404 error: Responsible skepticism not found. |
|
| ▲ | UltraSane a day ago | parent | prev [-] |
| This really only works well if you have a TLA+ style formal model of the algorithm and can use it to generate lots of unit tests. |
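TLA+ itself is beyond a short sketch, but as a loose analogue of "a model that generates lots of tests", a property-based test in Python using the hypothesis library checks stated invariants against many generated inputs; the dedupe function and its invariants are invented for illustration:

    from hypothesis import given, strategies as st

    def dedupe(items):
        """Toy function under test: drop duplicates, keep first-seen order."""
        seen, out = set(), []
        for x in items:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    # The "model" here is a pair of invariants; hypothesis generates the test cases
    @given(st.lists(st.integers()))
    def test_dedupe_invariants(xs):
        result = dedupe(xs)
        assert len(result) == len(set(result))  # no duplicates remain
        assert set(result) == set(xs)           # no elements lost or invented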