| ▲ | the_duke 4 hours ago |
| This doesn't make much sense to me.
* This isn't a language, it's tooling to map specs to code and re-generate.
* Models aren't deterministic - every time you re-applied the spec you'd likely get different output (unless you feed the current code into the re-apply and let the model just recommend changes).
* Models are evolving rapidly - this month's flavour of Codex/Sonnet/etc. would very likely generate different code from last month's.
* Text specifications are always under-specified and lossy, and tend to gloss over a huge number of details that the code has to make concrete. That's fine in a small example, but in a larger code base?
* Every non-trivial codebase would be made up of hundreds of specs that interact and influence each other - very hard (and context-heavy) to read every spec that affects a piece of functionality and keep it all coherent.
I do think there are opportunities in this space, but what I'd like to see is:
1. Write text specifications.
2. A model transforms the text into a *formal* specification.
3. The formal spec is translated into code which can be verified against the spec.
Steps 2 and 3 could be merged into one if there were practical/popular languages that also support verification, in the vein of Ada/SPARK. But you can also get there by generating tests from the formal specification that validate the implementation (see the sketch below). |
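(A minimal sketch of that last idea in Python - the spec-as-postcondition, the implementation, and the test harness are all invented for illustration, not output from any real spec-to-code tool:)

    import random

    # Toy "formal spec": a postcondition every implementation must satisfy.
    def spec_binary_search(xs, target, result):
        if result == -1:
            return target not in xs
        return 0 <= result < len(xs) and xs[result] == target

    # Candidate (possibly model-generated) implementation.
    def binary_search(xs, target):
        lo, hi = 0, len(xs) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if xs[mid] == target:
                return mid
            if xs[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1

    # Generated validation: random sorted inputs checked against the spec.
    for _ in range(1000):
        xs = sorted(random.sample(range(100), random.randint(0, 20)))
        t = random.randrange(100)
        assert spec_binary_search(xs, t, binary_search(xs, t))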
|
| ▲ | onion2k 3 hours ago | parent | next [-] |
| > Models aren't deterministic - every time you would try to re-apply you'd likely get different output (without feeding the current code into the re-apply and let it just recommend changes)
If the result is always provably correct, it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinitely more important than the code itself. |
| |
| ▲ | sensanaty 2 hours ago | parent | next [-] | | That "if" at the beginning of your sentence is doing a whole lot of work. Indeed, if we could formally and provably (another extremely loaded word) generate good code, that'd be one thing, but proving correctness is one of those basically impossible tasks. | |
| ▲ | dsr_ 3 hours ago | parent | prev | next [-] | | Let's rephrase: Since nobody involved actually cares whether the code works or not, it doesn't matter whether it's a different wrong thing each time. | | |
| ▲ | brabel 20 minutes ago | parent [-] | | You got it completely backwards. The claim is that if the code does exactly what the spec says (which generated tests are supposed to "prove") then the actual code does not matter, even if it's different each time. |
| |
| ▲ | tomtomtom777 2 hours ago | parent | prev | next [-] | | > If the result is always provably correct it doesn't matter whether or not it's different at the code level. People interested in systems like this believe that the outcome of what the code does is infinitely more important than the code itself. If the spec is so complete that it covers everything, you might as well write the code. The benefit of writing a spec and having the LLM code it is that the LLM will fill in a lot of blanks. And it is this filling in of blanks that is non-deterministic. | | |
| ▲ | pjmlp 2 hours ago | parent [-] | | > If the spec is so complete that it covers everything, you might as well write the code. Welcome to the usual offshoring experience. |
| |
| ▲ | SpaceNoodled 3 hours ago | parent | prev | next [-] | | That's a huge "if." | | | |
| ▲ | FrankRay78 2 hours ago | parent | prev | next [-] | | Sure, but where are the formal acceptance tests to validate against? | |
| ▲ | 0-_-0 an hour ago | parent | prev | next [-] | | Besides, you can deterministically generate bad code, and not deterministically generate good code. | |
| ▲ | __loam 3 hours ago | parent | prev | next [-] | | The code is what the code does. | | |
| ▲ | kennywinker 2 hours ago | parent [-] | | The shoe is what the shoe does. Except one shoe is made by children in a fire-trap sweatshop with no breaks, and the other was made by a well paid adult in good working conditions. The ends don’t justify the means. The process of making impacts the output in ways that are subtle and important, but even holding the output as a fixed thing - the process of making still matters, at least to the people making it. | | |
| ▲ | pjmlp 2 hours ago | parent | next [-] | | Yet the people voting with their wallets seem to go with the cheaper option, regardless of what hides behind it. Be it shoes, offshoring, web widgets or AI-generated code. | |
| ▲ | raw_anon_1111 2 hours ago | parent | prev [-] | | The end is whether the code meets the functional and non-functional requirements. And guess how much money the shoe companies that manufacture in sweatshop conditions make versus the ones making artisanal handcrafted shoes? | | |
| ▲ | kennywinker 2 hours ago | parent | next [-] | | Ah yes - we should all strive to maximize shareholder value, Triangle Shirtwaist be damned. Btw, in my metaphor we - the programmers - are the kids in the sweatshop. | |
| ▲ | raw_anon_1111 2 hours ago | parent [-] | | If you are a “programmer”, you are going to be the kids in the sweatshop. On the enterprise dev side, where most developers work, it's been headed in that direction for at least a decade, since it became easy enough to be a “good enough” generic full-stack/mobile/web dev. Even on the BigTech side, being able to reverse a btree on the whiteboard and having “mid-level developer” on your resume isn't enough anymore. If you look at the comp on that side, it's also stagnated for a decade. AI has just accelerated that trend. While my job has been, at various percentages, to produce code for 30 years, it's been well over a decade since I had to sell myself on “I codez real gud”. I sell myself as a “software engineer” who can go from ambiguous business and technical requirements and deal with politics, XY problems, etc. | | |
| ▲ | pjmlp 2 hours ago | parent [-] | | What do you think programmers in offshoring consulting shops are? Sadly. | | |
| ▲ | raw_anon_1111 an hour ago | parent [-] | | Exactly. I work in a consulting company as a customer facing staff consultant - highest level - specializing in cloud + app dev. We don’t hire anyone less than staff in the US. Anything lower is hired out of the country. That’s exactly my point. “Programming” was clearly becoming commoditized a decade ago. |
|
|
| |
| ▲ | uoaei an hour ago | parent | prev [-] | | Functional requirements are known knowns. Out-of-bounds behavior is sometimes a known unknown, but in the era of generated code it's exclusively unknown unknowns. Good luck speccing out all the unanticipated side effects and undefined behaviors. Perhaps you can prompt the agent in a loop a number of times, but it's hard to believe that the brute-force throw-more-tokens-at-it approach has the same level of return as a more attentive audit by human eyeballs. | |
| ▲ | raw_anon_1111 21 minutes ago | parent [-] | | Are you, as a developer, 100% able to trust that you didn't miss anything? Your team, if you're a team lead who delegates tasks to other developers? If you outsource non-business things like Salesforce integrations, do you know all of the code they wrote? Your library dependencies? Your infrastructure providers? |
|
|
|
| |
| ▲ | Copyrightest 3 hours ago | parent | prev | next [-] | | [dead] | |
| ▲ | jrm4 3 hours ago | parent | prev [-] | | I would be very comfortable with: re-run 100 times with different seeds. If the outcome is the same every time, you're reliably good to go. |
|
|
| ▲ | wenc 12 minutes ago | parent | prev | next [-] |
| Rehashing my comment from before: I use Kiro IDE (≠ Kiro CLI) primarily as a spec generator.
In my experience, it's high-quality for creating and iterating on specs. Tools like Cursor are optimized for human-driven vibing -- they have great autocomplete, etc. Kiro, by contrast, is optimized around the spec, which ironically has been the most effective approach I've found for driving agents. I'd argue that Cursor, Antigravity, and similar tools are optimized for human steering, which explains their popularity, while Kiro is optimized for agent harnesses. That's also why it's underused: it's quite opinionated, but very effective. Vibe-coding culture isn't sold on spec-driven development (they think it's waterfall and summarily dismiss it -- even Yegge has this bias), so people tend to underrate it.
Kiro writes specs using structured formats like EARS and INCOSE (the spec format used in places like Boeing for engineering requirements). It performs automated reasoning to check for consistency, then generates a design document and task list from the spec -- similar to what Beads does. I usually spend a significant amount of time pressure-testing the spec before implementing (often hours to days), and it pays off. Writing a good, consistent spec is essentially the computer equivalent of "writing as a tool of thought" in practice. Once the spec is tight, implementation tends to follow it closely.
Kiro also generates property-based tests (PBTs) using Hypothesis in Python, inspired by Haskell's QuickCheck. These tests sweep the input domain and, when combined with traditional scenario-based unit tests, tend to produce code that adheres closely to the spec. I also add a small instruction, "do red/green TDD" (I learned this from Simon Willison), and that one line alone improved the quality of all my tests.
Kiro can technically implement the task list itself, but this is where agents come in. With the spec in hand, I use multiple headless CLI agents in tmux (e.g., Kiro CLI, Claude Code) for implementation. The results have been very good. With a solid Kiro spec and task list, agents usually implement everything end-to-end without stopping -- I haven't found a need for Ralph loops. (Agents sometimes tend to stop midway on Claude plans, but I've never had that happen with Kiro; not sure why, maybe it's the checklist, which includes the PBT tests as gates.) Kiro didn't have the strongest start, but the Kiro IDE is one of the best spec generators I've used, and it integrates extremely well with agent-driven workflows. |
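(For anyone who hasn't seen PBTs: a rough sketch of the style of Hypothesis test described above. The function and the property are invented for illustration -- this isn't actual Kiro output:)

    from hypothesis import given, strategies as st

    # Hypothetical function under test. The property below -- not fixed
    # example values -- encodes the spec requirement "result stays in [0, cap]".
    def clamp(x: float, cap: float) -> float:
        return max(0.0, min(x, cap))

    # Hypothesis sweeps the input domain instead of checking a few cases.
    @given(st.floats(allow_nan=False, allow_infinity=False),
           st.floats(min_value=0.0, allow_nan=False, allow_infinity=False))
    def test_clamp_stays_in_range(x, cap):
        assert 0.0 <= clamp(x, cap) <= cap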
|
| ▲ | pron 2 hours ago | parent | prev | next [-] |
| If what you're after is determinism, then your solution doesn't offer it. Both the formal specification and the code generated from it would be different each time. Formal specifications are useful when they're succinct, which is possible when they specify at a higher level of abstraction than code, which admits many different implementations. |
| |
| ▲ | vidarh an hour ago | parent [-] | | The point would presumably be to formalise it, then verify that the formal version matches what you actually meant. At which point you can't/shouldn't regenerate it, but you can request changes (which you'd need to verify and approve). | | |
| ▲ | pron an hour ago | parent [-] | | But the code produced from the formal spec would still be nondeterministic. And I believe CodeSpeak doesn't wish to regenerate the entire program with each spec change, but apply code changes based on the changes to the spec. Maybe there could be other benefits to formalisation in this case, but determinism isn't one of them. | | |
| ▲ | vidarh an hour ago | parent [-] | | It doesn't matter if the code is different if the spec is formal enough to validate the software against it. I have no idea about codespeak - I was responding to the comments above, not about codespeak. | | |
| ▲ | pron a minute ago | parent [-] | | Validating programs against a formal spec is very, very hard for foundational computational complexity reasons. There's a reason why the largest programs whose code was fully verified against a formal spec, and at an enormous cost, were ~10KLOC. If you want to do it using proofs, then lines of proof outnumber lines of code 10-1000 to 1, and the work is far harder than for proofs in mathematics (that are typically much shorter). There are less absolute ways of checking spec conformance at some useful level of confidence, and they can be worthwhile, but they require expertise and care (I'm very much in favour of using them, but the thought that AI can "just" prove conformance to a formal spec ignores the computational complexity results in that field). |
|
|
|
|
|
| ▲ | DrJokepu 3 hours ago | parent | prev | next [-] |
| > Models aren't deterministic Is that really true? I haven’t tried to do my own inference since the first Llama models came out years ago, but I am pretty sure it was deterministic: if you fixed the seed and the input was the same, the output of the inference was always exactly the same. |
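(A toy illustration of that fixed-seed point -- numpy standing in for a real sampler; the numbers and temperature are made up:)

    import numpy as np

    rng = np.random.default_rng(42)      # fixed seed
    logits = np.array([2.0, 1.0, 0.5])   # pretend next-token scores

    # Temperature 0 degenerates to greedy argmax: always the same token.
    print(int(np.argmax(logits)))

    # Temperature > 0 samples from the softmax: reproducible only because
    # the seed above is fixed; without it, output varies run to run.
    p = np.exp(logits / 0.8)
    p /= p.sum()
    print(int(rng.choice(len(logits), p=p)))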
| |
| ▲ | bigwheels 3 hours ago | parent [-] | | LLMs are not deterministic: 1) There is typically a temperature setting (even when it isn't user-exposed -- most major providers have stopped exposing it, especially in the TUIs). 2) Even with the temperature set to 0, it will be almost deterministic, but you'll still observe small variations due to the limited precision of float numbers. Edit: thanks for the corrections | |
| ▲ | dwohnitmok 2 hours ago | parent | next [-] | | > but you'll still observe small variations due to the limited precision of float numbers
No. Floating-point arithmetic is deterministic. You don't get different answers for the same operations on the same machine just because of limited precision. There are reasons why it can be difficult to make floating-point operations agree across machines, but that is more of a (very annoying and difficult to make consistent) configuration problem than non-determinism.
(In general it is mildly frustrating to me to see software developers treat floating point as some sort of magic and ascribe all sorts of non-deterministic qualities to it. Yes, floating-point configuration for consistent results across machines can be absurdly annoying and nigh-impossible if you use transcendental functions and different binaries. No, this does not mean that a program giving different results for the same input on the same machine is a floating-point issue.)
In theory, parallel execution combined with non-associativity can cause LLM inference to be non-deterministic. In practice that is not the case: LLM forward passes rarely use non-deterministic kernels (and those are usually explicitly marked as such, e.g. in PyTorch).
You may be thinking of non-determinism caused by batching, where different batch sizes can cause variations in output. This is not strictly speaking non-determinism from the perspective of the LLM, but it is effectively non-determinism from the perspective of the end user, because the end user generally has no control over how a request is slotted into a batch. | |
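(To see the non-associativity point concretely -- each grouping below is individually deterministic; results only differ run to run when something like batch size changes the reduction order:)

    print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
    print(0.1 + (0.2 + 0.3))  # 0.6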
| ▲ | comboy 2 hours ago | parent | prev | next [-] | | Limited precision of float numbers is deterministic. But there's the whole parallelism question and how things are wired together -- your generation may end up on different hardware, etc. And the models I work with (Claude, Gemini, etc.) do have a temperature parameter when you use the API. | |
| ▲ | the_duke an hour ago | parent | prev [-] | | You shouldn't be downvoted - LLMs could in theory be deterministic, but they currently are not, due to how models are implemented. |
|
|
|
| ▲ | davedx 4 hours ago | parent | prev | next [-] |
| My process has organically evolved towards something similar but less strictly defined:
- I bootstrap AGENTS.md with my basic way of working and occasionally one or two project-specific pieces.
- I then write a DESIGN.md. How detailed or well-specified it is varies from project to project: the other day I wrote a very complete DESIGN.md for a time-tracking, invoice-management and accounting system I wanted for my freelance biz. Because it was quite complete, the agent almost one-shot the whole thing.
- I often also write a TECHNICAL-SPEC.md of some kind. Again, how detailed it is varies.
- Finally, I link to those two from AGENTS.md. I also usually put in AGENTS.md that the agent should maintain the docs and keep them in sync with newer decisions I make along the way (rough skeleton below).
This system works well for me, but it's still very ad hoc and definitely doesn't follow any kind of formally defined spec standard. And I don't think it should, really? IMO, technically strict specs should be in your automated tests, not your design docs. |
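(An illustrative skeleton of what that AGENTS.md linkage might look like -- the wording is a paraphrase of the setup above, not any standard:)

    # AGENTS.md
    ## Way of working
    - Read DESIGN.md (product scope) and TECHNICAL-SPEC.md
      (implementation detail) before making changes.
    - Keep both documents in sync with any decisions made
      during implementation.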
| |
| ▲ | the_duke 3 hours ago | parent | next [-] | | I think many have adopted "spec driven development" in the way you describe. I've found it works very well in one-off scenarios, but the specs often drift from the implementation.
Even if you let the model update the spec at the end, the next few work items will make parts of it obsolete. Maybe that's exactly the problem that "codespeak" is trying to solve, but I'm skeptical this will work well without more formal specifications in the mix. | |
| ▲ | intrasight 3 hours ago | parent [-] | | > specs often drift from the implementation
> Maybe that's exactly the goal that "codespeak" is trying to solve Yes and yes. I think it's an important direction in software engineering. It's something that people were trying to do a couple decades ago but agentic implementation of the spec makes it much more practical. |
| |
| ▲ | jbonatakis 3 hours ago | parent | prev | next [-] | | I have been building this in my free time and it might be relevant to you: https://github.com/jbonatakis/blackbird I have the same basic workflow as you outlined, then I feed the docs into blackbird, which generates a structured plan with tasks and subtasks. Then you can have it execute the tasks in dependency order, with options to pause for review after each task, or an automated review when all child tasks for a given parent are complete. It definitely still has some rough edges, but it has been working pretty well for me. | |
| ▲ | rebolek 3 hours ago | parent | prev [-] | | AGENTS.md is nice, but I still need to remind models that it exists and that they should read it instead of reinventing the wheel every time. | |
| ▲ | allthetime 3 hours ago | parent [-] | | There should be a setting to include specific files in every prompt/context. I'm using Zed, and when you fire up an agent/chat it explicitly states that the file(s) are included. |
|
|
|
| ▲ | jnpnj an hour ago | parent | prev | next [-] |
| Maybe we're entering the era of non-deterministic applications too. No more mechanical, predictable thing... more like 90% regular, and then weird. Slightly sarcastic, but I'm not sure this couldn't become a thing. |
|
| ▲ | rco8786 2 hours ago | parent | prev | next [-] |
| How is your 2 step process not susceptible to all the exact same pitfalls you listed above? |
|
| ▲ | 2 hours ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | dist-epoch 2 hours ago | parent | prev | next [-] |
| > Models aren't deterministic - every time you would try to re-apply you'd likely get different output So like when you give the same spec to 2 different programmers. |
| |
| ▲ | rco8786 2 hours ago | parent | next [-] | | Yes, if you had each programmer rewrite the code from scratch each time you updated the spec. | | |
| ▲ | orbital-decay an hour ago | parent [-] | | In reality you give the same programmer an update to the existing spec, and they change the code to implement the difference. Which is exactly what the thing in OP is doing, and exactly what should be done. There's simply no reason to regenerate the result. The entire thing about determinism is a red herring, because 1) it's not determinism but prompt instability, and 2) prompt instability doesn't matter because of the above. Intelligence (both human and machine) is not a formal domain, your inputs lack formal syntax, and that's fine. For some reason this basic concept creates endless confusion everywhere. |
| |
| ▲ | dboreham an hour ago | parent | prev | next [-] | | Also like this: https://codeassociates.github.io/conversations-with-claude/c... | |
| ▲ | kennywinker 2 hours ago | parent | prev [-] | | Except each time you compile your spec you’re re-writing it from scratch with a different programmer. |
|
|
| ▲ | pessimizer 3 hours ago | parent | prev | next [-] |
| I think your objections miss the point. My informal specs for a program are user-focused. I want to dictate what benefits the program will give to the person using it, which may include requirements for a transport layer, a philosophy of user interaction, or any number of things. When I know what I want out of a program, I go through the agony of translating that into a spec with database schemas, menu options, specific encryption schemes, etc., and then finally I turn that into a formal spec, within which whether I use an underscore or a dash somewhere becomes a thing that has to be consistent throughout the document. You're telling me that I should be doing the agonizing parts in order for the LLM to do the routine part (transforming a description of a program into a formal description of a program). Your list of things that "make no sense" are exactly the things that I want the LLMs to do.
I want to be able to run the same spec again and see the LLM add a feature that I never expected (and wasn't in the last version run from the same spec), or modify tactics to accomplish user goals based on changes in technology or the availability of new standards/vendors. I want to see specs that move away from describing the specific functionality of programs altogether, and more toward describing the usefulness or convenience of a program that doesn't exist yet.
I want to be able to feed the LLM requirements for what I want a program to accomplish, and let the LLM research and implement the how. I only want to have to describe constraints, i.e. it must enable me to do A, B, and C; it must prevent X, Y, and Z. I want it to feel free to solve those constraints in the way it sees fit, and when I find myself unsatisfied with the output, I'll give it more constraints and ask it to regenerate. |
| |
| ▲ | darkwater 3 hours ago | parent [-] | | > I want to be able to run the same spec again and see the LLM add a feature that I never expected (and wasn't in the last version run from the same spec) or modify tactics to accomplish user goals based on changes in technology or availability of new standards/vendors. Be careful what you wish for. This sounds great in theory, but in practice it will probably mean a migration path for the users every time (UX changes, small details changed, cost dynamics, and a long list of other things). |
|
|
| ▲ | fnord77 3 hours ago | parent | prev | next [-] |
| [deleted] |
| |
|
| ▲ | hkonte 3 hours ago | parent | prev [-] |
| [dead] |