hintymad 6 hours ago

> With hard requirements listed, I found out that the generated code missed requirements,

This is hardly a surprise, no? No matter how much training we run, we are still producing a generative model. And a generative model doesn't understand your requirements and check them off one by one. It predicts the next most likely token from a given prompt. If the most statistically plausible way to finish a function looks like a version that ignores your third requirement, the model will happily follow through. There are no rules in your requirements doc; they are just the conditioning events X in a glorified P(Y|X). I'd venture to guess that sometimes missing a requirement actually increases the probability of the generated tokens, so the model will happily allow the miss. Actually, "allow" is too strong a word. The model does not allow shit. It just generates.
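The point can be made concrete with a toy sketch. This is NOT a real LLM, just a weighted-choice stand-in with made-up numbers: the "model" is a conditional distribution over completions given the prompt, and a stated requirement is only more conditioning text. It shifts the probabilities; it enforces nothing.

```python
import random

# Toy stand-in for a generative model: sample a completion from a
# conditional distribution P(completion | prompt). Nothing here checks
# whether a requirement in the prompt was actually satisfied.
def sample_completion(prompt: str, table: dict[str, dict[str, float]]) -> str:
    dist = table[prompt]
    completions, weights = zip(*dist.items())
    return random.choices(completions, weights=weights)[0]

# Hypothetical numbers: stating the requirement shifts mass toward the
# compliant completion, but the one that ignores it keeps nonzero mass,
# so sampling can still emit it.
table = {
    "write sort()": {"return sorted(xs)": 0.6, "return xs": 0.4},
    "write sort(); MUST be stable": {"return sorted(xs)": 0.8, "return xs": 0.2},
}
```

Under this sketch, the requirement is just another key in the conditioning context; the violating completion is merely less likely, never forbidden.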

teucris 6 hours ago | parent [-]

But agents do keep task lists and check the tasks off as they go. Of course it’s not perfect either but it’s MUCH better than an LLM can offer on its own.

If you are seeing an agent missing tasks, work with it to write down the task list first and then hold it accountable to completing them all. A spec is not a plan.
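A minimal sketch of the agent loop being described, with an assumed placeholder `call_llm` (not any real API): the task list lives in the harness as real program state, so completion is audited by code rather than left to the model.

```python
# Hypothetical harness: the task list is explicit program state, worked
# through one item at a time, and completion is verified by the harness
# itself rather than trusted to the model's output.
def call_llm(task: str) -> str:
    return f"done: {task}"  # placeholder for an actual model call

def run_with_task_list(tasks: list[str]) -> dict[str, str]:
    results: dict[str, str] = {}
    for task in tasks:                  # one task per model call
        results[task] = call_llm(task)
    missed = set(tasks) - set(results)  # the harness, not the model, audits
    assert not missed, f"unchecked boxes: {missed}"
    return results
```

This is the difference between a spec and a plan in code form: the plan is enumerable state the loop can check, not a blob of prose the model may or may not attend to.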

mathisfun123 5 hours ago | parent [-]

bro do you really not understand that that's a game played for your sake? It checks boxes, yes, but you have no idea what effect checking the boxes actually has. Like, do you not realize that Anthropic/OpenAI are baking this kind of stuff into models/UI/UX to give the sensation of rigor?

jwitthuhn 2 hours ago | parent | next [-]

The checkboxes inform the model as well as the user, and you can observe this yourself. For example in a C++ project with MyClass defined in MyClass.cpp/h:

I ask the model to rename MyClass to MyNewClass. It will generate a checklist like:

- Rename references in all source files

- Rename source/header files

- Update build files to point at new source files

Then it will do those things in that order.

Now re-run it, but prefill the start of the model's response with the order of that list changed. It will follow the new order. The list plainly provides real information that influences future predictions; it isn't just a facade for the user.
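The reorder experiment can be sketched as a request payload. This only builds the payload and makes no network call; the model name is a placeholder, and the shape assumes an API that treats a trailing assistant-role message as a prefill the model continues from (the Anthropic Messages API works this way).

```python
# Prefill injection sketch: the conversation ends with a partial assistant
# message containing the checklist in a deliberately changed order. The
# model's continuation then follows the injected order, showing the list
# conditions future tokens rather than decorating the UI.
injected_checklist = (
    "- Rename source/header files\n"
    "- Rename references in all source files\n"
    "- Update build files to point at new source files\n"
)

payload = {
    "model": "example-model",  # placeholder, not a real model id
    "messages": [
        {"role": "user", "content": "Rename MyClass to MyNewClass."},
        # Trailing assistant message = prefill; generation resumes from here.
        {"role": "assistant", "content": injected_checklist},
    ],
}
```

Comparing the continuation against a run with the original checklist order is the observation being described: same prompt, different prefix, different order of actions.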

_puk 5 hours ago | parent | prev [-]

Not to knee jerk on a bro comment, but, bro..

Are you seriously saying that breaking a large, complex problem down into its constituent steps, and then trying to solve each one of them as an individual problem, is just a sensation of rigour?

stvltvs 5 hours ago | parent | next [-]

I believe they're saying that the checkboxes are window dressing, not an accurate reflection of what the LLM has done.

kazinator 5 hours ago | parent | prev | next [-]

To some extent, I could agree with that idea. One purpose of that process is to match the impedance between the problem and human cognition. But that presumes problem solving inherently requires human cognition, which is false; that's just the tool we have for problem solving. When the problem-solving method matches the cognitive strengths and weaknesses of the problem solvers, they do have a certain sensation of having the upper hand over the problem. Part of that comes from the chunking/division allowing the problem solvers to more easily talk about the problem, to have conversations and narratives around it. The ability to spin coherent narratives feels like rigor.

mathisfun123 5 hours ago | parent | prev [-]

I'm saying that's not what the stupid bot is actually doing; it's what Anthropic added to the TUI to make you feel good in your feelies about what the bot is actually doing (spamming).

Edit: I'll give you another example that I realized because someone pointed it out here: when the stupid bot tells you why it fucked up, it doesn't actually understand anything about itself - it's just generating the most likely response given the enormous amount of pontification on the internet about this very subject...

_puk 4 hours ago | parent [-]

I'm not disagreeing in principle, but the detritus left after an Anthropic outage is usually quite usable in a completely fresh session. The amount of context pulled and stored in the sandbox is quite hefty.

Whilst I can't usually start from the exact same point in the decisioning, I can usually bootstrap a new session. It's not all ephemeral.

To your edit: that's the thing I find most galling about the thinking being discarded at cache clear. Reconstructing the logical route it took to the end state is just not the same as the step-by-step process it took in the first place, which again I feel counters your "feelies".

mathisfun123 4 hours ago | parent [-]

> I find that the most galling thing about finding out about the thinking being discarded at cache clear

There's a really simple solution to this galling sensation: simply always keep in mind it's a stupid GenAI chat bot.