| ▲ | WillPostForFood 8 days ago |
| Your LLM output seems abnormally bad, like you are using old models, bad models, or intentionally poor prompting. I just copied and pasted your Krita example into ChatGPT and got a reasonable answer, nothing like what you paraphrased in your post. https://imgur.com/a/O9CjiJY |
|
| ▲ | yosefk 3 days ago | parent | next [-] |
| The examples are from the latest versions of ChatGPT, Claude, Grok, and Google AI Overview. I did not bother to list the full conversations because (A) LLMs are very verbose and (B) nothing ever reproduces, so any given failure can always be dismissed as "abnormally bad." I guess dismissing failures and focusing on successes is a natural continuation of our industry's trend of shipping software with bugs that allegedly don't matter because they're rare, except with "AI" the MTBF is orders of magnitude shorter.
|
| ▲ | typpilol 8 days ago | parent | prev | next [-] |
| This seems like a common theme with these types of articles. |
| |
| ▲ | eru 8 days ago | parent [-] |
| Perhaps the people who get decent answers don't write articles about them? |
| ▲ | ehnto 8 days ago | parent [-] |
| I imagine people give up silently more often than they write a well-syndicated article about it. The actual adoption and efficiency gains we see in enterprises will be the most verifiable data on whether LLMs are generally useful in practice. Everything so far is just academic pontificating or anecdata from strangers online. |
| ▲ | eru 8 days ago | parent | next [-] |
| I am inclined to agree. However, I'm not completely sure. Eg object-oriented programming was basically a useless fad full of empty, never-delivered-on promises, but software companies still lapped it up. (If you happen to like OOP, you can probably substitute your own favourite software or wider management fad.) Another objection: even an LLM with limited capabilities and glaring flaws can still be useful for some commercial use cases. Eg the job of first-line call centre agents who aren't allowed to deviate from a fixed script can be reasonably automated with even a fairly bad LLM. Will it suck occasionally? Of course! But so does interacting with the humans placed into these positions without authority to get anything done for you. So if the bad LLM is cheaper, it might be worthwhile. |
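| A minimal sketch of that "fixed script" idea, assuming the OpenAI Python SDK; the model name, script, and helper function are illustrative placeholders, not anything from this thread: |

    # Hypothetical first-line support bot pinned to a fixed script.
    # Everything here (model, script, escalation wording) is a placeholder.
    from openai import OpenAI

    SCRIPT = (
        "You are a first-line support agent. You may ONLY: "
        "(1) ask for the customer's order number, "
        "(2) read the order status back verbatim from the CONTEXT block, "
        "(3) for anything else, reply exactly: "
        "'I will escalate this to a human colleague.' "
        "Never improvise refunds, discounts, or technical advice."
    )

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def answer(customer_message: str, context: str) -> str:
        # temperature=0 keeps the cheap model as close to the script as possible
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # stand-in for "a fairly bad but cheap LLM"
            messages=[
                {"role": "system", "content": SCRIPT + "\n\nCONTEXT:\n" + context},
                {"role": "user", "content": customer_message},
            ],
            temperature=0,
        )
        return resp.choices[0].message.content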
| ▲ | libraryofbabel 8 days ago | parent | prev [-] |
| This. I think we’ve about reached the limit of the usefulness of anecdata “hey, I asked an LLM this, this, and this” blog posts. We really need more systematic, large-scale data and studies on the latest models and tools. The recent one on Cursor (which had mixed results) was a good start, but it was carried out before Claude Code was even released, i.e. prehistoric times in terms of AI coding progress. For my part I don’t really have a lot of doubts that coding agents can be a useful productivity boost on real-world tasks. Setting aside personal experience, I’ve talked to enough developers at my company using them for a range of tickets on a large codebase to know that they are. The question is more how much: are we talking a 20% boost, or something larger? And what are the specific tasks they’re most useful on? I do hope in the next few years we can get some systematic answers to that as an industry, answers that go beyond people asking LLMs random things and trying to reason about AI capabilities from first principles. |
|
|
|
|
| ▲ | marcellus23 8 days ago | parent | prev [-] |
| I think it's hard to take any LLM criticism seriously if the author doesn't even specify which model they used. Saying "an LLM model" is totally useless for deriving any kind of conclusion. |
| |
| ▲ | ehnto 8 days ago | parent | next [-] |
| When talking about the long-term capabilities of a class of tools, it makes sense to be general. I think deriving conclusions at all is pretty difficult given how fast everything is moving, but there are some realities we do actually know about how LLMs work, and we can talk about those. Knowing that ChatGPT output good tokens last Tuesday but Sonnet didn't does not help us know much about the future of the tools in general. |
| ▲ | dpoloncsak 6 days ago | parent [-] |
| > Knowing that ChatGPT output good tokens last Tuesday but Sonnet didn't does not help us know much about the future of the tools in general.
| Isn't that exactly what is going to help us understand the value these tools bring to end users, and how to optimize these tools for better future use? None of these models are copy+pastes of each other; they tend to be doing things slightly differently under the hood. How those differences affect results seems like exactly the data we would want here. |
| ▲ | ehnto 6 days ago | parent [-] |
| I guess I disagree that the main concern is the differences between individual models, rather than the overall technology of LLMs in general. Given how fast it's all changing, I would personally rather focus on the broader conversation. I don't really care whether GPT-5 is better at benchmarks; I care whether LLMs are actually capable of the kind of reasoning and productive output that the world currently thinks they are. |
| ▲ | marcellus23 5 days ago | parent [-] |
| Sure, but if you're making a point about LLMs in general, you need to use examples from best-in-class models. Otherwise your examples of how these models fail are meaningless. It would be like complaining that smartphone cameras are inherently terrible when none of your example photos are labeled with which phone was used to capture them. How can anyone infer anything meaningful from that? |
|
|
| |
| ▲ | p1esk 8 days ago | parent | prev [-] |
| Yes, I’d be curious about his experience with the GPT-5 Thinking model. So far I haven’t seen any blunders from it. |
| ▲ | eru 8 days ago | parent [-] |
| I've seen plenty of blunders, but in general it's better than their previous models. Well, it depends a bit on what you mean by blunders. But eg I've seen it confidently assert mathematically wrong statements with nonsense proofs, instead of admitting that it doesn't know. |
| ▲ | grey-area 8 days ago | parent [-] |
| In a very real sense it doesn’t even know that it doesn’t know. |
| ▲ | eru 8 days ago | parent [-] |
| Maybe. But in math you can either produce the proof (with each step checkable) or you can't. |
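| A toy illustration of "each step checkable", in Lean; the theorem and names are arbitrary and purely illustrative. The kernel either accepts the proof or rejects it, so there is no room for a confidently asserted nonsense proof: |

    -- Lean mechanically verifies every step of this proof; a bogus step
    -- would fail to compile instead of being stated with confidence.
    theorem add_comm_example (a b : Nat) : a + b = b + a :=
      Nat.add_comm a b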
|
|
|
|