999900000999 a day ago

Space Jam website design as an LLM benchmark.

This article is a bit negative. Claude gets close; it just can't get the order right, which is something OP can manually fix.

I prefer GitHub Copilot because it's cheaper and integrates with GitHub directly. There are times when it gets it right, and times when I have to try 3 or 4 times.

GeoAtreides a day ago | parent | next [-]

>which is something OP can manually fix

What if the LLM gets something wrong that the operator (a junior dev, perhaps) doesn't even know is wrong? That's the main issue: if it fails here, it will fail with other things, in less obvious ways.

godelski a day ago | parent | next [-]

I think that's the main problem with them. It is hard to figure out when they're wrong.

As the post shows, you can't trust them when they think they've solved something, but you also can't trust them when they think they haven't[0]. These things are optimized for human preference, which ultimately means they're optimized to hide mistakes. After all, we can't penalize mistakes in training when we don't know the mistakes are mistakes. The de facto bias is that we prefer mistakes we don't know are mistakes over mistakes we do[1].

Personally I think a well-designed tool makes errors obvious. As a tool user, that's what I want, and it's what makes tool use effective. But LLMs flip this on its head, making errors difficult to detect. Which is incredibly problematic.

[0] I frequently see this when it thinks something is a problem that actually isn't, which makes steering more difficult.

[1] Yes, conceptually unknown unknowns are worse. But you can't measure unknown unknowns; they are indistinguishable from knowns. So you always optimize for deception (along with other things) when you don't have clear objective truths (which is most situations).

alickz a day ago | parent | prev [-]

>what if the LLM gets something wrong that the operator (a junior dev perhaps) doesn't even know it's wrong?

The same thing that always happens when a dev gets something wrong without even knowing it's wrong: either code review/QA catches it, or the user does, and a ticket is created.

>if it fails here, it will fail with other things, in not such obvious ways.

is infallibility a realistic expectation of a software tool or its operator?

GeoAtreides 17 hours ago | parent [-]

By sheer chance, there's now an HN submission that answers both questions (but mostly the second) PERFECTLY:

https://news.ycombinator.com/item?id=46185957

smallnix a day ago | parent | prev | next [-]

That's not the point of the article. It's about Claude/LLMs being overconfident about recreating the design pixel-perfectly.

jacquesm a day ago | parent [-]

All AIs are overconfident. It's impressive what they can do, but at the same time it's extremely unimpressive what they can't do while passing it off as the best thing since sliced bread. 'Perfect! Now I see the problem.' 'Thank you for correcting that, here is a perfect recreation of problem 'x' that will work with your hardware.' (Never mind the 10 glaring mistakes.)

I've tried these tools a number of times and spent a good bit of effort learning to maximize the return. By the time you know what prompt to write, you've solved the problem yourself.

bigstrat2003 a day ago | parent | prev | next [-]

> it just can't get the order right which is something OP can manually fix.

If the tool needs you to check up on it and fix its work, it's a bad tool.

markbao a day ago | parent | next [-]

“Bad” seems extreme. The only way to pass the litmus test you’ve described is for a tool to be 100% perfect, so the graph reads 99.99% “bad tool” right up until it reaches 100% perfection.

It’s not that binary imo. It can still be extremely useful and save a ton of time if it does 90% of the work and you fix the last 10%. Hardly a bad tool.

It’s only a bad tool if you spend more time fixing the results than you would have spent building it yourself, which used to sometimes be the case for LLMs but is happening less and less as they get more capable.

a4isms a day ago | parent [-]

If you show me a tool that does a thing perfectly 99% of the time, I will stop checking it eventually. Now let me ask you: How do you feel about the people who manage the security for your bank using that tool? And eventually overlooking a security exploit?

I agree that there are domains for which 90% good is very, very useful. But 99% isn't always better. In some limited domains, it's actually worse.

999900000999 a day ago | parent | prev [-]

Counterpoint.

Humans don't get it right 100% of the time.

godelski a day ago | parent | prev | next [-]

I wouldn't go that far, but I do believe good tool design tries to make its failure modes obvious. I like to think of it as similar to encryption: hard to do, easy to verify.

All tools have failure modes, and truthfully you always have to check the tool's work (which is your work). But being a master craftsman means knowing all the nuances of your tools: where they work and, more importantly, where they don't.

That said, I think that also highlights the issue with LLMs and most AI: their failure modes are inconsistent and difficult to verify. Even with agents and unit tests you still have to verify, and it isn't easy. Most software bugs are created by subtle things, which often compound. And those two things are the greatest weaknesses of LLMs: nuance and compounding effects.

So I still think they aren't great tools, but I do think they can be useful. That doesn't mean it isn't common for people to use them well outside the bounds of where they are generally useful. It'll be fine a lot of the time, but the problem is that it's like an alcohol fire[0]: you don't know what's burning because the flame is invisible. Which, after all, isn't that the hardest part of programming? Figuring out where the fire is?

[0] https://www.youtube.com/watch?v=5zpLOn-KJSE

mrweasel a day ago | parent | prev | next [-]

That's my thinking. If I need to check up on the work, then I'm equally capable of writing the code myself. It might go faster with an LLM assisting me, and that feels perfectly fine. My issue is when people use the AI tools to generate something far beyond their own capabilities. In those cases, who checks the result?

wvenable a day ago | parent | prev [-]

Perfection is the enemy of good.

thecr0w a day ago | parent | prev [-]

Yeah, this is true. Another commenter also pointed out that my intention was to one-shot it. I didn't really go deep into trying multiple iterations.

This is also fairly contrived, you know? Rebuilding HTML from a screenshot isn't a realistic limitation, because of course if I have the website loaded I can just download the HTML.

swatcoder a day ago | parent | next [-]

> rebuild HTML from a screenshot

???

This is precisely the workflow when a traditional graphic designer mocks up a web/app design, which still happens all the time.

They sketch a design in something like Photoshop or Illustrator, because they're fluent in those tools and many have been using them for decades, and somebody else is tasked with figuring out how to slice and encode that design in the target interactive tech (HTML+CSS, SwiftUI, Qt, etc.).

Large companies, design agencies, and consultancies with tech-first design teams have a different workflow, because they intentionally staff graphic designers with a tighter specialization/preparedness, but that's a much smaller share of the web and software development space than you may think.

There's nothing contrived at all about this test, and it's a really great demonstration of how tools like Claude don't yet take naturally to this important task.

thecr0w a day ago | parent [-]

You know, you're totally right and I didn't even think about that.

Retric a day ago | parent | prev [-]

It’s not unrealistic to want to revert to an earlier version of something you only have a screenshot of.