eaf7e281 5 hours ago

There's no way they actually work on training this.

margalabargala 4 hours ago | parent | next [-]

I suspect they're training on this.

I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.

https://i.imgur.com/UvlEBs8.png

WarmWash 4 hours ago | parent | next [-]

It would be way way better if they were benchmaxxing this. The pelican in the image (both images) has arms. Pelicans don't have arms, and a pelican riding a bike would use its wings.

ryandrake 3 hours ago | parent | next [-]

Having briefly worked in the 3D Graphics industry, I don't even remotely trust benchmarks anymore. The minute someone's benchmark performance becomes a part of the public's purchasing decision, companies will pull out every trick in the book--clean or dirty--to benchmaxx their product. Sometimes at the expense of actual real-world performance.

seanhunter 3 hours ago | parent | prev [-]

Pelicans don’t ride bikes. You can’t have scruples about whether or not the image of a pelican riding a bike has arms.

jevinskie 3 hours ago | parent [-]

Wouldn’t any decent bike-riding pelican have a bike tailored to pelicans and their wings?

actsasbuffoon an hour ago | parent | next [-]

Sure, that’s one solution. You could also Isle of Dr Moreau your way to a pelican that can use a regular bike. The sky is the limit when you have no scruples.

cinntaile 3 hours ago | parent | prev [-]

Now that would be a smart chat agent.

mrandish 4 hours ago | parent | prev | next [-]

Interesting that it seems better. Maybe adding a highly specific yet unusual qualifier focuses attention?

riffraff 3 hours ago | parent | prev [-]

Perhaps try a penny-farthing?

KeplerBoy 5 hours ago | parent | prev | next [-]

There is no way they are not training on this.

4 hours ago | parent | next [-]

[deleted]

collinmanderson 5 hours ago | parent | prev [-]

I suspect they have generic SVG drawing that they focus on.

fragmede 3 hours ago | parent | prev [-]

The people who work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. However valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?

$200 * 1,000 = $200k/month.

I'm not saying they are. But to say with such certainty that they aren't, when money is on the line, seems like a questionable conclusion, unless you have some insider knowledge you'd like to share with the rest of the class.