Rover222 4 hours ago

I just tried to get Gemini to produce an image of a dog with 5 legs to test this out, and it really struggled with that. It either made a normal dog, or turned the tail into a weird appendage.

Then I asked both Gemini and Grok to count the legs, both kept saying 4.

Gemini just refused to consider it was actually wrong.

Grok seemed to have an existential crisis when I told it it was wrong, becoming convinced that I had given it an elaborate riddle. After thinking for an additional 2.5 minutes, it concluded: "Oh, I see now—upon closer inspection, this is that famous optical illusion photo of a "headless" dog. It's actually a three-legged dog (due to an amputation), with its head turned all the way back to lick its side, which creates the bizarre perspective making it look decapitated at first glance. So, you're right; the dog has 3 legs."

You're right, this is a good test. Right when I'm starting to feel LLMs are intelligent.

macNchz an hour ago | parent | next [-]

An interesting test in this vein that I read about in a comment on here is generating a 13 hour clock—I tried just about every prompting trick and clever strategy I could come up with across many image models with no success. I think there's so much training data of 12 hour clocks that just clobbers the instructions entirely. It'll make a regular clock that skips from 11 to 13, or a regular clock with a plaque saying "13 hour clock" underneath, but I haven't gotten an actual 13 hour clock yet.

RestartKernel 42 minutes ago | parent [-]

Right you are. It can do 26 hours just fine, but appears completely incapable when the layout would be too close to a normal clock.

https://gemini.google.com/share/b3b68deaa6e6

I thought giving it a setting would help, but just skip that first response to see what I mean.

vunderba 2 hours ago | parent | prev | next [-]

If you want to see something rather amusing - instead of using the LLM aspect of Gemini 3.0 Pro, feed a five-legged dog directly into Nano Banana Pro and give it an editing task that requires an intrinsic understanding of the unusual anatomy.

  Place sneakers on all of its legs.
It'll get this correct a surprising number of times (tested with BFL Flux2 Pro and NB Pro).

https://imgur.com/a/wXQskhL

dwringer 4 hours ago | parent | prev | next [-]

I had no trouble getting it to generate an image of a five-legged dog on the first try, but I really was surprised at how badly it failed at telling me the number of legs when I showed it that image in a new context. It wrote a long defense of its reasoning and, when pressed, made up demonstrably false excuses for why it might be getting the wrong answer while still maintaining the wrong answer.

Rover222 3 hours ago | parent [-]

Yeah it gave me the 5-legged dog on the 4th or 5th try.

AIorNot 4 hours ago | parent | prev | next [-]

It's not that they aren't intelligent; it's that they have been RL'd like crazy not to do that.

It's rather like how we as humans are "RL'd" like crazy to be grossed out if we view a picture of a handsome man and a beautiful woman kissing (after we are told they are brother and sister).

I.e. we all have trained biases that we are told to follow and trained on; human art is about subverting those expectations.

majormajor 3 hours ago | parent [-]

Why should I assume this is RL rather than just a prediction failure? A failure that looks like the model doing fairly simple pattern matching ("this is a dog, dogs don't have 5 legs, anything else is irrelevant") rather than more sophisticated feature counting on a concrete instance of an entity could just as well come from training data that contains no 5-legged dogs, plus an inability to go outside-of-distribution.

RL has been used extensively in other areas - such as coding - to improve model behavior on out-of-distribution stuff, so I'm somewhat skeptical of handwaving away a critique of a model's sophistication by saying here it's RL's fault that it isn't doing well out-of-distribution.

If we don't start from a position of anthropomorphizing the model into a "reasoning" entity (and instead have our prior be "it is a black box that has been extensively trained to try to mimic logical reasoning") then the result seems to be "here is a case where it can't mimic reasoning well", which seems like a very realistic conclusion.

mlinhares 3 hours ago | parent | next [-]

I have the same problem: people are trying so hard to come up with reasoning for it when there's just nothing like that there. It was trained on data and it finds what it was trained to find; if you go outside the training data it gets lost, and we should expect it to get lost.

didgeoridoo 2 hours ago | parent | prev [-]

I’m inclined to buy the RL story, since the image gen “deep dream” models of ~10 years ago would produce dogs with TRILLIONS of eyes: https://doorofperception.com/2015/10/google-deep-dream-incep...

irthomasthomas 4 hours ago | parent | prev | next [-]

Isn't this proof that LLMs still don't really generalize beyond their training data?

adastra22 2 hours ago | parent | next [-]

LLMs are very good at generalizing beyond their training (or context) data. Normally when they do this we call it hallucination.

Only now we do A LOT of reinforcement learning afterwards to severely punish this behavior, for subjective eternities. Then we act surprised when the resulting models are hesitant to venture outside their training data.

Zambyte 2 hours ago | parent | prev | next [-]

I wonder how they would behave given a system prompt that asserts "dogs may have more or fewer than four legs".
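Easy enough to try with any OpenAI-style chat API. Here's a minimal sketch of what I mean; the message schema follows the common chat-completions format, but the prompt wording, image URL, and helper name are placeholders I made up, not something I've verified actually fixes the counting:

```python
# Hypothetical sketch: prepend a system prompt that pre-empts the
# "dogs have four legs" prior before asking a vision model to count.
# Nothing here hits the network; it just builds the request payload.

def build_counting_request(image_url: str) -> list[dict]:
    """Build a chat payload whose system prompt relaxes the anatomy prior."""
    return [
        {
            "role": "system",
            "content": (
                "Animals in these images may be anatomically unusual. "
                "Dogs may have more or fewer than four legs. "
                "Count only what is actually visible in the image; "
                "do not assume typical anatomy."
            ),
        },
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "How many legs does this dog have?"},
                {"type": "image_url",
                 "image_url": {"url": image_url}},
            ],
        },
    ]

messages = build_counting_request("https://example.com/five-legged-dog.png")
print(messages[0]["role"])  # system
print(len(messages))        # 2
```

You'd then pass `messages` to whichever client you're using. No idea whether the hint actually survives contact with the model's prior, which is kind of the point of the test.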

irthomasthomas an hour ago | parent [-]

That may work but what actual use would it be? You would be plugging one of a million holes. A general solution is needed.

CamperBob2 3 hours ago | parent | prev | next [-]

They do, but we call it "hallucination" when that happens.

Rover222 3 hours ago | parent | prev [-]

Kind of feels that way

varispeed 40 minutes ago | parent | prev | next [-]

Do a 7-legged dog. Game over.

qnleigh an hour ago | parent | prev [-]

It's not obvious to me whether we should count these errors as failures of intelligence or failures of perception. There's at least a loose analogy to optical illusions, which can fool humans quite consistently. Now you might say that a human can usually figure out what's going on and correctly identify the illusion, but we have the luxury of moving our eyes around the image and taking it in over time, while the model's perception is limited to a fixed set of unchanging tokens. Maybe this is relevant.

(Note I'm not saying that you can't find examples of failures of intelligence. I'm just questioning whether this specific test is an example of one).

cyanmagenta an hour ago | parent [-]

I am having trouble understanding the distinction you're trying to make here. The computer has the same pixel information that humans do and can spend its time analyzing it in any way it wants. My four-year-old can count the legs of the dog (and then say "that's silly!"), whereas LLMs have an existential crisis because five-legged dogs aren't sufficiently represented in the training data. I guess you can call that perception if you want, but I'm comfortable saying that my kid is smarter than LLMs when it comes to this specific exercise.

FeepingCreature 16 minutes ago | parent [-]

Your kid, it should be noted, has a massively bigger brain than the LLM. I think the surprising thing here maybe isn't that the vision models don't work well in corner cases but that they work at all.

Also my bet would be that video capable models are better at this.