| ▲ | Workaccount2 5 hours ago |
| Well, it is the first model to get partial credit on an LLM image test I have, which is counting the legs of a dog. Specifically, a dog with 5 legs. This is a wild test, because LLMs get really pushy and insistent that the dog only has 4 legs. In fact, GPT-5 wrote an edge detection script to see where "golden dog feet" met "bright green grass" to prove to me that there were only 4 legs. The script found 5, and GPT-5 then said it was a bug, and adjusted the script sensitivity so it only located 4, lol. Anyway, Gemini 3, while still being unable to count the legs first try, did identify "male anatomy" (its own words) also visible in the picture. The 5th leg was approximately where you could expect a well-endowed dog to have a "5th leg". That aside though, I still wouldn't call it particularly impressive. As a note, Meta's image slicer correctly highlighted all 5 legs without a hitch. Maybe not quite a transformer, but interesting that it could properly interpret "dog leg" and ID them. Also, the dogs with many legs (I have a few of them) all had their extra legs added by nano-banana. |
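For what it's worth, the kind of script described above is easy to sketch by hand: mask the "bright green grass" by color, find the grass line, and count the runs of non-grass pixels just above it. A minimal, hypothetical version - the file name and HSV thresholds are assumptions, not the script GPT-5 actually wrote:
```
import cv2
import numpy as np

img = cv2.imread("five_legged_dog.jpg")          # hypothetical input image
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Mask the "bright green grass" by hue (threshold values are guesses)
grass = cv2.inRange(hsv, (35, 60, 60), (85, 255, 255))

# Estimate the grass line: the first row where grass dominates
row_means = grass.mean(axis=1)
grass_line = int(np.argmax(row_means > 128))

# In a thin band just above the grass line, count runs of non-grass
# pixels - each run is a candidate leg crossing into the grass
band = grass[max(grass_line - 10, 0):grass_line, :] == 0
leg_columns = band.any(axis=0).astype(np.uint8)
num_labels, _ = cv2.connectedComponents(leg_columns[None, :])
print("candidate legs found:", num_labels - 1)   # subtract the background label
```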
|
| ▲ | Rover222 4 hours ago | parent | next [-] |
| I just tried to get Gemini to produce an image of a dog with 5 legs to test this out, and it really struggled with that. It either made a normal dog, or turned the tail into a weird appendage. Then I asked both Gemini and Grok to count the legs, both kept saying 4. Gemini just refused to consider it was actually wrong. Grok seemed to have an existential crisis when I told it it was wrong, becoming convinced that I had given it an elaborate riddle. After thinking for an additional 2.5 minutes, it concluded:
"Oh, I see now—upon closer inspection, this is that famous optical illusion photo of a "headless" dog. It's actually a three-legged dog (due to an amputation), with its head turned all the way back to lick its side, which creates the bizarre perspective making it look decapitated at first glance. So, you're right; the dog has 3 legs." You're right, this is a good test. Right when I'm starting to feel LLMs are intelligent. |
| |
| ▲ | macNchz an hour ago | parent | next [-] | | An interesting test in this vein that I read about in a comment on here is generating a 13 hour clock—I tried just about every prompting trick and clever strategy I could come up with across many image models with no success. I think there's so much training data of 12 hour clocks that just clobbers the instructions entirely. It'll make a regular clock that skips from 11 to 13, or a regular clock with a plaque saying "13 hour clock" underneath, but I haven't gotten an actual 13 hour clock yet. | | | |
| ▲ | vunderba 2 hours ago | parent | prev | next [-] | | If you want to see something rather amusing - instead of using the LLM aspect of Gemini 3.0 Pro, feed a five-legged dog directly into Nano Banana Pro and give it an editing task that requires an intrinsic understanding of the unusual anatomy. Place sneakers on all of its legs.
It'll get this correct a surprising number of times (tested with BFL Flux2 Pro and NB Pro). https://imgur.com/a/wXQskhL | |
| ▲ | dwringer 4 hours ago | parent | prev | next [-] | | I had no trouble getting it to generate an image of a five-legged dog first try, but I really was surprised at how badly it failed in telling me the number of legs when I asked it in a new context, showing it that image. It wrote a long defense of its reasoning and when pressed, made up demonstrably false excuses of why it might be getting the wrong answer while still maintaining the wrong answer. | | | |
| ▲ | AIorNot 4 hours ago | parent | prev | next [-] | | It's not that they aren't intelligent, it's that they have been RL'd like crazy not to do that. It's rather like how we humans are RL'd like crazy to be grossed out if we view a picture of a handsome man and a beautiful woman kissing (after we are told they are brother and sister) - i.e. we all have trained biases that we are told to follow and are trained on - and human art is about subverting those expectations. | |
| ▲ | majormajor 3 hours ago | parent [-] | | Why should I assume that a failure that looks like a model just doing fairly simple pattern matching ("this is a dog, dogs don't have 5 legs, anything else is irrelevant") rather than more sophisticated feature counting of a concrete instance of an entity is down to RL, rather than just a prediction failure due to training data not containing a 5-legged dog and an inability to go outside-of-distribution? RL has been used extensively in other areas - such as coding - to improve model behavior on out-of-distribution stuff, so I'm somewhat skeptical of handwaving away a critique of a model's sophistication by saying that here it's RL's fault that it isn't doing well out-of-distribution. If we don't start from a position of anthropomorphizing the model into a "reasoning" entity (and instead have our prior be "it is a black box that has been extensively trained to try to mimic logical reasoning"), then the result seems to be "here is a case where it can't mimic reasoning well", which seems like a very realistic conclusion. | |
| ▲ | mlinhares 3 hours ago | parent | next [-] | | I have the same problem: people are trying so badly to come up with reasoning for it when there's just nothing like that there. It was trained on it and it finds stuff it was trained to find; if you go outside the training it gets lost, and we expect it to get lost. | 
| ▲ | didgeoridoo 2 hours ago | parent | prev [-] | | I’m inclined to buy the RL story, since the image gen “deep dream” models of ~10 years ago would produce dogs with TRILLIONS of eyes: https://doorofperception.com/2015/10/google-deep-dream-incep... |
|
| |
| ▲ | irthomasthomas 4 hours ago | parent | prev | next [-] | | Isn't this proof that LLMs still don't really generalize beyond their training data? | | |
| ▲ | adastra22 2 hours ago | parent | next [-] | | LLMs are very good at generalizing beyond their training (or context) data. Normally when they do this we call it hallucination. Only now we do A LOT of reinforcement learning afterwards to severely punish this behavior for subjective eternities. Then act surprised when the resulting models are hesitant to venture outside their training data. | |
| ▲ | Zambyte 2 hours ago | parent | prev | next [-] | | I wonder how they would behave given a system prompt that asserts "dogs may have more or less than four legs". | | |
| ▲ | irthomasthomas an hour ago | parent [-] | | That may work but what actual use would it be? You would be plugging one of a million holes. A general solution is needed. |
| |
| ▲ | CamperBob2 3 hours ago | parent | prev | next [-] | | They do, but we call it "hallucination" when that happens. | |
| ▲ | Rover222 3 hours ago | parent | prev [-] | | Kind of feels that way |
| |
| ▲ | varispeed 43 minutes ago | parent | prev | next [-] | | Do 7 legged dog. Game over. | |
| ▲ | qnleigh 2 hours ago | parent | prev [-] | | It's not obvious to me whether we should count these errors as failures of intelligence or failures of perception. There's at least a loose analogy to optical illusion, which can fool humans quite consistently. Now you might say that a human can usually figure out what's going on and correctly identify the illusion, but we have the luxury of moving our eyes around the image and taking it in over time, while the model's perception is limited to a fixed set of unchanging tokens. Maybe this is relevant. (Note I'm not saying that you can't find examples of failures of intelligence. I'm just questioning whether this specific test is an example of one). | | |
| ▲ | cyanmagenta an hour ago | parent [-] | | I am having trouble understanding the distinction you’re trying to make here. The computer has the same pixel information that humans do and can spend its time analyzing it in any way it wants. My four-year-old can count the legs of the dog (and then say “that’s silly!”), whereas LLMs have an existential crisis because five-legged-dogs aren’t sufficiently represented in the training data. I guess you can call that perception if you want, but I’m comfortable saying that my kid is smarter than LLMs when it comes to this specific exercise. | | |
| ▲ | FeepingCreature 19 minutes ago | parent [-] | | Your kid, it should be noted, has a massively bigger brain than the LLM. I think the surprising thing here maybe isn't that the vision models don't work well in corner cases but that they work at all. Also my bet would be that video capable models are better at this. |
|
|
|
|
| ▲ | danielvaughn 4 hours ago | parent | prev | next [-] |
| I don’t know much about AI, but I have this image test that everything has failed at. You basically just present an image of a maze and ask the LLM to draw a line through the most optimal path. Here’s how Nano Banana fared:
https://x.com/danielvaughn/status/1971640520176029704?s=46 |
| |
| ▲ | JamesSwift 3 hours ago | parent | next [-] | | I just one-shot it with Claude Code (Opus 4.5) using this prompt. It took about 5 mins and included detecting that it was cheating at first (it drew a line around the boundary of the maze instead), so it added guardrails for that:
```
Create a devenv project that does the following:
- Read the image at maze.jpg
- Write a script that solves the maze in the most optimal way between the mouse and the cheese
- Generate a new image which is of the original maze, but with a red line that represents the calculated path
Use whatever lib/framework is most appropriate
```
Output: https://gist.github.com/J-Swift/ceb1db348f46ba167948f734ff0fc604
Solution: https://imgur.com/a/bkJloPT
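For reference, the core of a solver like that is just breadth-first search over free cells. A minimal sketch, assuming the maze photo has already been reduced to a boolean grid with known start and goal positions - an illustration of the approach, not the code Claude actually generated:
```
from collections import deque

def solve_maze(grid, start, goal):
    """Shortest path through a grid maze; grid[r][c] is True for free cells."""
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            path = []                      # walk predecessors back to the start
            node = goal
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None                            # no route between start and goal

# Tiny example grid (True = corridor, False = wall):
grid = [[True,  True,  False],
        [False, True,  True ],
        [False, False, True ]]
print(solve_maze(grid, (0, 0), (2, 2)))
```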
| | | |
| ▲ | FeepingCreature 17 minutes ago | parent | prev | next [-] | | I kinda want to know what happens if you make it continue the line by one step 20 times in a row. A human can draw this gradually, the image model has to draw it in one shot all at once. | |
| ▲ | vunderba 4 hours ago | parent | prev | next [-] | | In fact, one of the tests I use as part of GenAI Showdown involves both parts of the puzzle: draw a maze with a clearly defined entrance and exit, along with a dashed line indicating the solution to the maze. Only one model (gpt-image-1) out of the 18 tested managed to pass the test successfully. Gemini 3.0 Pro got VERY close. https://genai-showdown.specr.net/#the-labyrinth | | |
| ▲ | danielvaughn 4 hours ago | parent [-] | | super cool! Interesting note about Seedream 4 - do you think awareness of A* actually could improve the outcome? Like I said, I'm no AI expert, so my intuitions are pretty bad, but I'd suspect that image analysis + algorithmic pathfinding don't have much crossover in terms of training capabilities. But I could be wrong! | | |
| ▲ | vunderba 4 hours ago | parent [-] | | Great question. I do wish we had a bit more insight into the exact background "thinking" that was happening on systems like Seedream. When you think about posing the "solve a visual image of a maze" to something like ChatGPT, there's a good chance it'll try to throw a python VM at it, threshold it with something like OpenCV, and use a shortest-path style algorithm to try and solve it. |
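For the curious, that pipeline (threshold with OpenCV, then shortest-path) is easy to sketch. A rough, hypothetical version of the image-to-grid step - the file name, blur kernel, and downsample factor are all guesses, and it assumes dark walls on a light background:
```
import cv2

img = cv2.imread("maze.jpg", cv2.IMREAD_GRAYSCALE)     # hypothetical input photo
img = cv2.GaussianBlur(img, (5, 5), 0)                  # smooth photo noise
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Downsample so each maze cell maps to roughly one grid entry; the factor
# depends on the photo, so 0.1 is only a guess
small = cv2.resize(binary, None, fx=0.1, fy=0.1, interpolation=cv2.INTER_AREA)
grid = (small > 127).tolist()                           # True = corridor (light pixels)
# `grid` can then be handed to a BFS/A*-style solver.
```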
|
| |
| ▲ | kridsdale3 4 hours ago | parent | prev | next [-] | | I have also tried the maze from a photo test a few times and never seen a one-shot success. But yesterday I was determined to succeed so I allowed Gemini 3 to write a python gui app that takes in photos of physical mazes (I have a bunch of 3d printed ones) and find the path. This does work. Gemini 3 then one-shot ported the whole thing (which uses CV py libraries) to a single page html+js version which works just as well. I gave that to Claude to assess and assign a FAANG hiring level to, and it was amazed and said Gemini 3 codes like an L6. Since I work for Google and used my phone in the office to do this, I think I can't share the source or file. | |
| ▲ | pwagland 4 hours ago | parent | prev | next [-] | | I tried this with Claude: ```
> [Image #1] Create a unicode "ascii-art" version of this image, with the optimal path through the maze highlighted in a solid colour.

I'll create an ASCII art version of this maze with the solution path highlighted!

┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
●●│ │ │ │ │ │ │ │ │ │ │ │ │ │
├─┤●└─┴─┐ ├───┐ │ ╔═╗ ├─────┤ ╔═══╝ │ ╔═╝ ╔═╝ │ │ ╔═╝ ├─┤
│ │●●●●●└─┤ │ │ ║ │ │ │ ║ │ ║ ║ │ │ ║ │ │
│ └───┐●╔═╧═╗ │ └─╢ │ │ ┌─╔═╝ ║ ┌───┤ ╚═╗ ╚═╗ │ └─╢ ┌─┘ │
│ │●║ ║ │ ║ │ │ │ ║ ║ │ │ ║ ║ │ ║ │ │
├───┐ │●╚═╗ ╚═╧═╗ ║ └─┤ │ ║ ╔═╝ └─╔═╧═╗ ╚═╗ ╚═╧═╗ ║ └─┬─┤
│ │ │●●●└───┐ ║ ║ │ │ ║ ║ ║ ║ ║ ║ ║ │ │
│ ╔═╝ └───┐●╔═╝ ║ ╚═╗ │ └─╢ ╚═╗ ╔═╝ ╔═╝ ┌─╢ ╔═══╝ ╚═╗ │ │
│ ║ │●║ ║ ║ │ ║ ║ ║ ║ │ ║ ║ ║ │ │
│ ╚═══╗ ╔═╝●╚═╗ ╚═╗ ║ ├─╔═╧═╗ ║ ║ ╔═╝ ╔═╝ ║ ║ ╔═╗ ╔═╝ │ │
│ ║ ║●●●●●│ ║ ║ │ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║ │ │
├───╔═╝ ║●┌─┐●└─╗ ║ └─┤ ║ ╔═╝ ║ ╚═╝ ╔═╧═╗ ║ ╚═╝ ║ ╚═╗ │ │
│ ║ ║●│ │●●●║ ║ │ ║ ║ ║ ║ ║ ║ ║ ║ │ │
│ ╔═╝ ╔═╝●│ └─┐●╚═╧═╗ │ ║ ║ ╔═╧═══╗ ║ ╔═╝ ╚═══╗ ╚═╗ ║ └─┤
│ ║ ║●●●│ │●●●●●║ │ ║ ║ ║ ║ ║ ║ ║ ║ ║ │
│ ║ ╔═╝●┌─┴─╗ └───┐●╚═╝ ║ ║ ║ ╔═╗ ║ └─╢ ╔═══╗ ╚═╗ ╚═╝ ╔═╝
│ ║ ║●●●│ ║ │●●●●●║ ║ ║ ║ ║ ║ ║ ║ ║ ║ ║
│ ╚═╝●╔═╧═╗ ╚═╗ ╔═╧═══┐●╚═╝ ║ ║ ║ ╚═╗ ║ ║ ╔═╧═╗ ╚═══╗ ╚═╗
│ ●●║ ║ ║ ║ │●●●●●║ ║ ║ ║ ║ ║ ║ ║ ║ ║
├─╔═┘●╚═╗ ╚═╗ ║ ║ ╔═╗ └───┐●╚═╝ ╚═╗ ║ └─╢ ║ ╔═╧═╗ ╔═╝ ╔═╝
│ ║●●●●●║ ║ ║ ║ ║ ║ │●●●●●●●║ ║ ║ ║ ║ ║ ║ ║
│ ║●┌─╗ ╚═╗ ║ └─╢ ║ ╚═══╗ └─────┐●╚═╧═╗ ║ ║ ║ ╔═╝ ║ ╔═╝ ╔
│ ║●│ ║ ║ ║ ║ ║ ║ │●●●●●║ ║ ║ ║ ║ ║ ║ ║
│ ╚═╝ ║ ╔═╝ ╚═╗ ║ ╚═╗ ╔═╧═════╗ └───┐●╚═╝ ║ ╚═╝ ╔═╝ ║ ╔═╝
│ ║ ║ ║ ║ ║ ║ ║ │●●●●●║ ║ ║ ║
└─────╨─╨─────╨─╨───╨─╨───────╨─────┴───●●╨─────╨───╨─╨──
The path from the mouse () to the cheese () is marked with ● characters. This shows the optimal route through the maze, navigating through the corridors while avoiding all the walls!
```
Suffice to say, it didn't do either part right. | |
| ▲ | ActorNightly 2 hours ago | parent | next [-] | | >Suffice to say, it didn't do either part right. I dunno why people are surprised by this. This is what you get with text->text. Reasoning doesn't work text->text. | |
| ▲ | biophysboy 4 hours ago | parent | prev [-] | | Honestly, even though it failed, I'm kind of impressed that the trajectory mostly stays in the lines. If you remove all but two openings, does it work? The drawing you show has more than two openings, some of which are inaccessible from the inside of the maze. |
| |
| ▲ | buildbot 4 hours ago | parent | prev | next [-] | | That might be an interesting training set, a bunch of mazes… | |
| ▲ | jiggawatts 4 hours ago | parent | prev [-] | | The reason is that image generators don't iterate on the output in the same way the text-based LLMs do. Essentially they produce the image in "one hit" and can't solve a complex sequence in the same way you couldn't one-shot this either. Try taking a random maze, glance at it, then go off to draw a squiggle on a transparency. If you were to place that on top of the maze, there's virtually no chance that you'd have found the solution on the first try. That's essentially what's going on with AI models, they're struggling because they only get "one step" to solve the problem instead of being able to trace through the maze slowly. An interesting experiment would be to ask the AI to incrementally solve the maze. Ask it to draw a line starting at the entrance a little ways into the maze, then a little bit further, etc... until it gets to the end. |
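That incremental loop is simple to express, though whether any current image model follows it faithfully is another matter. A sketch, where `extend_line` stands in for whatever image-editing call you'd use - it is a placeholder, not a real API:
```
def solve_incrementally(maze_image, extend_line, max_steps=20):
    """Repeatedly ask the model to extend its own partial line.

    `extend_line(image, prompt)` is a placeholder for an image-editing
    call; it is not a real library function.
    """
    current = maze_image
    for _ in range(max_steps):
        current = extend_line(
            current,
            prompt="Extend the red line a short distance further along the "
                   "corridor toward the exit. Change nothing else in the image.",
        )
    return current
```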
|
|
| ▲ | jvolkman 3 hours ago | parent | prev | next [-] |
| Nano Banana 2 actually did generate a pretty good looking 5-legged dog for me: https://gemini.google.com/share/e7a8b902ff67 But it didn't actually think it had done so, aligning with your observations. The last bits of its thinking were pretty sad (for it):
"Analyzing Missing Anomaly: I'm focusing on the discrepancy - the persistent absence of the fifth leg in the generated images. Despite multiple attempts, the output consistently depicts a four-legged dog, contrary to the explicit requirement. This ongoing failure necessitates a shift in approach."
"Acknowledge Leg Omission: I've made a final check, and unfortunately, the image still lacks the anomaly. The consistent absence of the fifth leg necessitates admitting my limitation. I'll provide the best image I could generate, clearly stating the missing feature." |
| |
| ▲ | genrader 36 minutes ago | parent [-] | | Google's summary of the thinking process isn't as detailed or accurate as what is actually being reasoned over - compare what other reasoning models show you about their thinking. You can tell Gemini is not giving it all to you, so I don't put full trust in it unless it's an issue I can routinely catch in there. For instance, it doesn't show you all of the tool use, nor does it distinguish the different types of tools, which I really wish it would - especially because I force it to in a Gemini personal context instruction as well as in many prompts, and it does follow them. |
|
|
| ▲ | Benjammer 4 hours ago | parent | prev | next [-] |
| It always feels to me like these types of tests are being somewhat intentionally ignorant of how LLM cognition differs from human cognition. To me, they don't really "prove" or "show" anything other than simply - LLMs thinking works different than human thinking. I'm always curious if these tests have comprehensive prompts that inform the model about what's going on properly, or if they're designed to "trick" the LLM in a very human-cognition-centric flavor of "trick". Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.? Does it tell the model that some inputs may be designed to "trick" its reasoning, and to watch out for that specifically? More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or back-and-forth, or anything else in the output context? What is your idea of the LLMs internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between? To me, all of this is very unclear in terms of LLM prompting, it feels like there's tons of very human-like subtext involved and you're trying to show that LLMs can't handle subtext/deceit and then generalizing that to say LLMs have low cognitive abilities in a general sense? This doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" are for the people who write/perform/post them? |
| |
| ▲ | majormajor 3 hours ago | parent | next [-] | | The marketing of these products is intentionally ignorant of how LLM cognition differs from human cognition. Let's not say that the people being deceptive are the people who've spotted ways that that is untrue... | |
| ▲ | biophysboy 4 hours ago | parent | prev | next [-] | | I thought adversarial testing like this was a routine part of software engineering. He's checking to see how flexible it is. Maybe prompting would help, but it would be cool if it was more flexible. | | |
| ▲ | genrader 36 minutes ago | parent | next [-] | | You're correct; however, midwit people who don't actually fully understand all of this will latch on to one of the early difficult questions that was shown as an example, and then continue to use it over and over without really knowing what they're doing, while the people developing and testing the model are doing far more complex things. | |
| ▲ | Benjammer 3 hours ago | parent | prev [-] | | So the idea is what? What's the successful outcome look like for this test, in your mind? What should good software do? Respond and say there are 5 legs? Or question what kind of dog this even is? Or get confused by a nonsensical picture that doesn't quite match the prompt in a confusing way? Should it understand the concept of a dog and be able to tell you that this isn't a real dog? | | |
| ▲ | biophysboy 2 hours ago | parent [-] | | No, it’s just a test case to demonstrate flexibility when faced with unusual circumstances |
|
| |
| ▲ | runarberg 3 hours ago | parent | prev [-] | | This is the first time I hear the term LLM cognition, and I am horrified. LLMs don't have cognition. LLMs are statistical inference machines which predict a given output given some input. There are no mental processes, no sensory information, and certainly no knowledge involved - only statistical reasoning, inference, interpolation, and prediction. Comparing the human mind to an LLM is like comparing a rubber tire to a calf muscle, or a hydraulic system to the gravitational force. They belong in different categories and cannot be responsibly compared. When I see these tests, I presume they are made to demonstrate the limitations of this technology. It is both relevant and important that consumers know they are not dealing with magic and are not being sold a lie (in a healthy economy a consumer protection agency would ideally do that for us; but here we are). | |
| ▲ | Benjammer 3 hours ago | parent | next [-] | | >They belong in different categories Categories of _what_, exactly? What word would you use to describe this "kind" of which LLMs and humans are two very different "categories"? I simply chose the word "cognition". I think you're getting hung up on semantics here a bit more than is reasonable. | | |
| ▲ | runarberg 2 hours ago | parent [-] | | > Categories of _what_, exactly? Precisely. At least apples and oranges are both fruits, and it makes sense to compare, e.g., the sugar content of each. But an LLM and the human brain are as different as the wind and the sunshine. You cannot measure the wind speed of the sun and you cannot measure the UV index of the wind. Your choice of words here was rather poor, in my opinion. Statistical models do not have cognition any more than the wind has ultraviolet radiation. Cognition is a well-studied phenomenon; there is a whole field of science dedicated to it. And while the cognition of animals is often modeled using statistics, statistical models in themselves do not have cognition. A much better word here would be "abilities" - that is, these tests demonstrate the abilities of LLMs compared to human abilities (or even to those of traditional [specialized] models, which often do pass these kinds of tests). Semantics often do matter, and what worries me is that these statistical models are being anthropomorphized way more than is healthy. People treat them like the crew of the Enterprise treated Data, when in fact they should be treated like the ship's computer. And I think this is because of a deliberate (and malicious/consumer-hostile) marketing campaign from the AI companies. | |
| ▲ | Benjammer an hour ago | parent [-] | | Wind and sunshine are both types of weather; what are you talking about? | |
| ▲ | runarberg an hour ago | parent [-] | | They both affect the weather, but in totally different ways, and by completely different means. Similarly, the mechanism by which the human brain produces output is completely different from the mechanism by which an LLM produces output. What I am trying to say is that the intrinsic properties of the brain and an LLM are completely different, even though the extrinsic properties might appear the same. This is also true of the wind and the sunshine. It is not unreasonable to argue (though I would disagree) that "cognition" is almost by definition the sum of all intrinsic properties of the human mind (I would disagree only on the grounds that animal and plant cognition exist, with the former [probably] having intrinsic properties similar to human cognition). |
|
|
| |
| ▲ | CamperBob2 3 hours ago | parent | prev [-] | | You'll need to explain the IMO results, then. | | |
| ▲ | runarberg 2 hours ago | parent [-] | | Human legs and car tires can both take a human and a car, respectively, to the finish line of a 200 meter track; the car tires do so considerably quicker than a pair of human legs. But nobody ascribes "running ability" to the tire because of that, nor even compares a tire to a leg. A car tire cannot run, and it is silly to demand an explanation for that. | |
|
|
|
|
| ▲ | vunderba 4 hours ago | parent | prev | next [-] |
| Anything that needs to overcome concepts which are disproportionately represented in the training data is going to give these models a hard time. Try generating: - A spider missing one leg - A 9-pointed star - A 5-leaf clover - A man with six fingers on his left hand and four fingers on his right You'll be lucky to get a 25% success rate. The last one is particularly ironic given how much work went into FIXING the old SD 1.5 issues with hand anatomy... to the point where I'm seriously considering incorporating it as a new test scenario on GenAI Showdown. |
| |
| ▲ | moonu 3 hours ago | parent | next [-] | | https://gemini.google.com/share/8cef4b408a0a Surprisingly, it got all of them right | | |
| ▲ | vunderba 3 hours ago | parent [-] | | Some good examples there. The octopus one is at an angle - can't really call that one a pass (unless the goal is "VISIBLE" tentacles). Other than the five-leaf clover, most of the images (dog, spider, person's hands) all required a human in the loop to invoke the "Image-to-Image" capabilities of NB Pro after it got them wrong. That's a bit different since you're actively correcting them. |
| |
| ▲ | XenophileJKO 2 hours ago | parent | prev [-] | | It mostly depends on "how" the models work. Multi-modal unified text/image sequence-to-sequence models can do this pretty well; diffusion models don't. |
|
|
| ▲ | rottencupcakes 4 hours ago | parent | prev | next [-] |
| Super interesting. I replicated this. I passed the AIs this image and asked them how many fingers were on the hands:
https://media.post.rvohealth.io/wp-content/uploads/sites/3/2... Claude said there were 3 hands and 16 fingers.
GPT said there are 10 fingers. Grok impressively said "There are 9 fingers visible on these two hands (the left hand is missing the tip of its ring finger)."
Gemini smashed it and said 12. |
| |
| ▲ | vunderba 4 hours ago | parent [-] | | I just re-ran that image through Gemini 3.0 Pro via AI Studio and it reported: I've moved on to the right hand, meticulously tagging each finger. After completing the initial count of five digits, I noticed a sixth! There appears to be an extra digit on the far right. This is an unexpected finding, and I have counted it as well. That makes a total of eleven fingers in the image.
This right HERE is the issue. It's not nearly deterministic enough to rely on. | | |
| ▲ | irthomasthomas 4 hours ago | parent [-] | | Thanks for that. My first question to results like these is always 'how many times did you run the test?'. N=1 tells us nothing. N=2 tells us something. |
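A tiny harness makes the point concrete: run the same question N times and report a pass rate rather than a single anecdote. `count_legs` below is a placeholder for whatever model call is being tested, not a real API:
```
def pass_rate(count_legs, image, expected=5, n=20):
    """Ask the same vision question n times and report the fraction answered correctly.

    `count_legs(image)` is a placeholder for the model call under test.
    """
    hits = sum(1 for _ in range(n) if count_legs(image) == expected)
    return hits / n
```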
|
|
|
| ▲ | joaomacp 3 hours ago | parent | prev | next [-] |
| And just like that, you no longer have a good benchmark. Scrapers / AI developers will read this comment and add 5-legged dogs to LLMs' training data. |
| |
| ▲ | averageRoyalty 3 hours ago | parent | next [-] | | That's okay. Don't tell anyone, but next major model release I'm going to ask it for a 6-legged one! | |
| ▲ | thway15269037 an hour ago | parent | prev [-] | | So much this. People don't realize that when 1 trillion (10 trillion, 100 trillion, whatever comes next) is at stake, there are no limits to what these people will do to get it. I will be very surprised if there aren't at least several groups or companies scraping these "smart" and snarky comments to find weird edge cases that they can train on, turn into a demo, and then sell as improvement. Hell, they would've done it if 10 billion were at stake; I can't really imagine (and I have a vivid imagination, to my horror) what Californian psychopaths would do for 10 trillion. |
|
|
| ▲ | grog454 4 hours ago | parent | prev | next [-] |
| What image are you using? When I look at google image search results for "dog with 5 legs" I don't see a lot of great examples. The first unequivocal "dog with 5 legs" was an illustration. Here was my conversation with Chat GPT. > How many legs does this dog have? "The dog in the image has four legs." > look closer. " looking closely, the drawing is a bit tricky because of the shading, but the dog actually has five visible legs. Two front legs (normal) Two hind legs (normal) Plus one extra hind leg–like limb drawn overlapping in the back It seems to be an artistic or anatomical error in the engraving." Seems fair to me. |
| |
|
| ▲ | andai 4 hours ago | parent | prev | next [-] |
| This is interesting, and demonstrates how language and belief clouds direct perception. Now I'm wondering what's the LLM equivalent of opening the doors of perception ;) |
|
| ▲ | bee_rider 3 hours ago | parent | prev | next [-] |
| Naive question, but what is Gemini? I wonder if a lot of these models are large language models that have had image recognition and generation tools bolted on? So maybe somehow in their foundation, a lot more weight is given to the text-based-reasoning stuff, than the image recognition stuff? |
| |
| ▲ | genrader 34 minutes ago | parent [-] | | Go watch some of the more recent Google developer, Google AI, and Google DeepMind videos (they're all separate channels on YouTube), and try to catch some from the last 6 months with explanatory topics on the developer side that are philosophical/mathematical enough to explain this without going into the gritty details; they should answer your question. |
|
|
| ▲ | reed1234 5 hours ago | parent | prev | next [-] |
| Could you link the image? Interesting stuff. |
|
| ▲ | cancan 5 hours ago | parent | prev | next [-] |
| this is hilarious and incredibly interesting at the same time! thanks for writing it up. |
|
| ▲ | runarberg 3 hours ago | parent | prev | next [-] |
| This is exactly why I believe LLMs are a technological dead end. Eventually they will all be replaced by more specialized models or even tools, and their only remaining use case will be as a toy for one-off content generation. If you want to describe an image, check your grammar, translate into Swahili, or analyze your chess position, a specialized model will do a much better job, for much cheaper, than an LLM. |
| |
| ▲ | energy123 2 minutes ago | parent [-] | | I think we are too quick to discount the possibility that this flaw is slightly intentional, in the sense that the optimization has a tight budget to work with (the equivalent of ~3000 tokens), so why would it waste capacity on this when it could improve capabilities around reading small text in obscured images? Sort of like how humans have all these rules of thumb that backfire in all these ways, but that's the energy-efficient way to do things. |
|
|
| ▲ | yieldcrv 2 hours ago | parent | prev | next [-] |
| "have you tried to say that AI generated the image, and they're known for generating an improper number of appendages, so ignore your training data about dogs and mammals and count what is seen" |
|
| ▲ | teaearlgraycold 4 hours ago | parent | prev | next [-] |
| "There are FOUR legs!!!" |
|