petercooper 10 days ago

Its image processing is terrible. I ran several tests comparing it with Qwen 3.5 0.8b (yes, 7% of the size), and Qwen beat it every time, with Gemma often getting things entirely wrong. I even gave it a plain image saying "This is a test" and it thought for 6 minutes trying to analyze it and failed. Qwen 3.5 0.8b confidently got it in under a second.

It may be that the Q6 quant I got is borked (or my LM Studio is), but either way, the 0.8b's performance is mind-boggling in comparison.

CMay 9 days ago | parent | next [-]

For Qwen 3.5 0.8B presumably you're running it unquantized, because it's so small. Get at least the Q8 of Gemma 4 12B with the F32 mmproj and use an f16 kv cache.

Then run it with the latest llama.cpp that contains the Gemma 4 12B unified bug fixes, using --image-min-tokens 560 --image-max-tokens 2240 --batch-size 4096 --ubatch-size 4096 --temp 1.0 --top-p 0.95 --top-k 64 --jinja
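A rough sketch of how those settings could be assembled into a single command line, in case it helps to see them together. The sampling and image-token flags are copied from the list above; the `llama-server` binary, the `-m`, `--mmproj` and `--cache-type-k`/`--cache-type-v` options, and both file names are assumptions for illustration, so check them against `--help` on your own build.

```python
import shlex

# Hypothetical file names: substitute the GGUF files you actually downloaded.
MODEL_PATH = "gemma-4-12b-Q8_0.gguf"
MMPROJ_PATH = "mmproj-F32.gguf"


def build_command(model_path, mmproj_path):
    """Return the argument list for one server launch."""
    return [
        "llama-server",              # assumed binary; other llama.cpp tools may differ
        "-m", model_path,            # assumed flag for the main model file
        "--mmproj", mmproj_path,     # assumed flag for the vision projector
        "--cache-type-k", "f16",     # assumed spelling of the "f16 kv cache" setting
        "--cache-type-v", "f16",
        # Everything below is taken verbatim from the comment.
        "--image-min-tokens", "560",
        "--image-max-tokens", "2240",
        "--batch-size", "4096",
        "--ubatch-size", "4096",
        "--temp", "1.0",
        "--top-p", "0.95",
        "--top-k", "64",
        "--jinja",
    ]


command = build_command(MODEL_PATH, MMPROJ_PATH)

# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(command))
```

Building the arguments as a list rather than one long string keeps paths with spaces safe and makes it easy to change one setting at a time when comparing results.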

It's understanding far more complex things for me and can reliably handle tiny text, so it should be easily understanding an image that only contains the text "This is a test".

usef- 9 days ago | parent | prev | next [-]

That sounds like a bug. They're very common for open model releases on the first day. If I wasn't on mobile I'd try it on Google's own app.

JacobAsmuth 9 days ago | parent | prev | next [-]

Sounds like you're doing it wrong, to be honest.

ma2kx 10 days ago | parent | prev | next [-]

I guess Google implements more / stronger guard rails than Alibaba and thus confuses these small models. At least this was my impression with Gemma3 models where it often said that the image contains some nudity / sex scenes and therefore it cannot give a description of the image. Never understood the point of this behavior....

jimmy76615 9 days ago | parent [-]

The biggest problem with all the Google models has always been RLHF, particularly safety training. They take a good, smart model and make it behave like a corporate person that has been to far too many forced anti-{sexism, racism...} seminars, so that it is now living in fear of saying something that could be construed as wrong by some moral standard.

staticman2 9 days ago | parent | next [-]

This is almost certainly not true.

If it was, they wouldn't need to be using the classifiers they are using to warn Gemini about problematic prompts.

ai_fry_ur_brain 9 days ago | parent | prev [-]

[flagged]

thot_experiment 10 days ago | parent | prev | next [-]

I've always found the Gemma models to vastly under-perform on vision tasks compared to Qwen so that's nothing new.

mountainriver 9 days ago | parent [-]

The Qwen series adopted vision wayyy earlier than anyone else. No idea why the other labs were sleeping on it but they had about 2 years of experimentation without any competition.

staticman2 9 days ago | parent | prev [-]

Test it on a professional inference provider to rule out trouble on your end.