refulgentis 2 days ago

Reversing the X and Y axes, adding in a few other random models, and dropping all the small Qwens makes this worse than useless as a Qwen 3.5 comparison; it's actively misleading. If you're using AI, please don't rush to copy-paste its output :/

EDIT: Lordy, the small models are a shadow of Qwen's smalls. See https://huggingface.co/Qwen/Qwen3.5-4B versus https://www.reddit.com/r/LocalLLaMA/comments/1salgre/gemma_4...

scrlk 2 days ago | parent | next [-]

I transposed the table so that it's readable on mobile devices.

I should have mentioned that the Qwen 3.5 benchmarks were from the Qwen3.5-122B-A10B model card (which includes GPT-5-mini and GPT-OSS-120B); apologies for not including the smaller Qwen 3.5 models.

refulgentis 2 days ago | parent [-]

It's not readable on a phone either; the text wraps. Unless you're testing on a foldable?

BloondAndDoom a day ago | parent | prev [-]

Small Qwen models are magical

refulgentis a day ago | parent [-]

It's so so good.

I have an app I've been working on for 2.5 years and felt kinda stupid making sure llama.cpp worked everywhere, including Android and iOS.

The 0.8B beats every <= 7B model I've used on tool use and can do RAG. Like, you could ship it to someone who doesn't know AI, and it can handle all the basics while leaving the UX intact.
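For anyone curious what "tool use" means in practice with a local model: llama.cpp's llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint that accepts a `tools` list, so wiring a small model into RAG is mostly about building the right request body. A minimal sketch below; the `search_notes` tool, the model name, and the endpoint details are all hypothetical placeholders, not anything from this thread.

```python
import json

def make_search_tool():
    # Hypothetical RAG lookup tool, described in the OpenAI-style
    # function schema that tool-calling endpoints expect.
    return {
        "type": "function",
        "function": {
            "name": "search_notes",
            "description": "Search the user's local notes; return top snippets.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search terms."},
                    "k": {"type": "integer", "description": "Snippet count."},
                },
                "required": ["query"],
            },
        },
    }

def make_request(user_msg):
    # Request body you'd POST to a local llama-server instance.
    # "local-small" is a placeholder for whatever GGUF you loaded.
    return {
        "model": "local-small",
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [make_search_tool()],
        "tool_choice": "auto",
    }

body = make_request("What did I write about invoices last week?")
print(json.dumps(body, indent=2))
```

The model then either answers directly or returns a `tool_calls` entry naming `search_notes` with arguments; your app runs the search, appends the result as a `tool` message, and calls the endpoint again. That loop is the whole trick, which is why even a sub-1B model can "do RAG" if it's good at emitting well-formed tool calls.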