onlyrealcuzzo 2 hours ago

I just tested this on a bug fixing benchmark I'm working on.

It did not perform as well as I expected. Qwen2.5-Coder-3B (2 years old) outperformed it by a wide margin, fixing ~50% of bugs whereas this model fixed only ~12%.

Granted, it's not a coder-specific model, but given its benchmark performance relative to Gemma models, that it's two years newer, and that it's an MoE with 8B total params, I expected it to be more competitive.

XCSme 44 minutes ago

I will test it when it's accessible via OpenRouter, but the previous LFM2 model (lfm-2-24b-a2b) didn't do well on my tests: it got only 1/20 questions/tasks right, way below Gemma 31B or Qwen 35b-a3b (those get around 10/20 right).

debazel an hour ago

I tried it with OpenCode and it is borderline incapable of using tool calls, so that might be why it is doing so badly on your test.

peder 28 minutes ago

I just did the same. Absolutely awful. I assume OpenCode's heavy context is a problem, and it's probably better to use Liquid's own OpenCode alternative for this.

HanClinto 2 hours ago

Some of the coding-specific fine-tunes gave really impressive boosts. Qwen2.5-3B-Instruct is also available [0]. If it's not too much to ask, I'd be curious how more general models stack up in your benchmark.

[0] - https://huggingface.co/Qwen/Qwen2.5-3B-Instruct