pcwelder 6 hours ago

It's live on openrouter now.

In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.

To those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format. I ask it to do coding tasks using chat.md [1] plus MCPs. And so far it's just not able to follow the format at all.

[1] https://github.com/rusiaaman/chat.md
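To make the benchmark concrete, here is a minimal sketch of what "following a custom tool-calling format" means: the model must emit calls in an exact textual wrapper that a parser can extract. The `<tool_call>` syntax and the `read_file` tool below are hypothetical illustrations, not the actual chat.md format.

```python
import json
import re

# Hypothetical reply from a model that was instructed to wrap tool calls
# in <tool_call>...</tool_call> tags containing a JSON object.
SAMPLE_REPLY = """\
Let me read that file first.

<tool_call>
{"name": "read_file", "arguments": {"path": "src/main.py"}}
</tool_call>
"""

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(reply: str) -> list[dict]:
    """Parse every well-formed tool call out of a model reply.

    A model that drifts from the format (wrong tags, malformed JSON)
    yields zero calls or raises, which is the failure the benchmark
    is measuring.
    """
    return [json.loads(m.group(1)) for m in TOOL_CALL_RE.finditer(reply)]

calls = extract_tool_calls(SAMPLE_REPLY)
print(calls)  # [{'name': 'read_file', 'arguments': {'path': 'src/main.py'}}]
```

The strictness is the point: a custom format gives the model no training-time crutch, so adherence is a clean signal of instruction following.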

manofmanysmiles 5 hours ago | parent | next

I love the idea of chat.md.

I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.

I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.

data-ottawa 5 hours ago | parent | prev | next

Custom tool calling formats are iffy in my experience. The models are all reinforcement learned to follow specific ones, so it’s always a battle and feels to me like using the tool wrong.

Have you had good results with the other frontier models?
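For contrast with a custom format, here is the kind of "specific" schema models are typically reinforcement-trained on: the OpenAI-style Chat Completions tool definition, shown as plain dicts so it runs without an API key. The `read_file` tool and the `call_123` id are illustrative, but the field names follow the published schema.

```python
import json

# A native tool definition in the OpenAI-style Chat Completions schema.
# Models fine-tuned against this format emit calls for it reliably.
read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path to the file."}
            },
            "required": ["path"],
        },
    },
}

# What the model returns: a structured call object, with the arguments
# serialized as a JSON string.
assistant_tool_call = {
    "id": "call_123",
    "type": "function",
    "function": {
        "name": "read_file",
        "arguments": json.dumps({"path": "src/main.py"}),
    },
}

args = json.loads(assistant_tool_call["function"]["arguments"])
print(args["path"])  # src/main.py
```

Because the provider parses and validates these calls server-side, a model fighting this schema is rare; a custom in-band format forfeits that guarantee.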

nolist_policy 5 hours ago | parent | prev | next

Could also be the provider that is bad. Happens way too often on OpenRouter.

pcwelder 5 hours ago | parent

I had explicitly added z-ai to the allow list and verified that it's the one being used.

sergiotapia 5 hours ago | parent | prev

Be careful with OpenRouter. They routinely host quantized versions of models via their listed providers, and the models just suck because of that. Use the original providers only.

nullbyte 3 hours ago | parent

I specifically do not use the CN/SG-based original provider, simply because I don't want my personal data traveling across the Pacific; I try to stay on US providers only. OpenRouter shows you the quantization of each provider, so you can choose a domestic one that's FP8 if you want.