pcwelder 6 hours ago
It's live on OpenRouter now. In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general. For those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format: I ask it to do coding tasks using chat.md [1] + MCPs, and so far it's just not able to follow the format at all.
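A benchmark like this can be sketched as a simple format check on the model's raw output. The `%%tool` syntax below is hypothetical (it is not chat.md's actual format), just a stand-in for any custom tool-calling convention a model is asked to follow:

```python
import re

# Hypothetical custom tool-call syntax (NOT chat.md's real format):
#   %%tool: read_file
#   path: src/main.py
#   %%end
TOOL_CALL_RE = re.compile(
    r"%%tool:\s*(?P<name>\w+)\n(?P<args>.*?)%%end",
    re.DOTALL,
)

def follows_format(model_output: str) -> bool:
    """True if the output contains at least one well-formed custom tool call."""
    return bool(TOOL_CALL_RE.search(model_output))

# A compliant reply vs. a model falling back to its native JSON tool-call style.
good = "Let me check.\n%%tool: read_file\npath: src/main.py\n%%end\n"
bad = 'Sure! {"tool": "read_file", "path": "src/main.py"}'

print(follows_format(good))  # True
print(follows_format(bad))   # False
```

Scoring a model is then just the fraction of tasks where every tool call it emits passes a check like this.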
manofmanysmiles 5 hours ago
I love the idea of chat.md. I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it. I think I'll update my editor to do something like this, but with intelligent "collapsing" of extra text to reduce visual noise.
data-ottawa 5 hours ago
Custom tool-calling formats are iffy in my experience. The models are all reinforcement-learned to follow specific ones, so it's always a battle and feels to me like using the tool wrong. Have you had good results with the other frontier models?
nolist_policy 5 hours ago
Could also be the provider that is bad. Happens way too often on OpenRouter.
sergiotapia 5 hours ago
Be careful with OpenRouter. They routinely host quantized versions of models via their listed providers, and the models just suck because of that. Use the original providers only.