| ▲ | XCSme 6 hours ago |
| Claude is quite bad at following instructions compared to other SOTA models. As in, you tell it "only answer with a number", then it proceeds to tell you "13, I chose that number because..." |
|
| ▲ | wouldbecouldbe 6 hours ago | parent | next [-] |
| I think that's why it's so good; it works on half-assed assumptions and poorly written prompts, and fills in everything that's missing. |
| |
| ▲ | XCSme 5 hours ago | parent | next [-] | | To be honest, I had this "issue" too. I upgraded to a new model (gpt-4o-mini to grok-4.1-fast), and suddenly all my workflows were broken. I was like "this new model is shit!", then I looked into my prompts and realized the model was actually better at following instructions, and my instructions were wrong/contradictory. After I fixed my prompts, it did exactly what I asked for. Maybe models should have another tunable parameter, for how strictly they should follow the user prompt. This reminds me of imagegen models, where you can choose the config/guidance scale/diffusion strength. | |
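(The guidance-scale knob mentioned above comes from classifier-free guidance in diffusion models. A minimal sketch of the blending step, not any particular library's API; the function and parameter names here are made up for illustration:)

```python
def cfg(uncond_pred, cond_pred, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the prompt-conditioned one. A scale of 1.0
    returns the conditioned prediction unchanged; larger values
    push the output harder toward the prompt."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(uncond_pred, cond_pred)]

# scale 1.0 -> exactly the conditioned prediction
print(cfg([0.0, 0.0], [1.0, 0.5], 1.0))  # [1.0, 0.5]
# scale 2.0 -> overshoots toward the prompt
print(cfg([0.0], [1.0], 2.0))  # [2.0]
```

(A hypothetical "prompt-adherence" parameter for LLMs would be an analogous dial, though no major API exposes one today.)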
| ▲ | vidarh 5 hours ago | parent | prev [-] | | I worked on a project that did fine tuning and RLHF[1] for a major provider, and you would not believe just how utterly broken a large proportion of the prompts (from real users) were. And the project rules required practically reading tea leaves to divine how to give the best response even to prompts that were not remotely coherent human language. [1] Reinforcement learning from human feedback; basically participants got two model responses and had to judge them on multiple criteria relative to the prompt | | |
| ▲ | redman25 an hour ago | parent [-] | | I feel like the right response for those situations is to start asking questions of the user. It’s what a human would do if they did not understand. |
|
| ▲ | prmph 5 hours ago | parent | prev [-] |
| They all are. And once the context has rotted or been poisoned enough, it is unsalvageable. Claude is now actually one of the better ones at instruction following, I daresay. |
| |
| ▲ | XCSme 5 hours ago | parent [-] | | In my tests it's worst at adding extra formatting or output: https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu... For example, sometimes it outputs markdown without being asked to (e.g. "**13**" instead of "13"), even when asked to respond with a number only. This might be fine in a chat environment, but not in a workflow, agentic use case, or tool usage. Yes, it can be enforced via structured output, but within a string field of a structured output you might still want to enforce a specific natural-language response format, which can't be defined by a schema. |
|
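(The gap described above — structured output guarantees valid JSON but not the format of text inside a string field — usually gets patched with a post-validation step. A minimal sketch, assuming a hypothetical reply with an "answer" field that should be a bare number; the field name and helper are made up for illustration:)

```python
import json
import re

NUMBER_ONLY = re.compile(r"\d+")

def extract_number(raw_json: str) -> int:
    """Parse a structured model reply, then enforce that the string
    field contains only a bare number, stripping common markdown
    decoration (bold/backticks) before rejecting."""
    field = json.loads(raw_json)["answer"]
    if not NUMBER_ONLY.fullmatch(field):
        field = field.strip("*` ")  # tolerate "**13**" or "`13`"
        if not NUMBER_ONLY.fullmatch(field):
            raise ValueError(f"model output is not a bare number: {field!r}")
    return int(field)

print(extract_number('{"answer": "13"}'))      # 13
print(extract_number('{"answer": "**13**"}'))  # 13, markdown stripped
```

(This only salvages predictable decoration; the original point stands that the schema itself can't express "number only" inside a free-form string field.)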