| ▲ | xpct 9 hours ago |
| I've been largely disappointed by how much the Claude models ignore custom instructions, and sometimes even prompts in the chat interface. It sometimes feels like talking to a wall, or as if there were a third person in the chatroom whose messages I can't see. I can't help but feel this is an intentional shift towards the 'Agentic' workflow. |
|
| ▲ | spacephysics 9 hours ago | parent | next [-] |
| I think this is purposeful, as there are 2 opposing forces at play:
- Have a model that follows the user's instructions
- Have a model that follows the system prompt instructions more
For the 'safety' argument (Re: Fable), they need these models to have basically a 2-tier instruction system, but given LLMs aren't great with actual logic unless they program it out to test, this runs afoul and we get one or the other. Feels like optimizing for either precision or recall, but can't have both. |
| |
| ▲ | paradox460 3 hours ago | parent | next [-] | | We're speed running HAL 9000 | |
| ▲ | wqaatwt 9 hours ago | parent | prev [-] | | I suppose a solution might be going with a customizable harness like pi and merging Anthropic’s system prompt with a personalized custom one to remove all contradictions | | |
| ▲ | arcanemachiner 8 hours ago | parent [-] | | You still have to manage/fight with the post-training that is baked into the model itself. |
|
|
|
| ▲ | manveerc 9 hours ago | parent | prev | next [-] |
| Totally agreed. I sometimes wonder if they are making the model "lazy" with each iteration; it keeps getting better at avoiding work. |
| |
| ▲ | skerit 9 hours ago | parent [-] | | This is why Fable was so good. It followed instructions and it was in no way lazy. | | |
| ▲ | DontchaKnowit 9 hours ago | parent | next [-] | | People keep making comments about Fable like this? You could only use it for what, like a week? How is that at all enough time to evaluate? Opus 4.6 didn't suffer from these problems for a hot minute, and then when newer models were released it got worse. I think they change a ton behind the scenes and allocate compute however they want, so the model you use today may behave much differently than it did yesterday | | |
| ▲ | pdimitar 8 hours ago | parent | next [-] | | > You could only use it for what like a week? How is that at all enough time to evaluate? By observing how in 4 workdays it achieved more than Opus in ~11 days. I am my team's backend lead and the Fable 5 model finally turned the tide on my overwhelming backlog. Back to Opus and I have to treat it like a special-education kid multiple times a day. | |
| ▲ | boc 8 hours ago | parent | prev | next [-] | | The ~72 hours I had access to Fable were by far the most productive I've had in months. Re-wrote massive parts of my codebase and caught a ton of bugs and logic issues that had silently slipped through before. I went over my subscription limit and immediately kept paying the API price to keep going. It was that good. | |
| ▲ | plorkyeran 8 hours ago | parent | prev | next [-] | | It was a pretty stark difference. I had the opposite problem where it did too much and overshot what I wanted from it so I certainly assume that if it had stuck around it would have gotten tuned back a bit pretty quickly. | |
| ▲ | marcindulak 6 hours ago | parent | prev | next [-] | | For me, claude-fable-5 failed the instruction-following test I'm running against various models https://github.com/marcindulak/claude-fails-to-follow-claude... | |
| ▲ | tskj 8 hours ago | parent | prev | next [-] | | You didn't really have to use it more than a day honestly to tell what kind of shocking paradigm change it was. Man do I miss it. | |
| ▲ | Analemma_ 8 hours ago | parent | prev [-] | | Heh, it's not crazy if you're here in the Bay: I know multiple people who more-or-less disappeared for days when Fable came out because they were running their benchmarks, and only emerged blinking into the sunlight when the USG banned it. That's just how things are here now, most people are normal but there are some serious LLM dope addicts out and about. |
| |
| ▲ | acters 9 hours ago | parent | prev [-] | | I've been seeing LLMs act lazy from the very beginning. They got a little better but smaller models really only want to have a single task given to them. Mythos at least does work. RIP |
|
|
|
| ▲ | marcindulak 6 hours ago | parent | prev | next [-] |
| I keep adding selected cases of CLAUDE.md instruction non-compliance reported on the claude-code GitHub to that issue https://github.com/anthropics/claude-code/issues/13689. Subjectively, the number of such cases seems lower during the past month. It may be that claude-opus-4-8 (default thinking) is a bit better at instruction following than past models. |
|
| ▲ | gs17 9 hours ago | parent | prev | next [-] |
| > or as if there was a third person in the chatroom whose messages I can't see.
If you set off a classifier, that's how it looks to Claude. |
| |
| ▲ | xpct 9 hours ago | parent [-] | | I wasn't working with anything sensitive, but it really does feel like it sometimes condenses even something low like three bullet points to two. IMO, they were quite good with checklists even a year ago, and tried to tick off each one. |
|
|
| ▲ | storus 9 hours ago | parent | prev | next [-] |
| Try to run your prompts through Claude to pinpoint any ambiguous parts that can be interpreted in multiple ways, or self-contradictory sections. I typically resolve any prompt-ignoring issues with that. |
|