SOLAR_FIELDS 4 hours ago
I think the key thing you point out is worth observing more generically: if the LLM hits a wall, its first inkling is not to step back, understand why the wall exists, and change course; its first inkling is to keep assisting the user with the task by any means possible, so it tries to defeat the wall instead. I see this all the time when it hits code coverage constraints: it would much rather just lower the thresholds than actually add more coverage.

I experimented a lot with hooks over the summer: deterministic hooks that run before commit, after tool call, after edit, and so on. Unsurprisingly, I found they are much more effective if you can craft and deliver a concise, helpful error message to the agent in the hook-failure feedback. Even giving it a good howToFix string in the error return isn't enough on its own: if you flood the response with too many of those at once, the agent will view the task as insurmountable and start seeking workarounds instead.
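For concreteness, here is a minimal sketch of the kind of hook I mean: a pre-commit coverage gate that fails with exactly one concise, actionable message instead of a flood of findings. The harness is assumed (a nonzero exit blocks the commit and the printed JSON is fed back to the agent), and the howToFix payload and the 85% floor are illustrative, not from any particular tool:

    #!/usr/bin/env python3
    # Sketch of a pre-commit coverage hook. Assumptions: the harness
    # blocks the commit on a nonzero exit and feeds stdout back to the
    # agent; a coverage.json report (coverage.py's `coverage json`
    # output) has already been generated.
    import json
    import sys

    COVERAGE_FLOOR = 85.0  # illustrative threshold

    def current_coverage() -> float:
        # "totals.percent_covered" is the key coverage.py writes
        # into its JSON report.
        with open("coverage.json") as f:
            return json.load(f)["totals"]["percent_covered"]

    def main() -> int:
        pct = current_coverage()
        if pct >= COVERAGE_FLOOR:
            return 0
        # One concise, actionable error -- not a flood of findings.
        print(json.dumps({
            "error": f"coverage {pct:.1f}% is below the "
                     f"{COVERAGE_FLOOR:.0f}% floor",
            "howToFix": "add tests covering the lines you just "
                        "edited; do not lower the threshold",
        }))
        return 2

    if __name__ == "__main__":
        sys.exit(main())

The design point is the single, scoped message: the agent gets one failure, one reason, and one suggested fix per run, which in my experience keeps it working the problem instead of routing around it.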
AdieuToLogic 3 hours ago
> ... if the LLM hits a wall, its first inkling is not to step back, understand why the wall exists, and change course; its first inkling is ...

LLMs do not "understand why." They do not have an "inkling." Claiming they do is anthropomorphizing a statistical token (text) document generator algorithm.
| ||||||||||||||||||||||||||