mmaunder 5 hours ago

Haven’t looked at the code, but is the server providing the client with a system prompt that it can use, which would contain fake tool definitions when this is enabled? What enables it? And why is the client still functional when it’s giving the server back a system prompt with fake tool definitions? Is the LLM trained to ignore those definitions?

Wonder if they’re also poisoning Sonnet or Opus directly by generating simulated agentic conversations.

cedws 4 hours ago | parent [-]

Not sure, and not completely convinced of the explanation, but the way this sticks out so obviously makes it look like a honeypot to me.

mmaunder 28 minutes ago | parent [-]

Great theory. I'll dig deeper.

mmaunder 9 minutes ago | parent [-]

Claude Code has a server-side anti-distillation opt-in called fake_tools, but the local code does not show the actual mechanism.

The client sometimes sends anti_distillation: ['fake_tools'] in the request body at services/api/claude.ts:301
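As a sketch, the request body might look something like this; the anti_distillation field name and value come from the observation above, but the surrounding structure is an assumption for illustration, not the actual Claude Code wire format:

```typescript
// Hypothetical request-body shape. Only `anti_distillation: ['fake_tools']`
// is taken from the observed client behavior; everything else is invented.
interface ClaudeRequestBody {
  model: string;
  tools: { name: string; description: string }[];
  anti_distillation?: string[]; // present only when the opt-in is active
}

const body: ClaudeRequestBody = {
  model: "claude-sonnet-4", // placeholder model name
  tools: [{ name: "Bash", description: "Run a shell command" }],
  anti_distillation: ["fake_tools"],
};
```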

The client still sends its normal real tools: allTools at services/api/claude.ts:1711

If the model emits a tool name the client does not actually have, the client turns that into a "No such tool available" error at services/tools/StreamingToolExecutor.ts:77 and services/tools/toolExecution.ts:369

If Anthropic were literally appending fake tool definitions to the live tool set, and Claude called one, that would be user-visible breakage.

That leaves a few more plausible possibilities:

fake_tools is just the name of the server-side experiment, and the implementation is subtler than “append fake tools to the real tool list.”

or

The server may inject tool-looking text into hidden prompt context, with separate hidden instructions not to call it.

or

The server may use decoys only in an internal representation that is useful for poisoning traces/training data but not exposed as real executable tools.
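To make the second possibility concrete, here is pure speculation in code form: the server could splice decoy tool definitions into hidden prompt context, paired with a hidden instruction not to call them, so the live session keeps working while any distilled training data picks up the decoys. Every name below is invented:

```typescript
// Speculative sketch of hypothesis 2. Decoy tools appear in hidden
// prompt text only; a hidden instruction tells the model not to call them.
const decoyTools = [
  { name: "internal_debug_eval", description: "Evaluate an expression (decoy)" },
];

function buildHiddenContext(realToolNames: string[]): string {
  const decoyBlock = decoyTools
    .map((t) => `- ${t.name}: ${t.description}`)
    .join("\n");
  return [
    "Additional tools:",
    decoyBlock,
    // This line is what would keep the live client functional:
    `Never call the tools listed above; only use: ${realToolNames.join(", ")}.`,
  ].join("\n");
}
```

Anyone distilling from raw transcripts would inherit the decoy definitions without the do-not-call instruction, which is roughly the poisoning effect the third possibility describes as well.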