amluto 20 hours ago

My personal experience is that it’s worthwhile to put instructions, user-manual style, into the context. These are things like:

- How to build.

- How to run tests.

- How to work around the incredible crappiness of the codex-rs sandbox.
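
A rough sketch of the shape such a file can take (everything below is a placeholder, not a real project's setup):

    # AGENTS.md — sketch only; adapt to the actual project

    ## Building
    Run `make build` from the repo root; do not invoke pip directly.

    ## Tests
    Run `pytest -q` and make sure the whole suite passes before finishing a task.

    ## Sandbox quirks
    The sandbox blocks network access; dependencies are vendored under
    ./vendor, so never try to download packages mid-task.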

I also like to put in basic style-guide things like “the minimum Python version is 3.12.” Sadly I seem to also need “if you find yourself writing TypeVar, think again” because (unscientifically) including the actual keyword the agent should avoid seems to make it more likely to remember the instruction.
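
For context on the TypeVar rule: with a 3.12 floor, the PEP 695 syntax covers most uses of TypeVar, so explicit declarations are usually a smell. A minimal sketch (illustrative, not an actual style guide):

    from typing import TypeVar

    # What the rule discourages: an explicit TypeVar declaration.
    T = TypeVar("T")

    def first_old(items: list[T]) -> T:
        return items[0]

    # Preferred on Python 3.12+: PEP 695 inline type parameters.
    def first[T](items: list[T]) -> T:
        return items[0]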

mlaretallack 19 hours ago | parent | next [-]

I also try to avoid negative instructions. No scientific proof, just a feeling, same as you: "do not delete the tmp file" too often leads to the tmp file getting deleted.

strokirk 6 hours ago | parent [-]

It’s like instructing a toddler.

justanothersys 5 hours ago | parent | next [-]

i definitely have gone so far as to treat my llm readable docs in this way and have found it very effective

hnbad 5 hours ago | parent | prev [-]

I recall that early LLMs had the problem of not understanding the word "not", which became especially evident and problematic when tasked with summarizing text because the summary would then sometimes directly contradict the original text.

It seems that problem hasn't really been "fixed"; it's just been paved over. But I guess that's the ugly truth most people tend to forget or deny about LLMs: you can't "fix" them, because there's no line of code you can point to that causes a "bug"; you can only retrain them and hope the problem goes away. In LLMs, every bug is a "heisenbug" (or should that be "murphybug", as in Murphy's Law?).

likium 7 hours ago | parent | prev | next [-]

For TypeVar I’d reach for a lint warning instead.
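
E.g. Ruff can ban the symbol outright via its flake8-tidy-imports banned-api setting (assuming Ruff is the linter in use; sketch only):

    # pyproject.toml
    [tool.ruff.lint]
    extend-select = ["TID251"]  # banned-api rule from flake8-tidy-imports

    [tool.ruff.lint.flake8-tidy-imports.banned-api]
    "typing.TypeVar".msg = "Minimum Python is 3.12; use PEP 695 type parameters instead."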

bonesss 5 hours ago | parent | prev [-]

I have also felt like these kinds of instruction and agent-file efforts were worthwhile, but I am increasingly of the opinion that such feelings are self-delusion: seeing what I expect to see, aided by a tool that always agrees with my, or its own, take on how useful it's being. The agent.md file looks like it should work, it looks how you'd expect, but then it fails over and over. And the process of tweaking it is pleasant chatting full of supportive, supposed insights and solutions, which means hours of fiddling with meta-documentation without clear reward, because adherence is only ever partial.

The paper's conclusions align with my personal experiments at managing a small knowledge base with LLM rules. The application of the rules was inconsistent, their execution fickle, and fundamental changes in processing would happen from week to week as the model usage was tweaked. But rule tweaking always felt good. The LLM said it would work better, the LLM said it had read and understood the instructions, and the LLM said it would apply them… I felt like I understood how best to deliver data to the LLMs, only to see recurrent failures.

LLMs lie. They have no idea, no data, and no insight into specific areas, but they'll produce pleasant, reality-adjacent fiction. Since chatting is seductive, and our sense of time is warped by talking, I think the usual sense of time versus productivity gets pulled even further out of whack. Devs are notoriously bad at estimating where their time goes, and long feedback loops filled with phone time and slow-ass conversation don't help.