jgmedr 9 hours ago

Our team has found success in treating skills more like reusable, semi-deterministic functions and less like fingers-crossed prompts for handling random edge cases.

For example, we have a /create-new-endpoint skill. It contains a detailed checklist of all the boilerplate tasks an engineer needs to do in addition to implementing the logic (update the OpenAPI spec, add integration tests, wire up the endpoint boilerplate, and so on). The engineer manually invokes the skill from the CLI via a slash command, provides a JIRA ticket number, and engages in some brief design discussion. The LLM is consistently able to one-shot these tickets in a way that matches our existing application architecture.
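A rough sketch of what such a checklist might contain; the format and items are illustrative, not the poster's actual skill file:

    /create-new-endpoint (input: a JIRA ticket number)

    1. Read the ticket; confirm the path, HTTP method, and request/response shapes with the engineer.
    2. Add the route handler following the existing controller/service layout.
    3. Update the OpenAPI spec with the new path, schemas, and error responses.
    4. Add integration tests for the happy path and validation failures.
    5. Wire up auth, logging, and metrics the same way neighbouring endpoints do.
    6. Run the test suite and the OpenAPI linter before presenting the diff.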

mooreds 7 hours ago | parent

How do you test these skills for consistency over time, or is that not needed?

theshrike79 7 hours ago | parent | next

The same way you'd test a human following written instructions over time.

Check the results.
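One way to make "check the results" concrete is a periodic regression run: replay the skill against a fixed ticket in a throwaway checkout and assert on the artifacts it is supposed to produce. A minimal sketch, assuming a hypothetical "agent run" CLI; the paths, ticket, and expected endpoint are all illustrative:

    import subprocess
    from pathlib import Path

    # Hypothetical regression check: run the skill against a known ticket in a
    # disposable checkout, then verify the artifacts it should have produced.
    REPO = Path("/tmp/skill-regression-checkout")

    subprocess.run(
        ["agent", "run", "/create-new-endpoint", "--ticket", "API-1234"],
        cwd=REPO, check=True,
    )

    spec = (REPO / "openapi.yaml").read_text()
    assert "/widgets/{id}" in spec, "OpenAPI spec was not updated"
    assert list((REPO / "tests" / "integration").glob("test_*widget*.py")), "no integration test added"
    assert subprocess.run(["pytest", "-q"], cwd=REPO).returncode == 0, "test suite failed"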

pizzafeelsright 5 hours ago | parent | prev

My experience has been that if the skill is broken down into a function, possibly paired with a validator in a separate stage, the output is 99.9% deterministic.

I have not yet tested this at scale, but give me six months.
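The generator-plus-validator pattern described above might look roughly like this; a minimal sketch where generate stands in for whatever model call the skill actually makes, and the validator checks are placeholders:

    from typing import Callable

    # Sketch of a generate-then-validate loop: one stage produces a change, a
    # separate deterministic stage checks it, and failures are fed back for
    # another attempt. The concrete checks here are illustrative only.
    def validate(diff: str) -> list[str]:
        problems = []
        if "openapi" not in diff.lower():
            problems.append("OpenAPI spec not touched")
        if "def test_" not in diff:
            problems.append("no tests added")
        return problems

    def run_skill(generate: Callable[[str], str], ticket: str, max_attempts: int = 3) -> str:
        prompt = ticket
        for _ in range(max_attempts):
            diff = generate(prompt)         # stage 1: the LLM produces the change
            problems = validate(diff)       # stage 2: deterministic checks
            if not problems:
                return diff
            prompt = f"{ticket}\nPrevious attempt rejected: {'; '.join(problems)}"
        raise RuntimeError(f"skill did not converge for {ticket}")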