The finding that self-generated skills provide negative benefit (-1.3pp) while curated skills give +16.2pp is the most interesting result here imo. Big discrepancy, but makes sense. Aligns with the thought that LLMs are better consumers of procedural knowledge than producers of it.

+4.5pp for software engineering is suspiciously low compared to +51.9pp for healthcare. I suspect this reflects that frontier models already have strong SWE priors from training data, so skills add less marginal value. If true, skills become most valuable precisely in the domains where models are weakest — which is where you'd actually want to deploy agents in production. That's encouraging.

cheema33 6 hours ago | parent | next [-]

> +4.5pp for software engineering is suspiciously low compared to +51.9pp for healthcare.

This stood out for me as well. I do think that LLMs have a lot of training data on software engineering topics and that perhaps explains the large discrepancy. My experience has been that if I am working with a software library or tool that is very new or not commonly used, skills really shine there. Example: Adobe React Spectrum UI library. Without skills, Opus 4.6 produces utter garbage when trying to use this library. With properly curated/created skills, it shines. Massive difference.

	D-Machine 37 minutes ago | parent [-]
		Nothing to add other than that I appreciate you sharing these explicit details and insights here.

hardware2415 8 hours ago | parent | prev [-]

[flagged]

	nvader 7 hours ago | parent | next [-]
		Hmm, not for me, but I'm curious if there are signatures I'm missing. To me, the author reads like an articulate native English speaker who is typing on their phone.
	jeron 7 hours ago | parent | prev | next [-]
		not all em-dash users are AI!
	jibal 6 hours ago | parent | prev | next [-]
		All ad hominems are irrational but that one is worse than most.