Compressed Agents.md > Agent Skills (vercel.com)
63 points by maximedupre 10 hours ago | 31 comments
tottenhm 2 hours ago | parent | next [-]

> In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it.

The agent passes the Turing test...

jgbuddy an hour ago | parent | prev | next [-]

Am I missing something here?

Obviously, directly including context in something like a system prompt will put it in context 100% of the time. You could just as easily take all of an agent's skills, feed them to the agent (in a system prompt, or similar), and it would follow the instructions more reliably.

However, at a certain point you have to use skills, because including everything in the context every time is wasteful, or simply not possible. This is the same reason Anthropic is doing advanced tool use (ref: https://www.anthropic.com/engineering/advanced-tool-use): there isn't enough context to include everything outright.

It's all a context/price trade-off. Obviously, if you have the context budget, just include what you can directly (in this case, by compressing into an AGENTS.md).

jstummbillig 2 minutes ago | parent | next [-]

> Obviously directly including context in something like a system prompt will put it in context 100% of the time.

How do you suppose skills get announced to the model? It's all in the context in some way. The interesting part here is: just (relatively naively) compressing stuff into the AGENTS.md seems to work "better" than however skills are implemented out of the box, for this use case.

observationist 39 minutes ago | parent | prev | next [-]

This is one of the reasons the RLM methodology works so well. You have access to as much information as you want in the overall environment, but only the things relevant to the task at hand get put into context for the current task, and they show up there 100% of the time, as opposed to lossy "memory" compaction and summarization techniques, or probabilistic agent skills implementations.

Having an agent manage its own context ends up being extraordinarily useful, on par with the leap from non-reasoning to reasoning chats. There are still issues with memory and integration, and other LLM weaknesses, but agents are probably going to get extremely useful this year.

orlandohohmeier 40 minutes ago | parent | prev [-]

I’ve been using symlinked agent files for about a year, as a hacky workaround from before skills became a thing, to load additional “context” for different tasks, and it might actually address the issue you’re talking about. Honestly, it’s worked so well for me that I haven’t really felt the need to change it.
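
Roughly, the trick is just re-pointing AGENTS.md at a per-task context file before starting a session. A minimal sketch of the idea (the paths and task names are illustrative, not my actual setup):

    # swap AGENTS.md to point at a task-specific context file (illustrative paths)
    from pathlib import Path

    def use_context(task: str, repo: Path = Path(".")) -> None:
        link = repo / "AGENTS.md"
        target = (repo / "contexts" / f"{task}.md").resolve()  # e.g. contexts/api-refactor.md
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(target)

    use_context("api-refactor")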

BenoitEssiambre 6 minutes ago | parent | prev | next [-]

Wouldn't this have been more readable with a \n newline instead of a pipe character as a separator? It wouldn't have made the prompt any longer.

thorum an hour ago | parent | prev | next [-]

The article presents AGENTS.md as something distinct from Skills, but it is actually a simplified instance of the same concept. Their AGENTS.md approach tells the AI where to find instructions for performing a task. That’s a Skill.

I expect the benefit is from better Skill design, specifically, minimizing the number of steps and decisions between the AI’s starting state and the correct information. Fewer transitions -> fewer chances for error to compound.

ChrisArchitect 2 minutes ago | parent | prev | next [-]

Title is: AGENTS.md outperforms skills in our agent evals

newzino 20 minutes ago | parent | prev | next [-]

The compressed agents.md approach is interesting, but the comparison misses a key variable: what happens when the agent needs to do something outside the scope of its instructions?

With explicit skills, you can add new capabilities modularly - drop in a new skill file and the agent can use it. With a compressed blob, every extension requires regenerating the entire instruction set, which creates a versioning problem.

The real question is about failure modes. A skill-based system fails gracefully when a skill is missing - the agent knows it can't do X. A compressed system might hallucinate capabilities it doesn't actually have because the boundary between "things I can do" and "things I can't" is implicit in the training rather than explicit in the architecture.

Both approaches optimize for different things. Compressed optimizes for coherent behavior within a narrow scope. Skills optimize for extensibility and explicit capability boundaries. The right choice depends on whether you're building a specialist or a platform.

jstummbillig a few seconds ago | parent [-]

Why could you not have a combination of both?

smcleod an hour ago | parent | prev | next [-]

Sounds like they've been using skills incorrectly if they're finding their agents don't invoke them. I have Claude Code agents calling my skills frequently, almost every session. You need to make sure your skill descriptions are well defined and describe when to use them, and that your tasks/goals clearly set out requirements that align with the available skills.
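
Concretely, the name and description in the frontmatter are what the agent sees up front when deciding whether to load a skill, so they need to spell out the trigger conditions. A made-up example of the shape that tends to get picked up:

    ---
    name: db-migrations
    description: >
      Use this skill whenever the task involves creating, editing, or reviewing
      database migration files. Covers naming conventions, rollback requirements,
      and how to run the migration test suite.
    ---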

velcrovan an hour ago | parent [-]

I think if you read it, their agents did invoke the skills, and they did find ways to increase the agents' use of skills quite a bit. But the new approach works 100% of the time as opposed to 79% of the time, which is a big deal. Skills might be working OK for you at that 79% level and for your particular codebase/tool set; that doesn't negate anything they've written here.

pietz an hour ago | parent | prev | next [-]

Isn't it obvious that an agent will do better if it internalizes the knowledge of something instead of having the option to request it?

Skills are new. Models haven't been trained on them yet. Give it 2 months.

WA an hour ago | parent [-]

Not so obvious, because the model still needs to look up the required doc. The article glosses over this detail a bit, unfortunately. The model needs to decide when to use a skill, but doesn't it also need to decide when to look up documentation instead of relying on pretraining data?

velcrovan an hour ago | parent | next [-]

Removing the skill does remove a level of indirection.

It's a difference of "choose whether or not to make use of a skill that would THEN attempt to find what you need in the docs" vs. "here's a list of everything in the docs that you might need."

sothatsit an hour ago | parent | prev [-]

I believe the skills would contain the documentation. It would have been nice for them to give more information on the granularity of the skills they created though.

jryan49 an hour ago | parent | prev | next [-]

Something I always wonder with each blog post comparing different types of prompt engineering is: did they run it once, or multiple times? LLMs are not consistent on the same task. I imagine they realize this, of course, but I never get enough detail on the testing methodology.
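
Even just repeating each run and reporting the spread would say a lot. A toy sketch, where run_eval is a stand-in for whatever harness they actually use:

    # toy sketch: repeat the whole eval N times and report mean/stdev of the pass rate
    from statistics import mean, stdev

    def repeated_pass_rate(run_eval, cases, n=5):
        # run_eval(case) -> bool is a placeholder for the actual eval harness
        rates = [mean(1.0 if run_eval(c) else 0.0 for c in cases) for _ in range(n)]
        return mean(rates), stdev(rates)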

only-one1701 an hour ago | parent [-]

This drives me absolutely crazy. Non-falsifiable and non-deterministic results. All of this stuff is (at best) anecdotes and vibes being presented as science and engineering.

bluGill an hour ago | parent [-]

That is my experience. Sometimes the LLM gives good results, sometimes it does something stupid. You tell it what to do, and like a stubborn 5-year-old it ignores you: even after it tries its own way and fails, it will do what you tell it for a while and then go back to the thing that doesn't work.

sheepscreek 32 minutes ago | parent | prev | next [-]

It seems their tests rely on Claude alone. It’s not safe to assume that Codex or Gemini will behave the same way as Claude. I use all three and each has its own idiosyncrasies.

delduca 13 minutes ago | parent | prev | next [-]

Ah nice… vercel is vibecoded

rao-v an hour ago | parent | prev | next [-]

In a month or three we’ll have the sensible approach: smaller, cheaper, faster models optimized for looking at a query and identifying which skills/context to provide in full to the main model.

It’s really silly to waste big-model tokens on throat-clearing steps.
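
A sketch of what I mean, where small_model and large_model are placeholder callables and the one-line-summary index is just one possible scheme:

    # sketch: a cheap router model picks skills; the big model only sees those in full
    from typing import Callable

    def route_and_answer(
        query: str,
        skills: dict[str, str],              # skill name -> full skill text
        small_model: Callable[[str], str],   # placeholder: cheap/fast router model
        large_model: Callable[[str], str],   # placeholder: expensive main model
    ) -> str:
        # the router only sees a one-line summary of each skill
        index = "\n".join(f"- {name}: {body.splitlines()[0]}" for name, body in skills.items())
        picks = small_model(
            f"Task: {query}\nAvailable skills:\n{index}\n"
            "Reply with the relevant skill names, comma-separated."
        )
        chosen = [s.strip() for s in picks.split(",") if s.strip() in skills]
        # the main model gets the full text of only the chosen skills
        context = "\n\n".join(skills[name] for name in chosen)
        return large_model(f"{context}\n\nTask: {query}")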

Calavar an hour ago | parent [-]

I thought most of the major AI programming tools were already doing this. Isn't this what subagents are in Claude Code?

MillionOClock 43 minutes ago | parent [-]

I don't know about Claude Code, but in GitHub Copilot, as far as I can tell, the subagents are always just the same model as the main one you are using. They also need to be started manually by the main agent in many cases, whereas maybe the parent comment was referring to calling them more deterministically?

sothatsit an hour ago | parent | prev | next [-]

This seems like an issue that will be fixed in newer model releases that are better trained to use skills.

EnPissant 2 hours ago | parent | prev | next [-]

This is confusing.

TFA says they added an index to AGENTS.md that told the agent where to find all the documentation, and that was a big improvement.

The part I don't understand is that this is exactly how I thought skills work. The short descriptions are given to the model up-front and then it can request the full documentation as it wants. With skills this is called "Progressive disclosure".

Maybe they used more effective short descriptions in the AGENTS.md than they did in their skills?

NitpickLawyer 2 hours ago | parent | next [-]

The reported tables also don't match the screenshots. And their baselines and tests are too close to tell (judging by the screenshots, not the tables): 29/33 baseline, 31/33 skills, 32/33 skills + "use skill" prompt, 33/33 AGENTS.md.

sally_glance an hour ago | parent | prev [-]

I also thought this was how skills work, but in practice I've experienced similar issues. The agents I'm using (Gemini CLI, Opencode, Claude) all seem to have trouble activating skills on their own unless explicitly prompted. Yeah, this will probably be fixed over the next couple of generations, but right now dumping the documentation index right into the agent prompt or AGENTS.md works much better for me. Maybe it's similar to structured output or tool calls, which also only started working well after providers specifically trained their models for them.
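
For what it's worth, the index I dump in is nothing fancy; it's roughly this shape (paths made up for the example):

    ## Documentation index
    - docs/routing.md        - file-based routing, dynamic segments, redirects
    - docs/data-fetching.md  - loaders, caching, revalidation
    - docs/deployment.md     - build output, environment variables, edge runtime
    Read the relevant file before working on anything covered by these topics.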

ares623 2 hours ago | parent | prev | next [-]

2 months later: "Anthropic introduces 'Claude Instincts'"

CjHuber an hour ago | parent | prev | next [-]

That feels like a stupid article. Well, of course, if you have one single thing you want to optimize, putting it into AGENTS.md is better. But the advantage of skills is exactly that you don't cram them all into the AGENTS file. Say you had 3 different elaborate things you want the agent to do: good luck putting them all in your AGENTS.md and hoping the agent later remembers any of it. After all, the key advantage of skills is that they get loaded at the end of the context when needed.

thom an hour ago | parent | prev [-]

You need the model to interpret documentation as policy you care about (in which case it will pay attention) rather than as something it can look up if it doesn’t know something (which it will never admit). It helps to really internalise the personality of LLMs as wildly overconfident but utterly obsequious.