ProofHouse 8 days ago

Can you elaborate on how agents degrade with more tools? Through paralysis, or through overuse? Isn't that, either way, a function of correctly instructing the model which tool to use when? Thanks

lelanthran 8 days ago | parent | next [-]

The context window is limited. Using half of it for tool definitions leaves you with half as much room for everything else.

Even on a fairly modest system (not a mini ERP system or even a basic bookkeeping system, just a small inventory management system) you are going to have a few dozen tools, each with a description of its parameters and return values.

For anything like an ERP system you are going to have a few thousand tools, which probably wouldn't even fit in the context before the user-supplied prompt.

This is why the only use case thus far for genAI is coding: with a mere 7 tools you can do everything.
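To make that concrete, here is roughly what a single tool definition looks like in the JSON-schema style most tool-calling APIs use (the tool name and fields below are made up for illustration, not taken from any real system). Every blob like this rides along in the context on every request:

    import json

    # A sketch of one tool definition in the JSON-schema style.
    # The tool name and parameters are invented for illustration.
    create_purchase_order = {
        "name": "create_purchase_order",
        "description": "Create a purchase order for a supplier and a list of line items.",
        "parameters": {
            "type": "object",
            "properties": {
                "supplier_id": {"type": "string", "description": "Internal supplier identifier."},
                "items": {
                    "type": "array",
                    "description": "Line items to order.",
                    "items": {
                        "type": "object",
                        "properties": {
                            "sku": {"type": "string"},
                            "quantity": {"type": "integer"},
                        },
                        "required": ["sku", "quantity"],
                    },
                },
                "currency": {"type": "string", "description": "ISO 4217 code, e.g. EUR."},
            },
            "required": ["supplier_id", "items"],
        },
    }

    # Even this modest definition is a few hundred characters of JSON;
    # multiply by a few dozen (inventory) or a few thousand (ERP) tools
    # and the definitions alone dominate the context window.
    print(len(json.dumps(create_purchase_order)), "characters for one tool")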

pillefitz 8 days ago | parent [-]

The problem of overflowing context is solved by RAGs, though.

lelanthran 8 days ago | parent | next [-]

> The problem of overflowing context is solved by RAGs, though.

No, it isn't.

It's mitigated with RAG, but RAG adds to the context, and what it adds might be irrelevant if all the retriever module is doing is plain-text search.

If the retriever module is performing an embeddings/vector search on a properly prepared dataset you may have more luck, but it's still a piss-poor experience compared to simply putting all the tools into the context.

Of course, I'm not an expert, so I welcome corrections.

dragonwriter 8 days ago | parent | prev [-]

RAG somewhat mitigates the problem of insufficient context; it does not solve it.

diggan 8 days ago | parent | prev | next [-]

> Can you elaborate on how agents degrade with more tools?

The more context you have in a request, the worse the performance; I think this is pretty widely established at this point. For best accuracy, you need to constantly prune the context, or just start over from the beginning.

With that in mind: each tool you make available to the LLM for tool calling requires you to put its definition (the name, arguments, what it's used for, and so on) into the context.

So if you have 3 tools available, all relevant to the current prompt, you'll get better responses than if you had 100 tools available where only 3 are relevant and the rest of the definitions just fill the context to little purpose.

TLDR: context grows with each tool definition, more context == worse inference, so fewer tool definitions == better responses.
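A rough back-of-envelope illustration (the ~150 tokens per definition is just an assumed average; real definitions vary a lot):

    # Rough illustration of the overhead, not a measurement.
    TOKENS_PER_TOOL = 150   # assumed average size of one tool definition
    RELEVANT_TOOLS = 3
    TOTAL_TOOLS = 100

    relevant_only = RELEVANT_TOOLS * TOKENS_PER_TOOL    # ~450 tokens
    everything = TOTAL_TOOLS * TOKENS_PER_TOOL          # ~15,000 tokens

    # Those ~15k tokens are re-sent with every single request, and the model
    # has to attend over all of them even though 97 definitions are noise.
    print(f"relevant only: {relevant_only} tokens, all tools: {everything} tokens")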

112233 8 days ago | parent | next [-]

Are there any easy-to-use inference frontends that support rewriting/pruning the context? Also, ideally, masking out chunks of the kv-cache (e.g. old think blocks)?

Because I cannot find anything short of writing a custom fork/app on top of hf transformers or llama.cpp.

diggan 8 days ago | parent [-]

I tend to use my own "prompt management CLI" (https://github.com/victorb/prompta) to set up somewhat reusable prompts, then paste the output into whatever UI/CLI I'm using at the moment.

Then rewriting/pruning is a matter of changing the files on disk, rerunning "prompta output", and creating a new conversation. I basically never go beyond one user message and one assistant message; it seems to degrade really quickly otherwise.
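If you'd rather do the pruning in code than on disk, it's really just list surgery on the message history before each call. A minimal sketch, assuming the usual role/content chat message format (nothing specific to any provider):

    import re

    # Minimal sketch of pruning a chat history before re-sending it.
    # Assumes the common [{"role": ..., "content": ...}] message format.
    def prune(messages, keep_last_turns=1):
        """Keep the system prompt plus only the last N user/assistant turns,
        and strip old <think>...</think> blocks from assistant messages."""
        system = [m for m in messages if m["role"] == "system"]
        rest = [m for m in messages if m["role"] != "system"]
        kept = rest[-2 * keep_last_turns:]  # one turn = user + assistant
        for m in kept:
            if m["role"] == "assistant":
                m["content"] = re.sub(r"<think>.*?</think>", "", m["content"], flags=re.S)
        return system + kept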

danielrico 8 days ago | parent | prev [-]

I jumped off the LLM boat a little before MCP was a thing, so I thought that tools were presented as needed by the prompt/context, in a way not dissimilar to RAG. Isn't this the standard way?

jacobr1 8 days ago | parent [-]

You _can_ build things that way. But then you need some business logic to decide which tools to expose to the system; the easy/dumb way is just to give it all the tools. With RAG, you have a retrieval step where you have hardcoded some kind of search (likely semantic) and some kind of pruning or relevance logic (maybe give the top 5 results that have at least X% relevance).

With tools there is no equivalent. Maybe you could try some semantic similarity to the tool description, but I don't know of any system that does that.
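If you did want to try it, it would look exactly like the RAG retrieval step, just pointed at tool descriptions instead of documents. A sketch, assuming you have some embed() function from whatever embedding model you'd use:

    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def select_tools(prompt, tools, embed, top_k=5):
        """Return the top_k tools whose descriptions are most similar to the prompt.
        `embed` is assumed to map a string to a vector; `tools` is a list of
        {"name": ..., "description": ..., "parameters": ...} definitions."""
        query = embed(prompt)
        scored = [(cosine(query, embed(t["description"])), t) for t in tools]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [t for _, t in scored[:top_k]]

    # Only the selected definitions go into the request, instead of all of them.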

What seems to be happening is building distinct "agents" that have a set of tools designed into them. An agent is a system prompt plus tools, where some of the tools might be the ability to call/hand off to other agents. Each call to an agent is a new context, albeit with some limited context handed in from the caller agent. That way you are manually decomposing the project into a distinct set of sub-agents that can be concretely reasoned about and can perform a small set of related tasks. Then you need some kind of overall orchestration agent that can handle dispatch to the other agents.
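Schematically, that decomposition looks something like this (the agent names and the call_llm stand-in are made up; the point is that each invocation starts from a fresh, small context containing only that agent's own tools and handoffs):

    from dataclasses import dataclass, field

    @dataclass
    class Agent:
        name: str
        system_prompt: str
        tools: list = field(default_factory=list)       # only this agent's tools
        handoffs: dict = field(default_factory=dict)    # name -> other Agent

    def run(agent, task, call_llm):
        """Each invocation builds a fresh, small context: the agent's own system
        prompt, its own tools, and a short task handed in by the caller.
        `call_llm` is a stand-in for whatever model call you actually make."""
        context = [
            {"role": "system", "content": agent.system_prompt},
            {"role": "user", "content": task},
        ]
        return call_llm(context, tools=agent.tools + list(agent.handoffs))

    billing = Agent("billing", "You handle invoices.", tools=["create_invoice"])
    inventory = Agent("inventory", "You manage stock.", tools=["adjust_stock"])
    orchestrator = Agent(
        "orchestrator",
        "You decompose tasks and hand them off.",
        handoffs={"billing": billing, "inventory": inventory},
    )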

0x457 8 days ago | parent | prev | next [-]

It's better if you see it for yourself. Set up the GitHub MCP server and enable all its tools: it will start using the wrong tools at the wrong time and overusing them. Add languageserver-mcp, and it will suddenly start trying to use it for file edits and make a huge mess of the files.

I have the NixOS MCP server available to search documentation and packages, but the model often starts using it for entirely different things.

It's almost like telling someone not to think about an elephant: they can't stop thinking about it. If you provide the model with a tool, it will try to use it. That's why sub-agents are better: you can limit tool availability.

I use the tidewave MCP server, and as soon as Claude uses a single tool from it, it becomes obsessed with it; I've seen it waste an entire context running evals there without doing any file edits.

ramoz 8 days ago | parent | prev | next [-]

It’s not just context.

It is similar to paralysis, in that with every prompt the model now has to reason over more tools it might decide to use. The more tools you add, the further this likely deviates from what the model saw in training.

datadrivenangel 8 days ago | parent | prev [-]

Imagine that with every task you receive, you also received a list of all the systems and tools you have access to.

A JIRA ticket description might now be several thousand lines long when the actual task description is a few sentences. The signal-to-noise ratio is bad, the risk of making mistakes goes up, and the model's performance degrades.