ai-christianson 8 days ago

I've been building agents for a bit (RA.Aid OSS coding agent, now Gobii web browsing agents).

The main problem with MCP is that it just makes tools available for the agent to use. We get the best performance when there's a small set of tools and we actively prompt the agent on the best way to use the tools.

Simply making more tools available can give the agent more capabilities, but it can easily trash performance.

electric_muse 8 days ago | parent | next [-]

This is 100% a problem with the MCP spec: it does not currently provide a way to narrow what tools, and therefore context, flow into the LLM.

I don't really think there's an easy solution at the protocol level, since you can't just make the LLM say what tools it wants upfront. There's a whole discovery process during the handshake:

LLM(Host): Hi, I'm Claude Desktop, what do you offer?

MCP Server: Hi, I'm Salesforce MCP, I offer all these things: {...tools, prompts, resources, etc.}
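
Under the hood that exchange is just JSON-RPC. A rough sketch of the discovery step (field names approximate the spec; the "query_accounts" tool is a made-up example):

    // Sketch of the discovery exchange as JSON-RPC messages. Field names
    // approximate the MCP spec; "query_accounts" is a made-up example tool.

    const listToolsRequest = {
      jsonrpc: "2.0",
      id: 2,
      method: "tools/list",
    };

    const listToolsResponse = {
      jsonrpc: "2.0",
      id: 2,
      result: {
        tools: [
          {
            name: "query_accounts", // hypothetical Salesforce-ish tool
            description: "Search accounts by name or owner.",
            inputSchema: {
              // JSON Schema describing the arguments
              type: "object",
              properties: { query: { type: "string" } },
              required: ["query"],
            },
          },
          // ...and every other tool the server offers, all of which
          // typically end up in the model's context
        ],
      },
    };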

Discoverability is one of the reasons MCP has a leg up on traditional APIs. (Sure, OpenAPI helps, but it's not quite the same thing.)

I'd be interested in hearing other recommendations or ideas, but when I saw this, I realized that the spec effectively necessitates that a whole new layer exist: the gateway plane.

Basically, you need a place where the MCPs can connect & expose everything they offer. Then, via composability and settings, you can select what you want to pass through to the LLM (host), given the specific job it has.
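
Roughly the idea, as a sketch (the types and the allowlist config are purely illustrative):

    // Illustrative gateway: aggregate tools from several MCP servers, then
    // expose only the subset configured for this particular job.

    type Tool = { name: string; description: string; inputSchema: unknown };

    interface GatewayConfig {
      // per-job allowlist of qualified tool names, e.g. "salesforce.query_accounts"
      allowlist: string[];
    }

    function exposeTools(
      upstream: Map<string, Tool[]>, // server name -> tools it advertises
      config: GatewayConfig
    ): Tool[] {
      const exposed: Tool[] = [];
      for (const [server, tools] of upstream) {
        for (const tool of tools) {
          if (config.allowlist.includes(`${server}.${tool.name}`)) {
            exposed.push(tool); // only these reach the host/LLM
          }
        }
      }
      return exposed;
    }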

I basically pivoted my company to start building one of these, and we're getting inundated right now.

This whole thing reminds me of the early web days, where the protocols and standards were super basic and loose, and we all just built systems and tools to fill those gaps. Just because MCP isn't "complete" doesn't mean it's not valuable. In fact, I think leaving some things to the community & commercial offerings is a great way for this tech to keep winning.

pengli1707 8 days ago | parent | next [-]

> This is 100% a problem with the MCP spec: it does not currently provide a way to narrow what tools, and therefore context, flow into the LLM.

It's not the business of the MCP spec; it should be handled at the task/job level. Different tasks may need different tools, and MCP just supplies the entire set of tools it has. Tool selection should be taken charge of by the LLM. Moreover, maybe some hosts have a whitelist to include or exclude certain tools, but that shouldn't come from the MCP spec.

dragonwriter 8 days ago | parent | prev | next [-]

> This is 100% a problem with the MCP spec

No, it's not.

It's a problem with the design of agentic workflows. It's on a whole different level of the stack than the MCP spec.

It is a real issue, but not one that it makes sense for the MCP spec to be concerned with.

jddj 8 days ago | parent | prev [-]

> OpenAPI helps, but it's not quite the same thing

I haven't dug into MCP yet, but can you give any examples as to why OpenAPI isn't/wasn't enough?

miniatureape 8 days ago | parent | prev | next [-]

Is this the problem that Claude sub-agents are supposed to be solving?

They say they're for preserving and managing context, and I've been wondering if they help with the "too many tools" problem.

https://docs.anthropic.com/en/docs/claude-code/sub-agents

potatolicious 8 days ago | parent [-]

It can, but I remain deeply unconvinced that the sub-agent architecture works as well as advertised.

The trick with any layering like this is that your end-to-end reliability is subagent_reliability * routing_agent_reliability. Neither is 100% (or anywhere close to it, let's be honest), so multiplying the probabilities is still going to trash your performance: even 0.9 routing accuracy times 0.9 subagent reliability leaves you at 0.81 end to end.

If you get routed to the correct subagent, then subsequent performance is likely to be solid - but that's because you've taken the `routing_agent_reliability` term out of the equation.

Routing agent reliability hinges pretty heavily on the subagents themselves and how semantically or linguistically similar they are. If you have subagents that are in wildly disparate domains it may work well, but if your subagents start overlapping (or just look like they overlap) then routing accuracy is likely going straight into the dumpster. And a mis-route is catastrophic in that setup.

For very specific agents (well-established workflows that cross multiple, well-defined, non-overlapping domains) the architecture may be suitable, but in terms of the holy grail of the omni-agent (i.e., a desktop app agent suitable for general use) I suspect we'll continue running into a brick wall.

ProofHouse 8 days ago | parent | prev | next [-]

Can you elaborate on how agents degrade with more tools? By paralysis or overuse? Isn't it, either way, a function of correctly instructing the agent on which tool to use when? Tnx

lelanthran 8 days ago | parent | next [-]

The context window is limited. Using half your context window for tools means you have a 50% smaller context window.

On a large and complex system (not even a mini ERP system or a basic bookkeeping system, just a small inventory management system) you are going to have a few dozen tools, each with a description of its parameters and return values.

For anything like an ERP system you are going to have a few thousand tools, which probably wouldn't even fit in the context before the user-supplied prompt.

This is why the only use case thus far for genAI is coding: with a mere 7 tools you can do everything.
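
Rough back-of-the-envelope math (both numbers are assumptions, just to show the scale):

    // Back-of-the-envelope: context consumed by tool definitions alone.
    // Both numbers are assumptions, not measurements.

    const tokensPerToolDefinition = 300; // name + description + JSON schema
    const toolCount = 200;               // a modest ERP-style surface area

    console.log(tokensPerToolDefinition * toolCount); // 60000 tokens before the user even types a prompt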

pillefitz 8 days ago | parent [-]

The problem of overflowing context is solved by RAGs, though.

lelanthran 8 days ago | parent | next [-]

> The problem of overflowing context is solved by RAGs, though.

No, it isn't.

It's mitigated with RAGs, but RAGs add to the context, and what they add might be irrelevant if all the retriever module is doing is plain text search.

If the retriever module is performing an embeddings/vector search on a properly prepared dataset you may have more luck, but it's still a piss-poor experience compared to simply putting all the tools into the context.
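
A minimal sketch of what that could look like applied to tools, assuming you have some embedding function available (the embed parameter is a stand-in for whatever embedding API you use; none of this comes from an actual MCP implementation):

    // Illustrative: pick the top-k tools by embedding similarity to the user
    // prompt, and only put those definitions into the context.

    type Tool = { name: string; description: string };
    type EmbedFn = (text: string) => Promise<number[]>; // stand-in for any embedding API

    function cosine(a: number[], b: number[]): number {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    async function selectTools(
      tools: Tool[],
      userPrompt: string,
      embed: EmbedFn,
      k = 5
    ): Promise<Tool[]> {
      const promptVec = await embed(userPrompt);
      const scored = await Promise.all(
        tools.map(async (t) => ({
          tool: t,
          score: cosine(promptVec, await embed(`${t.name}: ${t.description}`)),
        }))
      );
      return scored
        .sort((a, b) => b.score - a.score)
        .slice(0, k)
        .map((s) => s.tool);
    }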

Of course, I'm not an expert, so I welcome corrections.

dragonwriter 8 days ago | parent | prev [-]

RAG mitigates somewhat the problem of insufficient context, it does not solve it.

diggan 8 days ago | parent | prev | next [-]

> Can you elaborate on how agents degrade with more tools?

The more context you have in the request, the worse the performance; I think this is pretty widely established at this point. For best accuracy, you need to constantly prune the context, or just start over from the beginning.

With that in mind, each tool you make available to the LLM for tool calling requires you to put its definition (arguments, what it's used for, the name, and so on) into the context.

So if you have 3 tools available, all of which are relevant to the current prompt, you'd get better responses than if you had 100 tools available where only 3 are relevant and the rest of the definitions are just filling the context to little purpose.

TLDR: context grows with each tool definition, more context == worse inference, so fewer tool definitions == better responses.
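
For concreteness, each definition is roughly this much structure per tool (shown in an OpenAI-style function-calling shape as an example; the weather tool is made up):

    // Roughly what a single tool definition adds to every request. With 100
    // tools, 100 of these sit in the context even if only 3 are relevant.

    const getWeatherTool = {
      type: "function",
      function: {
        name: "get_weather", // made-up example tool
        description: "Get the current weather for a city.",
        parameters: {
          type: "object",
          properties: {
            city: { type: "string", description: "City name, e.g. Berlin" },
            unit: { type: "string", enum: ["celsius", "fahrenheit"] },
          },
          required: ["city"],
        },
      },
    };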

112233 8 days ago | parent | next [-]

Are there any easy-to-use inference frontends that support rewriting/pruning the context? Also, ideally, masking out chunks of the KV cache (e.g. old think blocks)?

Because I cannot find anything short of writing a custom fork/app on top of HF Transformers or llama.cpp.

diggan 8 days ago | parent [-]

I tend to use my own "prompt management CLI" (https://github.com/victorb/prompta) to set up somewhat reusable prompts, then paste the output into whatever UI/CLI I use at the moment.

Then rewriting/pruning is a matter of changing the files on disk, rerunning "prompta output", and creating a new conversation. I basically never go beyond one user message and one assistant message; it seems to degrade really quickly otherwise.

danielrico 8 days ago | parent | prev [-]

I jumped off the LLM boat a little before MCP was a thing, so I thought that tools were presented as needed by the prompt/context, in a way not dissimilar to RAG. Isn't this the standard way?

jacobr1 8 days ago | parent [-]

You _can_ build things that way. But then you need some business logic to decide which tools to expose to the system. The easy/dumb way is just to give it all the tools. With RAG, you have a retrieval step where you have hardcoded some kind of search (likely semantic) plus some kind of pruning or relevance logic (maybe return the top 5 results that have at least X% relevancy).

With tools there is no equivalent. Maybe you could try some semantic similarity to the tool description, but I don't know of any system that does that.

What seems to be happening is building distinct "agents" that have a set of tools designed into them. An agent is a system prompt + tools, where some of the tools might be the ability to call or hand off to other agents. Each call to an agent is a new context, albeit with some limited context handed in from the caller agent. That way you manually decompose the project into a distinct set of sub-agents that can be concretely reasoned about and that each perform a small set of related tasks. Then you need some kind of overall orchestration agent that can handle dispatch to the other agents.
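
A sketch of that decomposition (the types and names are illustrative, not from any particular framework):

    // Illustrative: an "agent" is a system prompt plus a small tool set, and
    // hand-offs to other agents are themselves exposed as tools.

    type Tool = { name: string; description: string; inputSchema: unknown };

    interface Agent {
      name: string;
      systemPrompt: string;
      tools: Tool[];     // small, task-specific set
      handoffs: Agent[]; // other agents this one may dispatch to
    }

    // Each handoff becomes a single extra tool in the caller's context,
    // instead of pulling in the callee's entire tool set.
    function handoffTools(agent: Agent): Tool[] {
      return agent.handoffs.map((sub) => ({
        name: `handoff_to_${sub.name}`,
        description: `Delegate a task to the ${sub.name} agent.`,
        inputSchema: {
          type: "object",
          properties: { task: { type: "string" } },
          required: ["task"],
        },
      }));
    }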

0x457 8 days ago | parent | prev | next [-]

Better if you see it for yourself. Set up the GitHub MCP and enable all tools. It will start using the wrong tools at the wrong time and overusing them. Add languageserver-mcp, and it will suddenly start trying to use it for file edits and create a huge mess in your files.

I have the NixOS MCP server available to search documentation and packages, but it often starts using it for entirely different things.

It's almost like when you tell someone not to think about an elephant and they can't stop thinking about it - if you provide the model with a tool, it will try to use it. That's why sub-agents are better: you can limit tool availability.

I use the Tidewave MCP, and as soon as it uses a single tool from it, Claude becomes obsessed with it; I saw it waste an entire context running evals there without doing any file edits.

ramoz 8 days ago | parent | prev | next [-]

It’s not just context.

It is similar to paralysis, in that with every prompt the model now has to reason over more tools that it could possibly decide to use, and the more tools you add, the further this surely deviates from what the model saw in training.

datadrivenangel 8 days ago | parent | prev [-]

Imagine that with every task you receive, you also got a list of all the systems and tools you have access to.

So a JIRA ticket description might now be several thousand lines long when the actual task description is a few sentences. The signal-to-noise ratio is bad, the risk of making mistakes goes up, and the model's performance degrades.

blitzar 8 days ago | parent | prev | next [-]

Perhaps tools trained into the model rather than exposed through prompting would mitigate the performance hit (but might affect model quality?).

diggan 8 days ago | parent [-]

This is where you start fine-tuning the weights; with the right data you can get pretty great results on specific tool calls.
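
For illustration, a training record for that kind of fine-tune might look something like this (the message/tool-call shape loosely follows common chat formats; the create_issue tool and all values are made up):

    // Illustrative shape of one supervised fine-tuning example for tool
    // calling. The "create_issue" tool and every value are made up; real
    // training data follows whatever chat/tool format your stack expects.

    const trainingExample = {
      messages: [
        { role: "user", content: "File a bug: the login page 500s on Safari" },
        {
          role: "assistant",
          tool_calls: [
            {
              name: "create_issue",
              arguments: {
                title: "Login page returns 500 on Safari",
                labels: ["bug"],
              },
            },
          ],
        },
      ],
    };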

dbreunig 8 days ago | parent | prev [-]

Came here to say this: people present MCP’s verbosity as all the context the LLM needs. But almost always, this isn’t the case.

I wrote recently, “Connecting your model to random MCPs and then giving it a task is like giving someone a drill and teaching them how it works, then asking them to fix your sink. Is the drill relevant in this scenario? If it’s not, why was it given to me? It’s a classic case of context confusion.”

https://www.dbreunig.com/2025/07/30/how-kimi-was-post-traine...