I've been able to avoid context size issues by applying one simple constraint to my agent loop: I prevent all tool calling in the user's top-level conversation thread. Anything that needs to call a tool must happen in a recursive invocation of the agent, which returns its results to the caller.

I can keep the same high level conversation going for an entire day over a million LOC+ codebase without ever hitting meaningful token limits. No compaction or summarization tricks needed. I can burn 50 million tokens in recursive calls and still not touch 100k tokens in my root conversation thread.

There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia, but this is still far more efficient than carrying around one big flat context that tries to cover everything all the time.

Recursion is very effective at controlling token use, but it can only go so far. I've not observed any uplift for recursive depth beyond 1. I have seen the agent attempt it a few times, but the practical performance is simply not there. External symbolic recursion does not appear to be something the frontier models have been trained for. They are fantastic at emulating recursion in context, but we don't want that if we are trying to achieve a reduction in token use.
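A minimal sketch of the delegation pattern described above, assuming a stubbed model and a crude word-count tokenizer. `Thread`, `delegate`, `root_turn`, and `MAX_DEPTH` are hypothetical names for illustration, not the commenter's actual harness. It only shows where the bulky tool output ends up: in a throwaway child thread, with the root keeping just the task and a short result.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_DEPTH = 1  # assumption: cap taken from the "no uplift beyond depth 1" remark

Tool = Callable[[str], str]


@dataclass
class Thread:
    """One conversation context. Token count is a crude whitespace word count."""
    depth: int
    messages: list[str] = field(default_factory=list)

    @property
    def tokens(self) -> int:
        return sum(len(m.split()) for m in self.messages)


def delegate(task: str, tools: dict[str, Tool], depth: int = 1) -> tuple[str, int]:
    """Run `task` in a fresh child thread that is allowed to call tools.

    Returns (short result, tokens spent in the child). A real agent would let
    the model pick tools; this stub simply calls each tool once.
    """
    if depth > MAX_DEPTH:
        raise RecursionError(f"depth {depth} exceeds MAX_DEPTH={MAX_DEPTH}")
    child = Thread(depth=depth, messages=[task])
    for tool in tools.values():
        child.messages.append(tool(task))  # bulky output stays in the child
    result = f"done: {task} ({len(tools)} tool calls)"
    return result, child.tokens


def root_turn(root: Thread, task: str, tools: dict[str, Tool]) -> int:
    """Handle one user request at the root: delegate, keep only the result."""
    root.messages.append(task)
    result, spent = delegate(task, tools)
    root.messages.append(result)
    return spent


if __name__ == "__main__":
    tools = {"read_file": lambda _task: "line " * 5000}  # fake 5000-word file
    root = Thread(depth=0)
    spent = root_turn(root, "find the auth bug", tools)
    print(f"child spent {spent} tokens, root holds {root.tokens}")
```

The child thread is discarded once `delegate` returns, so the cost of each descent is re-establishing context in the child, not a permanently larger root.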

▲

nijave 2 hours ago | parent | next [-]

Claude Code seems to automatically do this in some cases. It seems to have some heuristic "will eat a lot of context" where it decides to dispatch a sub agent.

I see it pretty frequently in troubleshooting and data analysis flows where it will dump the data collection and aggregation into a sub agent then pull out a summarized result.

I'll do something similar where I have the main agent maintain context in a design doc/markdown file and update it as it goes along. Then I can clear/restart/hand off at will.

▲

gbro3n 4 hours ago | parent | prev | next [-]

This makes intuitive sense. Can I ask what harness you're using that allows you to configure the constraint and how?

▲

KronisLV 4 hours ago | parent | prev | next [-]

For anyone using Claude Code: ask it to do all the work in workflows (it has a tool for that). They released that feature together with Opus 4.8, and it also seems a bit better at doing long tasks. The main conversation just orchestrates the work at that point.

▲

password4321 4 hours ago | parent | prev | next [-]

This is interesting to me because reducing context & token usage is in the user's best interest but not in the financial interest of AI vendors. I am not an expert but it sounds like your "one simple trick" would fix context issues and allow much tighter control over token usage. Thanks for being willing to share this tip in an HN comment, changing how those in the know use AI agents going forward -- it's hard to keep up!

▲

loeg 21 minutes ago | parent | next [-]

> This is interesting to me because reducing context & token usage is in the user's best interest but not in the financial interest of AI vendors.

AI vendors still need to compete with each other on both token cost and competency. An agent that wastes tokens is costlier and less effective, which makes it less competitive.

▲

Jgrubb 2 hours ago | parent | prev [-]

The tokens are still being burnt; it just happens in a parallel dimension from the user's main context window.

	▲	ajmurmann an hour ago | parent | next [-]
		It's true that the initial tool response still has the same number of tokens, but it doesn't keep getting dragged along in the longer-lived top context.
	▲	ViewTrick1002 2 hours ago | parent | prev [-]
		The real benefit is being able to use a cheaper, but good enough, model with a specific system prompt dedicated to that task.

▲

Muromec 4 hours ago | parent | prev | next [-]

I have a different way, but I'm still trying to figure out how well it works. Instead of going into recursion, the agent is allowed to restart the thread by doing a summarize/debrief/reflect pass, writing key findings into persistent memory and rewriting the prompt whenever the context grows too large or it gets stuck. Recursion with TCO, if you will.

In a way it's a generalization of the spec-driven approach, but in addition to the formal spec, the carryover buffer lives in the memory.
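A rough sketch of that restart-with-carryover idea, under stated assumptions: `Memory`, `count_tokens`, and `maybe_restart` are hypothetical names, the token count is a whitespace word count, and `summarize` stands in for whatever model call does the debrief. Only the "context too large" trigger is shown, not the "stuck" one.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Memory:
    """Persistent carry-over buffer that survives thread restarts."""
    findings: list[str] = field(default_factory=list)


def count_tokens(messages: list[str]) -> int:
    """Crude stand-in for a tokenizer: whitespace-separated words."""
    return sum(len(m.split()) for m in messages)


def maybe_restart(
    messages: list[str],
    memory: Memory,
    spec: str,
    limit: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Return the thread to continue with.

    Under the limit the thread is returned unchanged. Over it, the thread is
    debriefed into memory and replaced by a fresh one seeded with the spec
    and the accumulated findings, so context is reused rather than stacked.
    """
    if count_tokens(messages) <= limit:
        return messages
    memory.findings.append(summarize(messages))
    carryover = "Findings so far:\n" + "\n".join(memory.findings)
    return [spec, carryover]
```

Calling `maybe_restart` before every model turn keeps the live thread bounded by roughly the limit plus the size of the spec and findings, which is the sense in which it resembles tail-call optimization.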

	▲	stogot 3 hours ago | parent [-]
		Kiro does this automatically from what I can tell using it

▲

Etheryte 5 hours ago | parent | prev | next [-]

How do you get the agent to stick to it without constantly rejecting tool calls with the same description? I've tried a similar setup a number of times and it tends to forget about this constraint very quickly.

	▲	bob1029 5 hours ago | parent | next [-]
		The tool itself enforces the constraint. This is deterministic. If an agent tries to read a big fat file in root, it gets an error from that tool's implementation that reiterates the requirement. I don't bother warning it in the system prompt anymore. It's pointless. I let it bump its head as required. A few hundred tokens and the agent is back on track each time.
	▲	embedding-shape 5 hours ago | parent | prev [-]
		If the model isn't following the system/developer prompts easily, you might want to try a bigger/better model; whether it follows what you tell it tends to be mostly about model quality. Besides that, conflicting directions in the system/developer prompts can also lead to the model seemingly ignoring instructions.
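One way the tool-enforced constraint bob1029 describes might look, as a sketch only. The `child_only` decorator, the `depth` keyword, and the `read_file` tool are assumptions made for illustration; the thread does not say how the real harness tracks depth. The error comes back as ordinary tool output so the model can read it and correct course.

```python
from functools import wraps
from typing import Callable

ROOT_DEPTH = 0


def child_only(tool: Callable[..., str]) -> Callable[..., str]:
    """Wrap a tool so that calling it from the root thread returns an error.

    The error is returned (not raised) so it reaches the model as a normal
    tool result that restates the requirement.
    """
    @wraps(tool)
    def guarded(*args, depth: int, **kwargs) -> str:
        if depth == ROOT_DEPTH:
            return (
                f"ERROR: {tool.__name__} is not available in the root thread. "
                "Use call(task) to delegate this work to a sub-agent."
            )
        return tool(*args, **kwargs)
    return guarded


@child_only
def read_file(path: str) -> str:
    """Hypothetical file-read tool; the harness supplies `depth` on each call."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()
```

Because the check lives in the tool, it holds regardless of what the system prompt says or whether the model remembers it.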

▲

throwaway314155 3 hours ago | parent | prev | next [-]

How do you get something like this set up?

▲

WithinReason 5 hours ago | parent | prev [-]

Which tools? Even file reads and writes?

▲

bob1029 5 hours ago | parent [-]

Especially these things.

The only tools permissible to root in my scheme are call() and return().
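As an illustration of a root limited to call() and return(), here is a small sketch of choosing the advertised tool set by depth. Only `call` and `return` come from the comment; the other tool names, `tools_for`, and the choice to withhold `call` at the deepest level are assumptions.

```python
# Hypothetical tool names; only "call" and "return" are named in the thread.
ALL_TOOLS = {"call", "return", "read_file", "write_file", "grep"}
MAX_DEPTH = 1  # assumption: one level of recursion


def tools_for(depth: int) -> set[str]:
    """Return the tool names an agent at `depth` is allowed to see."""
    if depth == 0:
        return {"call", "return"}  # root can only delegate or answer
    allowed = set(ALL_TOOLS)
    if depth >= MAX_DEPTH:
        allowed.discard("call")  # assumption: no further descent at the cap
    return allowed
```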

	▲	WithinReason 4 hours ago | parent [-]
		Is it in pi.dev? Don't thinking tokens still take up context?