The most interesting thing in here is https://github.com/smhanov/laconic which is the author's "agentic research orchestrator for Go that is optimized to use free search & low-cost limited context window llms".

I have been doing this kind of thing with Cursor and Codex subscriptions, but they do have annoying rate limits, and Cursor on the Auto model seems to perform poorly if you ask it to do too much work, so I am keen to try out laconic on my local GPU.

EDIT:

Having tried it out, this may be a false economy.

The way it works is it has a bunch of different prompts for the LLMs (Planner, Synthesizer, Finalizer).

The "Planner" is given your input question and the "scratchpad" and has to come up with DuckDuckGo search terms.

Then the harness runs the DuckDuckGo search and gives the question, results, and scratchpad to the Synthesizer. The Synthesizer updates the scratchpad with new information that is learnt.

This continues in a loop, with the Planner coming up with new search queries and the Synthesizer updating the scratchpad, until eventually the Planner decides to give a final answer, at which point the Finalizer summarises the information in a user-friendly final answer.
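
For anyone who wants the shape of that loop without reading the Go source, here is a minimal Python sketch. It is not laconic's actual code: the names `research`, `Plan`, and `max_steps` are made up for illustration, each of the four callables stands in for an LLM prompt or the DuckDuckGo search, and the step cap and the Finalizer's inputs are my assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Plan:
    """Planner output (hypothetical type): a search query, or None to stop."""
    query: Optional[str] = None


def research(
    question: str,
    planner: Callable[[str, str], Plan],
    search: Callable[[str], List[str]],
    synthesizer: Callable[[str, List[str], str], str],
    finalizer: Callable[[str, str], str],
    max_steps: int = 8,  # assumed safety cap so the loop always terminates
) -> str:
    scratchpad = ""  # the only state carried from one iteration to the next
    for _ in range(max_steps):
        # Planner sees the question and the scratchpad, nothing else.
        plan = planner(question, scratchpad)
        if plan.query is None:
            break  # Planner has decided it is ready to answer
        results = search(plan.query)
        # Synthesizer folds the new results into the scratchpad.
        scratchpad = synthesizer(question, results, scratchpad)
    # Assumption: the Finalizer gets the question plus the final scratchpad.
    return finalizer(question, scratchpad)
```

The point of the design shows up in the signatures: no call ever receives the full history, only the question, the scratchpad, and at most one batch of search results.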

That is a pretty clever design! It allows you to do relatively complex research with only a very small amount of context window. So I love that.

However, I have found that the Synthesizer step is extremely slow on my RTX 3060, and I think it would cost me about £1/day extra to run the RTX 3060 flat out vs idle. For the amount of work laconic can do in a day (not a lot!), I think I am better off just sending the money to OpenAI and getting the results more quickly.

But I still love the design. This is a very creative way to use a very small context window, and it has the obvious privacy and freedom advantages over depending on OpenAI.

▲ andai 12 hours ago | parent [-]

Yeah, came here to mention that too!

From the article:

>To manage all this, I built laconic, an agentic researcher specifically optimized for running in a constrained 8K context window. It manages the LLM context like an operating system's virtual memory manager—it "pages out" the irrelevant baggage of a conversation, keeping only the absolute most critical facts in the active LLM context window.

The 8K part is the most startling to me. Is that still a thing? I worked under that constraint in 2023 in the early GPT-4 days. I believe Ollama still has the default context window set to 8K for some reason. But the model mentioned on the laconic GitHub page (Qwen3:4B) should support 32K. (Still pretty small, but.. ;)

I'll have to take a proper look at the architecture; extreme context engineering is a special interest of mine :) Back when Auto-GPT was a thing (think OpenClaw but in 2023), I realized that what most people were using it for was just internet research, and that you could get better results, cheaper, faster, and deterministically, by just writing a 30-line Python script.

Google search (or DDG) -> Scrape top N results -> Shove into LLM for summarization (with optional user query) -> Meta-summary.
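
Roughly, that pipeline looks like the sketch below. This is a reconstruction for illustration, not the original script: `research_once`, `top_n`, and the three callables (`search`, `fetch`, `summarize`) are hypothetical names, and you would plug in your own search API, HTTP fetcher, and LLM call.

```python
from typing import Callable, List


def research_once(
    query: str,
    search: Callable[[str], List[str]],    # query -> list of result URLs
    fetch: Callable[[str], str],           # URL -> page text
    summarize: Callable[[str, str], str],  # (text, query) -> summary via LLM
    top_n: int = 3,
) -> str:
    """Fixed pipeline: search, scrape top N, summarize each, then meta-summarize."""
    urls = search(query)[:top_n]
    summaries = []
    for url in urls:
        page_text = fetch(url)
        # One LLM call per page, so each call only ever sees a single page.
        summaries.append(summarize(page_text, query))
    # Final LLM call: a summary of the per-page summaries.
    return summarize("\n\n".join(summaries), query)
```

The control flow is ordinary code, so the number of LLM calls is always one per page plus one for the meta-summary, and the model never decides what happens next.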

In such straightforward, specialized scenarios, letting the LLM drive was, and still is, "swatting a fly with a plasma cannon."

(The analog these days would be that many people would be better off asking Claw to write a scraper for them, than having it drive Chromium 24/7...)

▲ jstanley 11 hours ago | parent [-]

> (The analog these days would be that many people would be better off asking Claw to write a scraper for them, than having it drive Chromium 24/7...)

Possibly. But possibly you have a very long tail of sites that you hardly ever look at, and that change more frequently than you use them, and maintaining the scraper is harder work than just using Chromium.

The dream is that the Claw would judge for itself whether to write a scraper or hand-drive the browser.

That might happen more easily if LLMs were a bit lazier. If they didn't like doing drudgery they would be motivated to automate it away. Unfortunately they are much too willing to do long, boring, repetitive tasks.

▲ andai 10 hours ago | parent [-]

Yeah, I think the ideal setup is two-tier:

extremely lazy, large model

extremely diligent Ralph

Not sure if the top model should be the biggest one though. I hear opposite opinions there: a small model which delegates coding to bigger models, vs a big model which delegates coding to small models.

The issue is you don't want the main driver to be big, but it needs to be big enough to have common sense w.r.t. delegating both up[0] and down...

[0] i.e. "too hard for me, I will ping Opus ..." :) Do models have that level of self-awareness? I wanna say they can, after a failed attempt, but my failure mode is that the model "succeeds" but the solution is total ass.
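
One way to sidestep the self-awareness question is to not ask the model at all and let an external check decide when to delegate up. A minimal sketch of that idea, assuming you have some `check` you trust (tests, a linter, a reviewer model); `solve_with_escalation`, `tiers`, and `check` are hypothetical names, not any real tool's API:

```python
from typing import Callable, Optional, Sequence, Tuple


def solve_with_escalation(
    task: str,
    tiers: Sequence[Callable[[str], str]],  # models ordered cheapest first
    check: Callable[[str, str], bool],      # external verdict on (task, answer)
) -> Optional[Tuple[int, str]]:
    """Return (tier index, answer) for the first answer that passes check."""
    for index, model in enumerate(tiers):
        answer = model(task)
        # The model's own claim of success is ignored; only check counts.
        if check(task, answer):
            return index, answer
    return None  # every tier failed the check
```

This only helps as much as `check` does, of course: if the check cannot tell a bad solution from a good one, the cheap model still "succeeds".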

▲ drewstiff 9 hours ago | parent [-]

Re: your footnote, Anthropic certainly seem to think so [0]

[0] https://claude.com/blog/the-advisor-strategy