Agent orchestration for the timid

When I want to learn code or understand a new architecture, I stick to stage 1. When I want to validate an idea, going YOLO at stage 5 and beyond makes perfect sense. I might have to try one of these orchestrators one day, but only when I'm regularly getting stopped because I've hit my credit limit. For my current usage, I'm pretty happy where I'm at.

px1999 2 hours ago | parent | prev | next [-]

Imo there's a huge blind spot forming between 6 and 8 when talking to people and in reading posts by various agent evangelists - few people seem to be focussing on building "high quality" changes vs maximising throughput of low quality work items.

My (boring b2b/b2e) org has scripts that wrap a small handful of agent calls to handle/automate our workflow. These have been incredibly valuable.

We still 'yolo' into PRs, use agents to improve code quality, and do initial checks via gating. We're trying to get docs working through the same approach. We see huge value in automation and lightweight orchestration of agents, but other parts of the whole system are the bottleneck, so there's no real point in running more than a couple of agents concurrently - Claude could already build a low quality version of our entire backlog in a week.

Is anyone exploring the (imo more practically useful today) space of using agents to put together better changes vs "more commits"?

	lemming 13 minutes ago | parent | next [-]
		> Is anyone exploring the (imo more practically useful today) space of using agents to put together better changes vs "more commits"?

		Yes, I am, although not really in public yet. I use the pi harness, which is really easy to extend. I’m basically driving a deterministic state machine for each code ticket, which starts with refining a short ticket into a full problem description by interviewing me one question at a time, then converts that into a detailed plan with individual steps.

		Then it implements each step one by one using TDD, and each bit gets reviewed by an agent in a fresh context. So first tests are written, and they’re reviewed to ensure they completely cover the initial problem, and any problems are addressed. That goes round a loop till the review agent is happy, then it moves to implementation. Same thing, implementation is written, loop until the tests pass, then review and fix until the reviewer is happy. Each sub task gets its own commit.

		Then when all the tasks are done, there’s an overall review that I look at. Then if everyone is happy the commits get squashed and we move to manual testing. The agent comes up with a full list of manual tests to cover the change, sets up the test scenarios and tells me where to debug in the code while working through each test case so I understand what’s been implemented.

		So this is semi automated - I’m heavily involved at the initial refine stage, then I check the plan. The various implementation and review loops are mostly hands off, then I check the final review and do the manual testing obviously.

		This is definitely much slower than something like Gas Town, but all the components are individually simple, the driver is a deterministic program, not an agent, and I end up carefully reviewing everything. The final code quality is very good.

		I generally have 2-4 changes like this ongoing at any one time in tmux sessions, and I just switch between them. At some point I might make a single dashboard with summaries of where the process is up to on each, and whether it needs my input, but right now I like the semi manual process.
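
For readers who want to picture the shape of such a driver, here is a minimal sketch of the test/implement/review loops described above. It is not lemming's code and has nothing to do with the pi harness's API: the `Agents` hooks, `review_loop`, `run_step` and `run_ticket` are all hypothetical names, and each hook is assumed to call an LLM in a fresh context. The point is only that the control flow is ordinary code, not an agent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agents:
    """Hypothetical hooks; each one would call an agent in a fresh context."""
    write_tests: Callable[[str, List[str]], str]      # (step, feedback) -> tests
    review: Callable[[str, str], List[str]]           # (step, artifact) -> issues
    implement: Callable[[str, str, List[str]], str]   # (step, tests, feedback) -> code
    run_tests: Callable[[str, str], bool]             # (tests, code) -> passed?


def review_loop(produce, review, max_rounds=5):
    """Produce an artifact, then revise until the reviewer reports no issues."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        artifact = produce(feedback)
        feedback = review(artifact)
        if not feedback:
            return artifact
    raise RuntimeError("reviewer still unhappy; needs human input")


def run_step(step, agents, log):
    """One plan step: tests first, then implementation, each behind a review."""
    tests = review_loop(lambda fb: agents.write_tests(step, fb),
                        lambda art: agents.review(step, art))
    log.append(f"{step}: tests approved")

    def produce_code(fb):
        # Inner loop: keep implementing until the approved tests pass.
        for _ in range(5):
            code = agents.implement(step, tests, fb)
            if agents.run_tests(tests, code):
                return code
            fb = fb + ["tests failing"]
        raise RuntimeError("tests never passed; needs human input")

    code = review_loop(produce_code, lambda art: agents.review(step, art))
    log.append(f"{step}: implementation approved, commit")
    return code


def run_ticket(steps, agents):
    """Drive every step of the plan in order; one commit per step."""
    log: List[str] = []
    for step in steps:
        run_step(step, agents, log)
    log.append("all steps done: overall review, squash, manual testing")
    return log
```

The retry limits are arbitrary; the useful property is that when a loop does not converge the driver stops and asks for a human instead of letting an agent decide what to do next.
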
	CuriouslyC 10 minutes ago | parent | prev | next [-]
		I have a code quality analysis tool that I use to "un-slopify" AI code. It doesn't handle algorithms and code semantics, which are still the programmer's domain, but it does a pretty good job of forcing agents to DRY up code, separate concerns, group code more intelligently and generally write decoupled quasi-functional code. It works quite well with the Ralph loop to deeply restructure codebases. https://github.com/sibyllinesoft/valknut
	throwup238 an hour ago | parent | prev [-]
		> Is anyone exploring the (imo more practically useful today) space of using agents to put together better changes vs "more commits"?

		That’s what I’ve been focused on the last few weeks with my own agent orchestrator. The actual orchestration bit was the easy part but the key is to make it self improving via “workflow reviewer” agents that can create new reviewers specializing in catching a specific set of antipatterns, like swallowing errors.

		Unfortunately I've found that what decides acceptable code quality is very dependent on project, organization, and even module (tests vs internal utilities vs production services) so prompt instructions like "don't swallow errors or use unwrap" make one part of the code better while another gets worse, creating a conflict for the LLM.

		The problem is that model eval was already the hardest part of using LLMs and evaluating agents is even harder if not practically impossible. The toy benchmarks the AI companies have been using are laughably inadequate. So far the best I’ve got is “reimplement MINPACK from scratch using their test suite” which can take days and has to be manually evaluated.
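
To make the "depends on the module" problem concrete, here is a small sketch of one possible way to scope reviewer instructions by file path so that a rule meant for production services never reaches the test suite. This is an assumption for illustration, not how throwup238's orchestrator works; the `RULES` table, the path patterns and the rule wording are all made up.

```python
from fnmatch import fnmatch

# Hypothetical rule table: (path pattern, instructions for the reviewer).
# The first matching pattern wins, so the most specific patterns go first.
RULES = [
    ("tests/*", ["unwrap is fine in tests"]),
    ("src/internal/*", ["panicking on programmer error is acceptable"]),
    ("src/*", ["don't swallow errors", "don't use unwrap"]),
]


def rules_for(path):
    """Return the reviewer instructions that apply to one file path."""
    for pattern, rules in RULES:
        # fnmatch's "*" also matches "/", so "src/*" covers nested directories.
        if fnmatch(path, pattern):
            return rules
    return []


def reviewer_prompt(path):
    """Build a prompt fragment containing only the rules for this path."""
    rules = rules_for(path)
    if not rules:
        return f"Review {path} with no project-specific rules."
    bullet_list = "\n".join(f"- {rule}" for rule in rules)
    return f"Review {path} against these rules:\n{bullet_list}"
```

This only moves the conflict out of the prompt and into a table someone has to maintain, and it does nothing for the evaluation problem.
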

xyzsparetimexyz 2 hours ago | parent | prev | next [-]

What kind of basic ass CRUD apps are people even working on that they're on stage 5 and up? Certainly not anything with performance, visual, embedded or GPU requirements.

IanCal 2 hours ago | parent | next [-]

I think you massively underestimate the number of useful apps that are CRUD plus a bit of business logic and styling. They’re useful, can genuinely take time to build, can be unique every time, and yet are not brand new research projects.

xyzsparetimexyz 2 hours ago | parent [-]

No totally, I agree. But I don't think that anyone will be YOLO vibe coding massive changes into Blender or ffmpeg any time soon.

	IanCal 2 hours ago | parent [-]
		Probably not, though additions maybe - I added the feature where the sculpt tool turns as you move it around if I recall right, many moons ago - I don’t think it was that hard but was a useful change.

tjr 2 hours ago | parent | prev [-]

What would be an example of something you think wouldn’t work with 5 or higher? Is there something about GPU programming that LLMs can’t handle?

xyzsparetimexyz 2 hours ago | parent [-]

I doubt they'd do a very good job of debugging a GPU crash, or visual noise caused by forgotten synchronization, or odd looking shadows.

Maybe for some things you could set it up so that the screen output is livestreamed back into the agent, but I highly doubt that anyone is doing that for agents like this yet.

throwup238 an hour ago | parent | next [-]

> Maybe for some things you could set it up so that the screen output is livestreamed back into the agent, but I highly doubt that anyone is doing that for agents like this yet

What do you mean by streaming? LLMs aren’t advanced enough yet to consume a live video feed, but people have been feeding them screenshots from Playwright and desktop apps for years (Anthropic even released the Computer Use feature based on this).

Gemini has the best visual intelligence, but all three of the major models have supported this for a while. I don’t think it’d help with fixing subtle problems in shadows, but it can fix other GUI bugs using visual feedback.

jjmarr an hour ago | parent | prev [-]

I am a GPU programmer (on the compute side), and the biggest challenge is lack of tooling.

For host-side code the agent can throw in a bunch of logging statements and usually printf its way to success. For device-side code there isn't a good way to output debugging info into a textual format understandable by the agent. Graphical trace viewers are great for humans, not so great for AI right now.

On the other hand, Cline's harness can interact with my website and click on stuff until the bugs are gone.

	akiselev 44 minutes ago | parent [-]
		(Shameless plug) I've been using my debugger-cli [1] to enable agents to debug code using debuggers that support the Debug Adapter Protocol. It looks like cuda-gdb supports DAP so I'd love to add support. I just need help from someone who can test it adequately.

		[1] https://github.com/akiselev/debugger-cli
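
For context on why DAP suits text-only agents: its wire format is a `Content-Length` header, a blank line, and a JSON body. The sketch below shows only that framing, as a rough illustration. It is not taken from debugger-cli, the helper names are hypothetical, and the `adapterID` value is a placeholder.

```python
import json


def request(seq, command, arguments=None):
    """Build a DAP request message (seq/type/command are the protocol's fields)."""
    msg = {"seq": seq, "type": "request", "command": command}
    if arguments is not None:
        msg["arguments"] = arguments
    return msg


def encode_dap(message):
    """Frame a message as 'Content-Length: N', a blank line, then the JSON body."""
    body = json.dumps(message).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def decode_dap(data):
    """Return (message, remaining_bytes), or (None, data) if the frame is incomplete."""
    head, sep, rest = data.partition(b"\r\n\r\n")
    if not sep:
        return None, data  # header not fully received yet
    length = None
    for line in head.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip().decode("ascii"))
    if length is None:
        raise ValueError("missing Content-Length header")
    if len(rest) < length:
        return None, data  # body not fully received yet
    return json.loads(rest[:length].decode("utf-8")), rest[length:]
```

Everything a debugger reports (stack frames, variables, stop events) comes back as JSON in this envelope, which is far easier for an agent to read than a graphical trace viewer.
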

MarcelOlsz 3 hours ago | parent | prev [-]

Worth looking into Conductor.build and Sculptor as well. I believe both are Electron and run like sh*t, but Conductor is quite good. Going to give this Vibe Kanban a go, thanks.

Orchestration is cool but a sane orchestration setup with VMs is where it's at.

I've been working on orchestration for the past little while and I've got a very tight loop going where everything is in worktrees and containerized, and all services are isolated, so I can easily work on db schema/migration stuff while a few other agents do frontend work etc. Getting Conductor to play nice with VMs was very tricky, as they say they have no intention of implementing VMs and wrote a "trust me bro, it won't erase your system" blurb about it in their docs [0].

[0] https://docs.conductor.build/faq#what-permissions-do-agents-...
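
As a rough picture of the "worktree per agent, each in its own container" idea, the sketch below only builds the command lines and runs nothing. It is not MarcelOlsz's setup: the `agent/<task>` branch naming, the sibling-directory layout, the `/work` mount point, and the image and agent command in the usage note are all assumptions. The `git worktree add -b` and `docker run --rm/--name/-v/-w` options are the standard ones.

```python
from pathlib import PurePosixPath


def worktree_path(repo, task):
    """Hypothetical layout: a sibling directory named <repo>-<task>."""
    return repo.parent / f"{repo.name}-{task}"


def worktree_cmd(repo, task):
    """Command that creates a new branch and a separate checkout for one task."""
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", f"agent/{task}", str(worktree_path(repo, task))]


def container_cmd(worktree, image, agent_cmd):
    """Command that runs one agent with only its own worktree mounted at /work."""
    return ["docker", "run", "--rm",
            "--name", f"agent-{worktree.name}",
            "-v", f"{worktree}:/work",
            "-w", "/work",
            image, *agent_cmd]
```

For example, `worktree_cmd(PurePosixPath("/home/me/app"), "db-migrations")` followed by `container_cmd(...)` with a hypothetical image gives two argv lists that could be passed to `subprocess.run`. One thing to check before relying on this: a linked worktree's `.git` is a file pointing back into the main repository, so git commands inside the container need that path mounted as well.
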

	xyzsparetimexyz 2 hours ago | parent [-]
		Could you perhaps replace the VMs with bubblewrap instead?
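
A minimal sketch of what the bubblewrap variant might look like, again only building the argv. The flags used (`--ro-bind`, `--bind`, `--dev`, `--proc`, `--tmpfs`, `--chdir`, `--unshare-all`, `--share-net`, `--die-with-parent`) are real `bwrap` options, but the overall policy, the `bwrap_argv` helper, and the agent command in the usage note are assumptions, not a vetted sandbox profile.

```python
def bwrap_argv(worktree, command, allow_network=False):
    """Build a bwrap command line: read-only root, writable worktree only."""
    argv = [
        "bwrap",
        "--ro-bind", "/", "/",          # whole filesystem visible but read-only
        "--dev", "/dev",                # fresh minimal /dev
        "--proc", "/proc",              # fresh /proc for the new PID namespace
        "--tmpfs", "/tmp",              # private scratch space
        "--bind", worktree, worktree,   # the only writable host path
        "--chdir", worktree,
        "--unshare-all",                # new namespaces, including network
        "--die-with-parent",            # kill the sandbox if the driver exits
    ]
    if allow_network:
        argv.append("--share-net")      # opt back in to the host network
    return argv + ["--"] + list(command)
```

For instance, `bwrap_argv("/home/me/app-frontend", ["my-agent", "run"])` (with a hypothetical agent command) gives an argv suitable for `subprocess.run`. Note that a read-only bind of `/` still lets the agent read everything the user can read, so this limits what gets erased rather than what gets seen, and it is weaker isolation than a VM.
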