| ▲ | Solving a Million-Step LLM Task with Zero Errors(arxiv.org) |
| 64 points by Anon84 4 hours ago | 29 comments |
| |
|
| ▲ | cs702 3 hours ago | parent | next [-] |
| Nice! Briefly, the idea is to recursively decompose tasks into the simplest possible steps, recursively call (relatively small) LLMs as agents to execute one step at a time, and use a clever voting scheme to decide how to execute each step. The authors use this technique to get a relatively small LLM to solve Towers of Hanoi with 20 rings (1M steps). All of it in natural language. The most obvious question is whether other, more interesting -- less "rote" -- tasks than Towers of Hanoi can similarly be recursively decomposed into simple steps. I'm not sure that's always possible. |
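A minimal sketch of that per-step loop (call_llm and the other callables are hypothetical placeholders, and a plain majority vote stands in for the paper's actual voting scheme):

    from collections import Counter

    def run_task(initial_state, is_done, step_prompt, call_llm, votes=5):
        """Execute a long task one micro-step at a time, voting on each step."""
        state = initial_state
        while not is_done(state):
            # Sample several independent completions for this single step...
            samples = [call_llm(step_prompt(state)) for _ in range(votes)]
            # ...and keep the next state that most samples agree on.
            state, _ = Counter(samples).most_common(1)[0]
        return state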
| |
| ▲ | patcon 16 minutes ago | parent | next [-] | | > into the simplest possible steps, recursively call (relatively small) LLMs as agents to execute one step at a time, and use a clever voting scheme to decide how to execute each step. It's like humans! Everything old is new again :) | |
| ▲ | wordpad an hour ago | parent | prev | next [-] | | This works because the problem can be broken down into prompts that rarely hallucinate. Most real-world prompts can't be reduced to something so consistent and reliable. Their key finding was that the number of votes grows linearly with the number of prompts you are trying to chain. The issue, however, is that the number of votes you need grows exponentially with the hallucination rate. | |
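A back-of-envelope check of that compounding effect, using assumed error rates (not figures from the paper) and simple majority voting over independent samples per step:

    from math import comb

    def majority_error(e, n):
        # Probability that more than half of n independent samples are wrong.
        return sum(comb(n, k) * e**k * (1 - e)**(n - k)
                   for k in range(n // 2 + 1, n + 1))

    steps = 1_000_000
    e = 0.01  # assumed per-sample error rate on a single step
    for n in (1, 5, 11):
        p_zero_errors = (1 - majority_error(e, n)) ** steps
        print(f"{n} votes/step: P(all {steps} steps correct) ~ {p_zero_errors:.4f}")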
| ▲ | adastra22 2 hours ago | parent | prev | next [-] | | Why not? That's basically how NASA manages large projects. | | |
| ▲ | pinkmuffinere an hour ago | parent | next [-] | | Reasoning by analogy is great for intuition, but doesn’t guarantee real results hold. Consider “voltage is like water pressure in pipes, so if there’s a cut in my wire’s insulation, the device won’t get enough voltage” — clearly this is not true, even though it relies on an analogy that’s generally useful. | | |
| ▲ | alwa an hour ago | parent | next [-] | | I really like that analogy, thank you for it. Also applies to “it’s overvoltage, so I just need to poke a little hole in it to let the excess bleed out”… | | | |
| ▲ | CamperBob2 23 minutes ago | parent | prev [-] | | Well, corona losses are a thing, after all. |
| |
| ▲ | Uehreka 2 hours ago | parent | prev | next [-] | | One issue I often run into with this stuff is the tightly coupled nature of things in the real world. I’ll fashion an example: Let’s say you break a job down into 3 tasks: A, B and C. Doing one of those tasks is too much for an LLM to accomplish in one turn (this is something you learn intuitively through experience), but an LLM could break each task into 3 subtasks. So you do that, and start by having the LLM break task A into subtasks A1, A2 and A3. And B into B1, B2 and B3. But when you break down task C, the LLM (which needs to start with a fresh context each time since each “breakdown” uses 60-70% of the context) doesn’t know the details of task A, and thus writes a prompt for C1 that is incompatible with “the world where A1 has been completed”. This sort of “tunnel vision” is currently an issue with scaling 2025 agents. As useful context lengths get longer it’ll get easier, but figuring out how to pack exactly the right info into a context is tough, especially when the tool you’d reach for to automate it (LLMs) is the same tool that suffers from these context limitations. None of this means big things aren’t possible, just that the fussiness of these systems increases with the size of the task, and that fussiness leads to more requirements of “human review” in the process. |
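One common mitigation, sketched here with a hypothetical call_llm(prompt) helper: thread a short running summary of earlier breakdowns into each fresh-context breakdown, so the prompt for C1 at least knows what was decided for A and B.

    def break_down_all(tasks, call_llm):
        plans, summary = {}, "nothing decided yet"
        for name, task in tasks.items():
            # Each breakdown starts from a fresh context, so this summary is the
            # only channel through which earlier decisions reach it.
            plans[name] = call_llm(
                f"Decisions so far: {summary}\nBreak this task into 3 subtasks: {task}")
            summary = call_llm(
                f"Fold the new plan into the summary.\nSummary: {summary}\nNew plan: {plans[name]}")
        return plans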
| ▲ | th0ma5 11 minutes ago | parent | prev | next [-] | | This is a really good analogy, because the complex intersections between multiple groups working independently while trying to fit into one collaborative hierarchy toward a single large goal were one of the things that hid a lot of the problems that led to the Challenger disaster, according to Feynman. |
| ▲ | Julien_r2 2 hours ago | parent | prev | next [-] | | I could imagine that even a small task at NASA involves more knowledge and logic than the smallest task in a Tower of Hanoi problem. It depends on what is considered small enough for the LLM to resolve with high confidence. |
| ▲ | etamponi 2 hours ago | parent | prev | next [-] | | "basically" is doing a lot of work in this sentence. | |
| ▲ | mulmen 2 hours ago | parent | prev [-] | | NASA has done a lot of amazing things but I wouldn’t bet on them winning a Super Bowl. | | |
| |
| ▲ | naasking an hour ago | parent | prev [-] | | > All of it using natural language. Combining this with those approaches that recursively reason in latent space would be interesting. |
|
|
| ▲ | htrp 2 hours ago | parent | prev | next [-] |
| > The approach relies on an extreme decomposition of a task into subtasks, each of which can be tackled by focused microagents. The high level of modularity resulting from the decomposition allows error correction to be applied at each step through an efficient multi-agent voting scheme.
| Big if that the decomposition and the voting happen accurately for anything other than toy problems |
| |
| ▲ | yorwba 2 hours ago | parent [-] | | The approach in the paper specifically addresses the case where an LLM can usually solve a task when it requires few steps, but fails for the same kind of task with more steps because it randomly gets a step in the middle wrong and then derails. It can't do anything for tasks that the LLM can't solve even when there's just a few steps. In other words, it compensates for random error, not systematic error. |
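A toy simulation of that distinction (the error rates are illustrative assumptions, not numbers from the paper): voting over independent samples crushes random slips, but if every sample shares the same bias, voting just locks in the wrong answer.

    import random

    def majority_wrong_rate(sample_is_wrong, n=11, trials=10_000):
        # Fraction of trials in which the majority of n samples is wrong.
        wrong = sum(1 for _ in range(trials)
                    if sum(sample_is_wrong() for _ in range(n)) > n // 2)
        return wrong / trials

    random_slip = lambda: random.random() < 0.05   # independent 5% error per sample
    systematic  = lambda: random.random() < 0.95   # model is confidently wrong almost every time
    print(majority_wrong_rate(random_slip))   # ~0.0
    print(majority_wrong_rate(systematic))    # ~1.0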
|
|
| ▲ | andai 2 hours ago | parent | prev | next [-] |
| Worth opening the pdf just for the graph on page 1. |
| |
| ▲ | arscan an hour ago | parent | next [-] | | In case you want to know what’s going on in the left side of that chart, they give a log-scale version in Appendix A. I was thinking it was silly not to just use that version for the main chart, but I guess log scales make big differences ‘feel’ smaller. |
| ▲ | mNovak an hour ago | parent | prev [-] | | Really seems like the reason logarithmic scales were invented.. |
|
|
| ▲ | vatsachak 16 minutes ago | parent | prev | next [-] |
| And you can decompose the proof of Fermat's last theorem into logical combinators. The meat is in decomposing the difficult problem into steps |
|
| ▲ | andai 2 hours ago | parent | prev | next [-] |
| I have ADHD and the same approach works for me. (In fact, most days it is essential!) |
| |
| ▲ | binary132 35 minutes ago | parent [-] | | do you have an algorithm for breaking down, organizing, and scheduling the small tasks, though? can it also be broken down? |
|
|
| ▲ | mattpk an hour ago | parent | prev | next [-] |
| Here is the pseudocode of MAKER:

    state = init_state()
    while state is not complete:
        state = LLM("You are a helpful assistant. The rules and format of the game is [...]. The correct strategy to use at each step is [...]. The current state is [...]. Output the state after making the next move")
|
|
| ▲ | LMKIIW 3 hours ago | parent | prev | next [-] |
| I dunno, even though the authors address the choice, using Tower of Hanoi as the task doesn't live up to the excitement of the title.
| |
| ▲ | charcircuit an hour ago | parent | next [-] | | Especially since it's a recursive problem, so each step is naturally broken up into subtasks. And the algorithm for which subtasks to break it into is public. This makes it much easier to get down to a case that the LLM can reliably solve. |
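That public algorithm is just the textbook recursion, in which every instance splits into two smaller instances plus one primitive move:

    def hanoi(n, source, target, spare):
        """Return the full move list for n rings, as (from_peg, to_peg) pairs."""
        if n == 0:
            return []
        return (hanoi(n - 1, source, spare, target)     # clear the top n-1 rings out of the way
                + [(source, target)]                    # move the largest ring
                + hanoi(n - 1, spare, target, source))  # stack the n-1 rings back on top

    # 20 rings gives 2**20 - 1 = 1,048,575 moves -- the ~1M steps in the paper.
    print(len(hanoi(20, "A", "C", "B")))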
| ▲ | NitpickLawyer 2 hours ago | parent | prev [-] | | And yet the reverse paper was posted ad nauseam, covered by every news slop site, and overblown with really negative takes. |
|
|
| ▲ | awei 2 hours ago | parent | prev | next [-] |
| one issue I see is that steps in a plan may depend on one another: you cannot always know the next steps exactly before seeing the results of the previous ones, and you may sometimes have to backtrack
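A minimal sketch of that sequential case, with hypothetical goal_reached, plan_next_step, execute, and is_valid helpers: each step can only be chosen after the previous result is observed, and a bad result rolls back to the prior state.

    def run_with_backtracking(state, goal_reached, plan_next_step, execute, is_valid):
        history = []
        while not goal_reached(state):
            step = plan_next_step(state)      # plan only the next step, not the whole sequence
            result = execute(step, state)
            if is_valid(result):
                history.append(state)
                state = result
            elif history:
                state = history.pop()         # backtrack to the previous known-good state
        return state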
|
| ▲ | zer00eyz 2 hours ago | parent | prev [-] |
On the surface this is an interesting concept... The paper, however, meh... No mention of MoE, even though one would think this is a logical evolution of it (at least none that I saw). Its own choice of task, Towers of Hanoi, is admittedly weak. LLM papers are starting to look like the last decade of JS frameworks and tools, only with less code and more academics. That's disappointing, because I think a lack of pragmatism and grounding is now holding the field back...
| On the surface this is an interesting concept... The paper however, meh... No mention of MoE. One would think this is a logical evolution of that but not a mention (that I saw). Its own rubric for the task, Towers of Hanoi, was admittedly weak. LLM papers are starting to look like the last decade of JS frameworks and Tools. Only with less code and more academics, and thats disappointing, because I think a lack of pragmatism and grounding is now holding the field back... |