algolint 3 days ago

Ensembling usually hits a wall at latency and cost. Running the models in parallel is table stakes, but how are you handling orchestration-layer overhead when one provider (e.g., Vertex or Bedrock) spikes in P99 latency? If you're waiting for the slowest model to get entropy stats, the DX falls off a cliff. Are you using speculative execution or a timeout/fallback strategy to maintain a responsive TTFT?

supai 3 days ago | parent | next [-]

A few things:

- We do something similar to OpenRouter, which measures the latency of the different providers, so we always get the fastest results

- Users can cancel a single model stream if it's taking too long

- The orchestrator is pretty good at choosing which models to use for which task. The confidence scoring and synthesis at the end is the hard part that you can't do naively, but the orchestrator plays the biggest role in optimizing cost and speed. I've made sure we don't exceed 25% extra in cost or time for the vast majority of queries, compared to equivalent prompts in ChatGPT/Gemini/etc.

The reason this is viable, IMO, is that you can run multiple less-intelligent models at lower thinking efforts and beat a single more-intelligent model with a large thinking effort. The reduced thinking effort speeds up each prompt dramatically.

The sequential steps are then:

1. Ensemble RAG

2. Orchestrator

3. Models in parallel

4. Synthesizer

Then retries for low-confidence answers (though that's pretty optimized with selective retries of portions of the answer).
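The four sequential stages can be sketched in asyncio. To be clear, every name here (call_model, the confidence scores, the stubbed retrieval) is a placeholder I'm assuming for illustration, not the actual API described above:

```python
# Minimal sketch of the pipeline: RAG -> orchestrator -> parallel models -> synthesizer.
import asyncio

CONFIDENCE = {"model-a": 0.6, "model-b": 0.9, "model-c": 0.7}  # fake scores

async def call_model(model: str, prompt: str, context: str) -> dict:
    # Stand-in for a provider call; a real one would stream tokens.
    await asyncio.sleep(0)
    return {"model": model, "answer": f"{model}: ...", "confidence": CONFIDENCE[model]}

async def answer(prompt: str) -> dict:
    context = "retrieved passages"               # 1. Ensemble RAG (stubbed)
    models = ["model-a", "model-b", "model-c"]   # 2. Orchestrator picks models
    # 3. Run the chosen models concurrently.
    results = await asyncio.gather(*(call_model(m, prompt, context) for m in models))
    # 4. Synthesizer: here we just keep the highest-confidence answer; the real
    #    system merges answers and selectively retries low-confidence portions.
    return max(results, key=lambda r: r["confidence"])

print(asyncio.run(answer("example prompt"))["model"])  # model-b
```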

mememememememo 2 days ago | parent | prev [-]

You could time out. You could trade them off dynamically.

E.g., you get 3 replies at 80% confidence. You decide that at 80% you're in fairly good shape, but happy to wait up to 5 seconds for completion / 500 ms for time to first token. If either budget is breached, you return the current answer.

But if you're at 5% confidence, you wait up to 60 s total / 2 s for a first token, since the upside of that still-pending model is much higher.

Basically you're wagering time against quality in a dynamic prediction market sitting in front of the LLM.
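A toy sketch of this "wager time for quality" idea, using the thresholds from the example above (the 5 s / 60 s cutoffs and the helper names are illustrative, not anyone's production system):

```python
# Wait longer for a straggler model when current confidence is low.
import asyncio

def time_budget(confidence: float) -> tuple:
    """Return (total_seconds, ttft_seconds) to wait for the pending model."""
    if confidence >= 0.8:
        return 5.0, 0.5   # confident: only a short grace period
    return 60.0, 2.0      # uncertain: the upside of waiting is much higher

async def slow_model() -> str:
    await asyncio.sleep(0.01)  # stand-in for a slow provider call
    return "late answer"

async def answer_with_budget(current_best: str, confidence: float) -> str:
    total, _ttft = time_budget(confidence)
    try:
        # Wait for the straggler only as long as the budget allows.
        return await asyncio.wait_for(slow_model(), timeout=total)
    except asyncio.TimeoutError:
        return current_best  # budget breached: ship what we have

print(asyncio.run(answer_with_budget("early answer", 0.8)))  # late answer
```

A fuller version would also enforce the TTFT budget separately by racing the first streamed token against `_ttft`.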

kenmu 2 days ago | parent | next [-]

Love your idea. We have timeout mechanisms, and we originally were pretty aggressive with timeouts based on both time and response length, to balance accuracy and speed. There's research showing that longer responses tend to be less accurate (compared to other responses to the same prompt), so we came up with an algorithm that optimized this tradeoff very effectively. However, we eventually removed it to avoid losing any accuracy or comprehensiveness. We have other systems, including confidence scoring, that are pretty effective at judging long responses and weighting them accordingly.

We may reintroduce some of the above with user-configurable levers.

all2 2 days ago | parent | prev [-]

If we treat LLM output like manufacturing output: with three independent 80% probabilities you actually have something like 0.8 × 0.8 × 0.8 = 0.512, or about 51%.

scottmu 2 days ago | parent [-]

Yes, there's a wide variety of use cases that require different accuracy/speed tradeoffs. If you require all 3 responses to be accurate, you have to multiply the 3 accuracy probabilities, and as you've shown, this can reduce overall accuracy quite a bit. Of course, this assumes the 3 responses are independent of one another.
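The arithmetic in this subthread, assuming independence. The contrast is worth making explicit: requiring *all* n responses to be correct multiplies the probabilities down, while a majority vote over the same responses pushes accuracy up:

```python
# Joint vs. majority accuracy for n independent responses, each correct with prob p.
from math import comb

def all_correct(p: float, n: int) -> float:
    """All n responses correct (the case discussed above): p^n."""
    return p ** n

def majority_correct(p: float, n: int) -> float:
    """At least a strict majority of the n responses correct (binomial sum)."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

print(round(all_correct(0.8, 3), 3))       # 0.512 -- requiring all three hurts
print(round(majority_correct(0.8, 3), 3))  # 0.896 -- a 2-of-3 vote helps
```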

all2 a day ago | parent [-]

One thing I considered some months ago that was very similar to what you guys have done, but at a higher abstraction layer:

1. Consult many models (or a single model with higher temp) with the same prompt

2. Intelligently chunk the outputs (by entity, concept, subject, etc.)

3. Put each chunk into a semantic bucket (similar chunks live in the same bucket)

4. Select the winning buckets by vote count.

4a. Optionally push the undervoted chunks back into the model contexts for followup: is this a good idea, does it fit with what you recommended, etc.

4b. Do the whole chunk/vote thing again.

5. Fuse outputs. Mention outliers.

Token spend is heavy here, since we rely on LLMs to make decisions instead of the underlying math you guys went with. IMO, the solution y'all reached is far more elegant than my idea.
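The chunk → bucket → vote → fuse steps above can be sketched roughly as follows. Both the chunker and the bucketing function are crude stand-ins: a real version would use an LLM for chunking and an embedding model for semantic similarity, which is where the heavy token spend comes from.

```python
# Sketch of the chunk/bucket/vote/fuse pipeline with stubbed-out intelligence.
from collections import defaultdict

def chunk(output: str) -> list:
    # Step 2 (stub): split each model output into concept-level chunks.
    return [s.strip() for s in output.split(";") if s.strip()]

def bucket_key(chunk_text: str) -> str:
    # Step 3 (stub): a real system clusters by embedding similarity;
    # here we crudely bucket by the chunk's first word.
    return chunk_text.split()[0].lower()

def fuse(outputs: list, min_votes: int = 2) -> tuple:
    buckets = defaultdict(list)
    for out in outputs:
        for c in chunk(out):
            buckets[bucket_key(c)].append(c)
    # Step 4: buckets with enough votes win; step 5: the rest are outliers
    # worth mentioning (or pushing back into the models, per step 4a).
    winners = [cs[0] for cs in buckets.values() if len(cs) >= min_votes]
    outliers = [cs[0] for cs in buckets.values() if len(cs) < min_votes]
    return winners, outliers

outs = ["paris is the capital; population about 2 million",
        "paris is in france; population about 2 million",
        "lyon is notable"]
print(fuse(outs))
```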

scottmu a day ago | parent [-]

I like the direction you're going with this strategy. There are many approaches, nuances, edge cases, and clever tricks to each of these steps, even without taking into account token probability distributions. Very powerful to get it right.