sigmar 4 hours ago

>various open-weighted Chinese models out there. They've kept good pace with flagship models,

I don't think this is accurate. Maybe it will change in the future, but it seems like the Chinese models aren't keeping up on actual training techniques; they're largely using distillation, which means they'll always be catching up and never at the cutting edge. https://x.com/Altimor/status/2024166557107311057

A_D_E_P_T 4 hours ago | parent | next [-]

> they're largely using distillation techniques. Which means they'll always be catching up and never at the cutting edge.

You link to an assumption, and one that's seemingly highly motivated.

Have you used the Chinese models? IMO Kimi K2.5 beats everything but Opus 4.6 and Gemini 3.1... and it's not exactly inferior to the latter, it's just different. It's much better at most writing tasks, and its "Deep Research" mode is by a wide margin the best in the business. (OpenAI's has really gone downhill for some reason.)

nwlieb 3 hours ago | parent [-]

Have you tried the OpenAI deep research in the past week or so? It's been updated to use 5.2 https://x.com/OpenAI/status/2021299935678026168

(I work at OpenAI, but on the infra side of things not on models)

parliament32 2 hours ago | parent | prev | next [-]

Does that actually matter? If "catching up" means "a few months behind" at worst... for free?

sigmar 2 hours ago | parent [-]

For certain use cases, sure, it doesn't matter. But that doesn't make those models cutting edge. Some use cases are adversarial, and 1% lower efficacy matters a lot.

arthurcolle 3 hours ago | parent | prev [-]

I have been using a quorum composed of step-3.5-flash, Kimi K2.5, and GLM-5, and I have found it outperforms Opus 4.5 at a fraction of the cost.

That's pretty cutting edge to me.

EDIT: It's not a swarm — it's closer to a voting system. All three models get the same prompt simultaneously via parallel API calls (OpenAI-compatible endpoints), and the system uses weighted consensus to pick a winner. Each model has a weight (e.g. step-3.5-flash=4, kimi-k2.5=3, glm-5=2) based on empirically observed reliability.

The flow looks like:

  1. User query comes in
  2. All 3 models (+ optionally a local model like qwen3-abliterated:8b) get called in parallel
  3. Responses come back in ~2-5s typically
  4. The system filters out refusals and empty responses
  5. Weighted voting picks the winner — if models agree on tool use (e.g. "fetch this URL"), that action executes
  6. For text responses, it can also synthesize across multiple candidates

The key insight is that cheap models in consensus are more reliable than a single expensive model. Any one of these models alone hallucinates or refuses more than the quorum does collectively. The refusal filtering is especially useful: if one model over-refuses, the others compensate.
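
A minimal sketch of the voting step, under some assumptions: all three models sit behind one OpenAI-compatible gateway (the base_url is hypothetical), a crude prefix check stands in for the real refusal filter, and voting is exact-match, which only really makes sense for structured outputs like tool calls (free-text candidates get synthesized instead, per step 6):

  # Minimal sketch of the weighted-consensus idea, not the actual agent.
  # Model names and weights are from the comment above; the gateway URL
  # and the refusal heuristic are placeholders.
  import asyncio
  from openai import AsyncOpenAI

  WEIGHTS = {"step-3.5-flash": 4, "kimi-k2.5": 3, "glm-5": 2}
  client = AsyncOpenAI(base_url="https://example-gateway/v1")  # hypothetical

  def is_refusal(text: str) -> bool:
      # Crude placeholder; the real filter is presumably smarter.
      t = text.strip().lower()
      return not t or t.startswith(("i can't", "i cannot", "i'm sorry"))

  async def ask(model: str, prompt: str) -> tuple[str, str]:
      resp = await client.chat.completions.create(
          model=model,
          messages=[{"role": "user", "content": prompt}],
      )
      return model, resp.choices[0].message.content or ""

  async def quorum(prompt: str) -> str:
      # Step 2: fan the same prompt out to every model in parallel.
      replies = await asyncio.gather(*(ask(m, prompt) for m in WEIGHTS))
      # Step 4: drop refusals and empty responses, then tally weights.
      votes: dict[str, int] = {}
      for model, text in replies:
          if not is_refusal(text):
              votes[text] = votes.get(text, 0) + WEIGHTS[model]
      if not votes:
          raise RuntimeError("quorum failed: every model refused")
      # Step 5: the candidate with the highest total weight wins.
      return max(votes, key=votes.get)

  print(asyncio.run(quorum("Which URL should be fetched next?")))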

Tooling: it's a single Python agent (~5200 lines) with protocol-based tool dispatch and 110+ operations covering filesystem, git, web fetching, code analysis, media processing, a RAG knowledge base, etc. The quorum sits in front of the LLM decision layer, so the agent autonomously picks tools and chains actions. Purpose is general: coding, research, data analysis, whatever.

I won't paste the output for length, but I just kicked off a prompt to get some info on the recent Trump tariff Supreme Court decision: it fetched stock data from Benzinga/Google Finance, then researched the SCOTUS tariff ruling across AP, CNN, Politico, The Hill, and CNBC, all orchestrated by the quorum picking which URLs to fetch and synthesizing the results, continuing until something like 45 URLs were fully processed. The output was longer than a typical single chatbot response, because you get all the non-determinism of whatever the models actually did during the long-running execution, and then reaching consensus means every response gets at least one (or N) additional passes across the other models.
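
For the curious, "protocol-based tool dispatch" could look something like this; the Tool protocol, the registry, and the fetch_url tool are my guesses at the shape, not the author's actual code:

  # Illustrative guess at protocol-based tool dispatch: every tool
  # satisfies a common Protocol, and the agent dispatches by name.
  import urllib.request
  from typing import Protocol

  class Tool(Protocol):
      name: str
      def run(self, **kwargs) -> str: ...

  class FetchUrl:
      name = "fetch_url"
      def run(self, url: str) -> str:
          # The real agent likely uses a proper HTTP client.
          with urllib.request.urlopen(url) as r:
              return r.read().decode("utf-8", errors="replace")

  # One entry per operation; the actual agent has 110+ of these.
  REGISTRY: dict[str, Tool] = {t.name: t for t in (FetchUrl(),)}

  def dispatch(name: str, **kwargs) -> str:
      # The quorum decides *which* call to make; dispatch just executes it.
      return REGISTRY[name].run(**kwargs)

  print(dispatch("fetch_url", url="https://example.com")[:200])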

Cost-wise, these three models are all either free-tier or pennies per million tokens. The entire session above (dozens of quorum rounds, multiple web fetches) cost less than a single Opus prompt.

earth2mars 3 hours ago | parent | next [-]

When you say quorum, what do you mean? Is it an agent swarm, or are you using all of them in your workflow and finding that independently they perform better than Opus? Curious how you use it (tooling and purpose: coding?)

tmaly 2 hours ago | parent | prev [-]

I have not heard of step-3.5-flash before. But as the other commenter asked, I would love to hear about your quorum technique. What type of projects are you building with the quorum?