Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing (arxiv.org)
129 points by omarsar 3 days ago | 28 comments
hodgehog11 3 days ago
Wow, that was fast. I've thought for a while that ensembling approaches would become the next stage of LLM development after CoT, since it provides yet another effective, independent axis for scaling laws. Great to see that perspective is taking off. The open-weight community has an opportunity to take these ideas and run with them better than OpenAI has.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
whistle650 3 days ago
It seems they use 70% of the benchmark query-answer pairs to cluster and determine which models work best for each cluster (by sending all queries to all models and comparing responses against the ground-truth answers). Then they route the remaining 30% "test" set queries according to those prior determinations. It doesn't seem surprising that this approach would give you Pareto efficiency on those benchmarks.
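
A minimal sketch of that train/route recipe, assuming k-means over query embeddings and a linear performance-cost score. The names embed, model.answer, model.cost, and the alpha weight are illustrative placeholders, not the paper's exact formulation:

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_router(queries, answers, models, embed, k=64, alpha=0.5):
        # Cluster the 70% "train" queries in embedding space.
        X = np.stack([embed(q) for q in queries])
        km = KMeans(n_clusters=k, n_init="auto").fit(X)
        # Score each model per cluster: accuracy minus a cost penalty.
        scores = np.zeros((k, len(models)))
        for q, gold, c in zip(queries, answers, km.labels_):
            for m, model in enumerate(models):
                correct = float(model.answer(q) == gold)      # hypothetical API
                scores[c, m] += correct - alpha * model.cost  # perf/cost trade-off
        return km, scores.argmax(axis=1)  # best model index per cluster

    def route(query, km, best, models, embed):
        # Send an unseen (30% "test") query to its nearest cluster's winner.
        c = km.predict(embed(query).reshape(1, -1))[0]
        return models[best[c]]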
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
bachittle 3 days ago
I’m fascinated by this new paradigm. We’ve more or less perfected Mixture-of-Experts inside a single model, where routing happens between subnetworks. What GPT-5 auto (and this paper) are doing is a step further: “LLM routing” across multiple distinct models. It’s still rough right now, but it feels inevitable that this will get much better over time.
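
For contrast, a toy numpy sketch of the internal step this generalizes: a generic softmax top-k MoE gate, not any specific model's architecture. LLM routing lifts the same select-and-dispatch move outside the model, picking a whole model per query instead of a subnetwork per token:

    import numpy as np

    def moe_route(x, experts, gate_w, k=2):
        # x: (d,) token activation; gate_w: (d, n_experts) learned gate.
        logits = x @ gate_w
        top = np.argsort(logits)[-k:]                # indices of top-k experts
        w = np.exp(logits[top] - logits[top].max())  # stable softmax over top-k
        w /= w.sum()
        # Weighted sum of the chosen subnetworks' outputs for this token.
        return sum(wi * experts[i](x) for i, wi in zip(top, w))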
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
mgreg 3 days ago
Link to repo for those interested: https://github.com/ZhangYiqun018/AvengersPro
hobofan 3 days ago
That's almost the simplest kind of router imaginable, isn't it? Just embed the query and route to the model that has performed best on similar queries in the past? I'm sure that has been documented/tried before, and it almost certainly doesn't work in practice.

The typical counter-example is a simple-sounding query that actually requires complex reasoning: because the query is close in embedding space to other simple-sounding queries, it gets sent to a "dumber" model for efficiency (toy sketch below).

I guess that works out in their benchmarks because, from what it sounds like, they do per-dataset clustering, so the embedding clusters may actually capture "complexity levels". However, if you were to mix all the datasets into one (closer to how you'd encounter queries in most real-world use cases) and cluster against that, this approach would surely break down.
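
A toy illustration of that counter-example, with fabricated embeddings and model assignments: a query that sounds like easy arithmetic embeds near the easy cluster and gets the cheap model, even though it needs deep reasoning.

    import numpy as np

    centroids = {"easy_arithmetic": np.array([1.0, 0.0]),
                 "proof_writing":   np.array([0.0, 1.0])}
    cluster_model = {"easy_arithmetic": "small-cheap-model",
                     "proof_writing":   "large-reasoning-model"}

    def route(q_emb):
        # Nearest-centroid lookup: the same one-step routing as above.
        nearest = min(centroids, key=lambda c: np.linalg.norm(q_emb - centroids[c]))
        return cluster_model[nearest]

    # "Is every even number > 2 the sum of two primes?" sounds like easy
    # arithmetic, so suppose it embeds near that cluster:
    print(route(np.array([0.9, 0.2])))  # -> small-cheap-model, wrongly
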
srekhi 3 days ago
Isn't this what NotDiamond (founded 2 years ago!) has been working to solve? Maybe someone from their team will chime in (cc @t5-notdiamond).
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
biggestfan 3 days ago
Between these kinds of optimizations, improved data center efficiency, and smaller models becoming more capable, I wonder how long it will be before someone manages to make a profitable AI business. Maybe when the race to train better models slows down and they don't need to constantly upgrade capacity.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
visarga 3 days ago
Essentially, instead of modifying the prompt itself, the system intelligently directs the prompt to the LLM that is best suited to handle it, based on its learned performance and efficiency characteristics for similar types of queries. It's externally optimizing people's prompts.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
datadrivenangel 3 days ago
Paper and repo do not mention routing latency, which I think is a concern. Also, the paper has some pie chart crimes on page 6.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PeterStuer 3 days ago
I would prefer this to be optional.
cubefox 3 days ago
Based on my experience, the GPT-5 router either isn't very smart or is deliberately configured to be very stingy. It basically never uses the reasoning model by itself, even if that means it hallucinates nonsense.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
retinaros 2 days ago
Why do we always come up with new words for basic ideas? Test-time compute, test-time router, test-time sleep, test-time slop. It's a router; let's call it a router. In the end, most of these principles are not part of the LLM but part of the API design in front of the LLM. I understand the goal is to abstract this fact away to sell more magic.