roughly | 6 hours ago
One thing I’m becoming curious about with these models is the token count required to achieve these results. Things like “better reasoning” and “more tool usage” aren’t “model improvements” in the colloquial sense; they’re techniques for using the model more in order to steer it better, and they’re closer to “spend more to get more” than “get more for less.” They’re still valuable, but they operate on a different economic tradeoff than the one we’re used to talking about in tech.
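A minimal sketch of that tradeoff, with hypothetical per-token prices and token counts (the arithmetic, not the numbers, is the point):

    # Assumed price and token counts, purely for illustration.
    PRICE_PER_1K_OUTPUT_TOKENS = 0.01  # USD, hypothetical

    def cost_per_task(output_tokens: int) -> float:
        """Dollar cost of one answer given how many tokens the model emitted."""
        return output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS

    # Same question, two hypothetical configurations of the same model:
    plain_answer = cost_per_task(300)       # direct answer, few tokens
    reasoning_answer = cost_per_task(6000)  # long chain of thought plus tool calls

    print(f"plain: ${plain_answer:.4f}, reasoning: ${reasoning_answer:.4f}")
    # The reasoning run may well score higher, but it costs ~20x more per task,
    # so part of the "improvement" is buying more inference, not getting more for less.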
Sol- | 4 hours ago
I also find the implications of this for AGI interesting. If very compute-intensive reasoning leads to very powerful AI, the world might remain much the same for at least a few years even after the breakthrough, because inference compute simply cannot keep up. You might want millions of geniuses in a data center, but perhaps you can only afford one and haven't built out enough compute? Might sound ridiculous to critics of the current data center build-out, but it doesn't seem impossible to me.
marcd35 | 6 hours ago
I'm no expert, and I actually asked Google Gemini a similar question yesterday: "how much more energy is consumed by running every query through Gemini AI versus traditional search?" Turns out the AI result is actually on par with, if not more efficient (power-wise) than, traditional search. I think it said it's the equivalent of the power used to watch 5 seconds of TV per search.

I also asked Perplexity for a report on the most notable arXiv papers. This one was at the top of the list: "The most consequential intellectual development on arXiv is Sara Hooker's "On the Slow Death of Scaling," which systematically dismantles the decade-long consensus that computational scale drives progress. Hooker demonstrates that smaller models—Llama-3 8B and Aya 23 8B—now routinely outperform models with orders of magnitude more parameters, such as Falcon 180B and BLOOM 176B. This inversion suggests that the future of AI development will be determined not by raw compute, but by algorithmic innovations: instruction finetuning, model distillation, chain-of-thought reasoning, preference training, and retrieval-augmented generation. The implications are profound—progress is no longer the exclusive domain of well-capitalized labs, and academia can meaningfully compete again."
mrandish | 5 hours ago
> the token counts to achieve these results

I've also been increasingly curious about better metrics to objectively assess relative model progress. In addition to the decreasing ability of standardized benchmarks to identify meaningful differences in the real-world utility of output, it's getting harder to hold input variables constant for apples-to-apples comparison. Knowing which model scores higher on a composite of diverse benchmarks isn't useful without adjusting for GPU usage, energy, speed, cost, etc.
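A rough sketch of what one such adjustment could look like; the model names, scores, token counts, and prices below are invented for illustration, and score-per-dollar is only one possible normalizer:

    models = {
        # name: (composite benchmark score, avg output tokens per task, $ per 1K output tokens)
        "model_a": (82.0, 900, 0.002),
        "model_b": (85.0, 7000, 0.010),
    }

    for name, (score, tokens, price_per_1k) in models.items():
        cost = tokens / 1000 * price_per_1k
        print(f"{name}: score={score}, cost/task=${cost:.4f}, score per $ = {score / cost:,.0f}")
    # model_b wins on the raw composite, but model_a delivers far more score per dollar --
    # the kind of apples-to-apples view the leaderboard number alone hides.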
nielsole | 5 hours ago
Pareto frontier is the term you're looking for.
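A minimal Pareto-frontier sketch over (cost per task, benchmark score); the model names and numbers are made up purely to show the computation:

    models = [
        ("cheap_small",    0.001, 70.0),
        ("mid_reasoning",  0.020, 84.0),
        ("huge_reasoning", 0.300, 86.0),
        ("dominated",      0.050, 80.0),  # costs more than mid_reasoning and scores lower
    ]

    def on_frontier(candidate, all_models):
        """A model is on the frontier if no other model is both cheaper and better."""
        _, cost, score = candidate
        return not any(c < cost and s > score for _, c, s in all_models)

    frontier = [name for name, *_ in models if on_frontier((name, *dict((n, (c, s)) for n, c, s in models)[name]), models)] if False else \
               [m[0] for m in models if on_frontier(m, models)]
    print(frontier)
    # -> ['cheap_small', 'mid_reasoning', 'huge_reasoning']; "dominated" drops out.
    # Progress claims then read as moves of this frontier rather than single-number wins.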
retinaros | 5 hours ago
Yes. Reasoning has a lot of scammy aspects: just look at the number of tokens it takes to answer on a benchmark and you'll see that some models are just awful.