First impression: Third-party benchmarks or gtfo. Personally, I've never heard of either of these companies before. We're just supposed to take their word that they've matched the best models on the market?

Sakana describes their model as an "Orchestration Model." Does that mean that it's actually a bunch of different models glued together?

lifeformed 9 hours ago | parent | next [-]

Is it actually that hard to make good models, or is it just about the amount of resources you have for training? (This is an actual question, I really don't know.) I'm sure it's not trivial, but does it really take world-class secret knowledge to build on the known existing techniques? I feel like there's tons of low-hanging fruit still to explore, and time and resources are the limiting factor.

MostlyStable 9 hours ago | parent | next [-]

The gap between Grok and Gemini on one side and Claude and ChatGPT on the other suggests that yes, it is that hard.

arw0n 4 hours ago | parent [-]

I suspect that Grok has been ironically lobotomized by pressures to correct its political views.

Similarly, I could imagine the Gemini folks working in a significantly more complex corporate climate, with different parts of Google pushing for different capability focuses. They are lagging by less than a year, so the gap isn't too large yet.

That said, the fact that Anthropic is currently the top dog suggests that talent and execution are incredibly important. A year ago none of my normie friends knew them, and when I suggested using Claude they looked at me the way they do when I recommend Linux.

janalsncm 2 hours ago | parent [-]

That shouldn’t affect Grok’s coding ability. How often are people discussing politics with Claude Code? Writing decent code is just hard, and it’s not just Grok.

thot_experiment 35 minutes ago | parent | next [-]

Not true: aggressive post-training makes models notably dumber.

bwhiting2356 2 hours ago | parent | prev | next [-]

It affects their ability to hire and retain talent.

janalsncm 2 hours ago | parent [-]

If training a good model requires talent, then that’s the answer to the question this thread is asking: is training a good model actually that hard?

black_knight 2 hours ago | parent | prev [-]

Why would these be independent?

janalsncm 2 hours ago | parent [-]

More specifically, political lobotomy shouldn’t affect coding ability.

girvo 43 minutes ago | parent | next [-]

You’d be quite surprised, I think. Fine-tuning a model on one axis can have drastic impacts on another that we as humans would expect to be completely unrelated.

Discordian93 an hour ago | parent | prev | next [-]

Yet empirically it does.

Hamuko 30 minutes ago | parent | prev [-]

It's all a bunch of weights, isn't it? Why wouldn't fiddling with some parts of the weights have cascading effects?

fwipsy 8 hours ago | parent | prev [-]

Not hard to be a fast follower. Lots of companies are ~6-9 months behind. Reaching the actual bleeding edge is much harder.

alwa 2 hours ago | parent | prev | next [-]

My impression is that the answer is yes: it purports to apply the glue on the fly, dynamically, rather than being some kind of new model amalgam.

See also contemporaneous reaction at:

https://news.ycombinator.com/item?id=48624782 (6 days ago, 244 points, 133 comments)

tough an hour ago | parent [-]

Also, Sakana has misrepresented their findings previously, if I remember correctly [1].

[1] https://www.reddit.com/r/singularity/comments/1iwbwgu/sakana...

Ifkaluva 9 hours ago | parent | prev | next [-]

Their release post was on HN recently. The comments seemed to think that it was similar to OpenRouter, not an actual model.

OutOfHere 9 hours ago | parent | prev [-]

Did Anthropic give you third-party benchmarks? Is that what you said to them? Yes, they're important, but the attitude is wrong.

bloppe 9 hours ago | parent | next [-]

Anthropic publishes third-party benchmarks every time they announce a new model.

MostlyStable 9 hours ago | parent [-]

And even if they didn't, they have a track record. Even if we did have benchmarks in this case, I would still wait until people got their hands on it and formed a more holistic opinion.

fwipsy 5 hours ago | parent | prev [-]

Fudging benchmarks is a cheap way to get attention. If the model is really that good, it will have plenty of attention soon enough.

greenavocado 4 hours ago | parent [-]

Yeah, what happened to that scam startup that claimed to have made a model context window breakthrough a few weeks ago?