▲ | starchild3001 5 days ago
Really appreciate the depth of this paper; it's a welcome change from the usual model-announcement blog post. The Zhipu/Tsinghua team laid out not just the 'what' but the 'how,' which is where the most interesting details are for anyone trying to build with or on top of these models.

The post-training methodology (Sec 3) is what really stands out to me. Creating specialized 'expert models' for reasoning, agents, and chat, then distilling their capabilities into a final unified model, is a fascinating approach. It feels like a more structured way to attack the "jack of all trades, master of none" problem that plagues generalist models: instead of just mixing all the data, they essentially have a generalist learn from a committee of specialists (rough sketch of what that can look like at the end of this comment).

A couple of the findings from their RL experiments are pure gold for anyone working in this space. The counter-intuitive result that single-stage RL at the full 64K context length outperforms a progressive, multi-stage approach (Fig 6) is a fantastic lesson; I've seen teams assume the opposite would be true. Also, the pragmatic choice of an XML-like template for function calls to avoid JSON escaping hell (Fig 4) is a small but brilliant engineering decision that makes a huge difference in practice; wrangling escaped code inside JSON strings is a genuine mess (toy illustration below).

The performance on SWE-bench is impressive, putting it in the same league as much larger or proprietary models. What I'd love to see, and maybe others here have thoughts, is whether this hybrid training recipe holds up outside ARC-style evals. For example, do the agentic improvements transfer to messier, real-world workflows where APIs are undocumented, partial failures are common, and user input is full of ambiguity?
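For anyone wondering what the distillation step might look like mechanically, here's a minimal sketch of one common way to do expert-to-generalist distillation: each expert answers prompts from its own domain, and the unified model is fine-tuned on the pooled outputs (sequence-level distillation rather than logit matching). This is my reading, not the paper's exact recipe, and all names here are hypothetical:

    # Sequence-level distillation from domain experts into one unified model.
    # Hypothetical sketch; the actual recipe may add RL or logit-level losses.
    from dataclasses import dataclass

    @dataclass
    class Example:
        prompt: str
        response: str

    def build_distillation_set(experts: dict, prompts_by_domain: dict) -> list[Example]:
        """Each expert answers only prompts from its own domain; the pooled
        (prompt, response) pairs become SFT data for the unified model."""
        data = []
        for domain, expert in experts.items():
            for prompt in prompts_by_domain[domain]:
                data.append(Example(prompt, expert(prompt)))
        return data

    # Stand-in "experts" (real ones would be fine-tuned checkpoints):
    experts = {
        "reasoning": lambda p: f"<think>step-by-step</think> answer to {p!r}",
        "agent":     lambda p: f"<tool_call>search({p!r})</tool_call>",
        "chat":      lambda p: f"Sure! Here's a friendly reply to {p!r}.",
    }
    prompts = {"reasoning": ["2+2?"], "agent": ["weather in Paris"], "chat": ["hi"]}
    sft_data = build_distillation_set(experts, prompts)
    # unified_model.finetune(sft_data)  # standard cross-entropy SFT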
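And on the JSON-escaping point, a quick toy illustration of why tag-delimited arguments help when an argument is itself code. This is my own example, not the paper's actual template from Fig 4:

    import json

    # A tool argument that is itself code: quotes, backslashes, newlines.
    code_arg = 'print("path: C:\\temp\\new")\nprint("done")'

    # JSON function call: the code gets escaped, and the model must emit
    # every \" and \\ correctly or the whole call fails to parse.
    json_call = json.dumps({"name": "run_python", "arguments": {"code": code_arg}})
    print(json_call)
    # {"name": "run_python", "arguments":
    #  {"code": "print(\"path: C:\\temp\\new\")\nprint(\"done\")"}}

    # XML-like template: the code is carried verbatim between tags, so the
    # model writes it exactly as it would appear in a file.
    xml_call = (
        "<tool_call>run_python\n"
        "<arg_key>code</arg_key>\n"
        f"<arg_value>{code_arg}</arg_value>\n"
        "</tool_call>"
    )
    print(xml_call)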
▲ | algo_trader 5 days ago | parent | next [-]
Are all these "post/mid-training tweaks" important if you have a specific domain with abundant/verified/synthetic data and labels? Can a small team working on domain-specific ASI stick to scaling a 2024-era best-practices training stack? Or will they miss massive improvements?
▲ | calmoo 5 days ago | parent | prev [-]
I don't want to call you out unnecessarily, but your writing heavily smells of LLMs.

edit: looks like I'm not the first person to notice this about this poster, either: https://news.ycombinator.com/item?id=44279662 I think we have a duty to call this out, before the web becomes ridden with slop.