miki123211 5 days ago

Most of the improvements in this model (basically everything except the longer context, image understanding, and better pricing) are things that reinforcement learning without human feedback should be good at.

Getting better at code is something you can verify automatically; the same goes for diff formats and custom response formats. Instruction following is also either automatically verifiable or can be checked with an LLM as a judge.
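To illustrate why format compliance is so RL-friendly: the reward can be computed purely programmatically, no human labels needed. A toy sketch (the `format_reward` function and required keys are hypothetical, not from any real training pipeline):

```python
import json

def format_reward(output: str) -> float:
    """Toy verifiable reward: 1.0 if the model's output is valid JSON
    containing the required keys, else 0.0. Fully automatic to score."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    required = {"answer", "reasoning"}  # hypothetical schema
    return 1.0 if required.issubset(parsed) else 0.0
```

The same idea extends to diffs (does the patch apply cleanly?) and code (do the tests pass?), which is exactly the class of signal RLVR exploits.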

I strongly suspect that this model is a GPT-4.5 (or GPT-5?) distill, with the traditional pretrain -> SFT -> RLHF pipeline augmented with an RLVR stage, as described by Lambert et al. [1], plus a bunch of boring technical infrastructure improvements sprinkled on top.

[1] https://arxiv.org/abs/2411.15124

clbrmbr 5 days ago | parent [-]

If so, the loss of fidelity versus 4.5 is really noticeable and hurts numerous applications (finding a vegan restaurant in a random city neighborhood, for example).

weird-eye-issue 5 days ago | parent [-]

In your example the LLM should not be responsible for that directly. It should call out to an API or search results to get accurate and (relatively) up-to-date information, then use that context to generate a response.
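The flow being described is the standard retrieval/tool-calling pattern: fetch fresh results first, then have the model answer only from that injected context. A minimal sketch, assuming hypothetical `llm` and `search_api` callables (no real provider API is implied):

```python
def answer_with_search(query: str, llm, search_api) -> str:
    """Hypothetical retrieval-augmented flow: instead of answering from
    parametric memory, fresh search results are fetched and injected
    into the prompt, and the model is asked to ground its answer in them."""
    results = search_api(query)            # e.g. top web or maps hits
    context = "\n".join(results[:5])       # keep the prompt small
    prompt = (
        "Using only the sources below, recommend an option.\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm(prompt)
```

The design point is that freshness and accuracy come from the retrieval step, while the model only does synthesis, which is what the parent comment is arguing for.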

clbrmbr 5 days ago | parent [-]

You should actually try it. The really big models (4 and 4.5, sadly not 4o) have a truly breathtaking ability to dig up hidden gems with a really low profile on the internet. The recommendations also seem to cut through all the SEO and review manipulation and deliver genuine quality. It really all can be in one massive model.