miki123211 5 days ago:
Most of the improvements in this model (basically everything except the longer context, image understanding, and better pricing) are things that reinforcement learning without human feedback should be good at. Getting better at code is something you can verify automatically, and the same goes for diff formats and custom response formats. Instruction following is also either automatically verifiable or can be checked with an LLM as a judge.

I strongly suspect this model is a GPT-4.5 (or GPT-5???) distill: the traditional pretrain -> SFT -> RLHF pipeline augmented with an RLVR stage, as described in Lambert et al. [1], plus a bunch of boring technical infrastructure improvements sprinkled on top.
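(To make the "automatically verifiable" point concrete, here is a minimal sketch of what a verifiable reward for code generation could look like in an RLVR setup. This is purely illustrative; `code_reward` and the test-harness shape are my own invention under these assumptions, not anyone's actual training stack.)

```python
# Sketch of a "verifiable reward" in the RLVR sense: the reward comes from
# an automatic check, not a learned human-preference model. All names here
# (code_reward, the harness layout) are hypothetical.
import subprocess
import sys
import tempfile


def code_reward(completion: str, test_code: str, timeout: float = 5.0) -> float:
    """Return 1.0 if the model's code passes the given tests, else 0.0."""
    # Append the tests to the model's completion and run them as one script.
    program = completion + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,
        )
        # Exit code 0 means every assert passed: binary, deterministic reward.
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0


# Example: a completion that should implement add(a, b).
completion = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(code_reward(completion, tests))  # 1.0
```

The point is that the reward is a deterministic check (did the tests pass?) rather than a learned preference model, which is why RL scales so cheaply on code, diff formats, and other machine-checkable tasks.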
clbrmbr 5 days ago (reply):
If so, the loss of fidelity versus 4.5 is really noticeable and hurts numerous applications. (Finding a vegan restaurant in a random city neighborhood, for example.)