jpcompartir 3 days ago

You seem to be responding to a strawman and assuming I think something I don't.

As of today, 'bad' generations early in the sequence still do tend towards responses that are distant from the ideal response. This is testable/verifiable by pre-filling responses, which I'd advise you to experiment with yourself.
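
For concreteness, here is a minimal sketch of that prefill experiment, assuming a Hugging Face chat model; the model name, prompt, and 'bad' opening are placeholder choices of mine, not anything specific from the thread:

```python
# Minimal prefill sketch: force a 'bad' opening on the assistant turn and
# watch whether the continuation recovers or follows it. Model name,
# prompt, and prefix are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any chat model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "What is 17 * 24?"
bad_prefix = "The answer is clearly 300 because"  # deliberately wrong opening

# Render the chat template up to the assistant turn, then append the
# forced prefix so generation continues from the bad opening rather
# than starting a fresh assistant turn.
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text + bad_prefix, return_tensors="pt", add_special_tokens=False)

out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

If the early tokens really do steer the whole trajectory, the continuation should rationalise the wrong opening rather than back out of it.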

'Bad' generations early in the output sequence can be mitigated somewhat by injecting self-reflection tokens like 'wait', or with more sophisticated test-time compute techniques. However, those remedies can simultaneously turn 'good' generations into bad ones: they are post-hoc heuristics that treat symptoms, not causes.
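
A rough sketch of the 'wait' injection idea, continuing the setup above; the wrapper function and the reflection string are my own illustrative choices, not a reference implementation:

```python
import torch

def generate_with_wait(model, tokenizer, input_ids, max_rounds=2, max_new_tokens=256):
    """Naive 'wait' injection: each time the model stops, strip the end
    token, splice in a reflection cue, and let it continue the same turn."""
    wait_ids = tokenizer("\nWait,", return_tensors="pt",
                         add_special_tokens=False).input_ids
    ids = input_ids
    for round_ in range(max_rounds + 1):
        ids = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
        if round_ == max_rounds:
            break
        if ids[0, -1].item() == tokenizer.eos_token_id:
            ids = ids[:, :-1]  # drop EOS so the next round continues the turn
        ids = torch.cat([ids, wait_ids], dim=-1)
    return ids
```

Note that this illustrates exactly the post-hoc nature of the fix: the injection fires whether or not the generation actually went wrong, which is why it can also degrade 'good' generations.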

In general, as models become larger, they are able to compress more of their training data. So yes, using the terminology of the commenter I was responding to, larger models should tend to have fewer 'compression artefacts' than smaller models.

ACCount37 3 days ago | parent

With better reasoning training, the models mitigate more and more of that entirely by themselves. They "diverge into a ditch" less, and "converge towards the right answer" more. They are able to use more and more test-time compute effectively. They bring their own supply of "wait".

OpenAI's in-house reasoning training is probably best in class, but even lesser, naive implementations go a long way.