lordofgibbons 6 days ago

> but we have reached the limit of AI scaling with conventional methods

We've only just started RL training LLMs. So far, RL has used no more than 10-20% of the existing pre-training compute budget. There's a lot of scaling left in RL training yet.

am17an 6 days ago | parent | next [-]

Isn't this factually wrong? Grok-4 used as much compute on RL as they did on pre-training. I'm sure GPT-5 was the same (or even more)

sigmoid10 6 days ago | parent [-]

It was true for models up to o3, but there isn't enough public info to say much about GPT-5. Grok 4 seems to be the first major model that scaled RL compute 10x to near pre-training effort.

scellus 6 days ago | parent | prev | next [-]

Even with pretraining, there's no hard limit or wall in raw performance, just diminishing returns for the current applications, plus a business rationale to serve lighter models given the current infrastructure and pricing. Algorithmic efficiency of inference at a given performance level has also advanced a couple of OOMs since 2022 (for sure a major part of that is model architecture and training methods).

And it seems research is bottlenecked by computation.

alcinos 6 days ago | parent | prev | next [-]

> We've just only started RL training LLMs

That's just factually wrong. Even the original chatGPT model (based on gpt3.5, released in 2022) was trained with RL (specifically RLHF).

prasoon2211 6 days ago | parent | next [-]

RLHF is not the "RL" the parent is posting about. RLHF is specifically human-driven reward (subjective, doesn't scale, doesn't improve the model's "intelligence", just tweaks behavior), which is why the labs have started calling it post-training rather than RLHF.

True RL is where you set up an environment in which an agent can "discover" solutions to problems by iterating against some kind of verifiable reward, AND the entire space of outcomes is theoretically largely explorable by the agent. Math and coding have proven amenable to this type of RL so far.
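The "verifiable reward" idea above can be sketched in a few lines. This is a toy illustration, not any lab's actual pipeline: all names (`verifiable_reward`, `solve`) are hypothetical, and the policy model plus the gradient update (PPO/GRPO etc.) that would consume this reward are omitted. The point is just that the reward comes from mechanically checking the candidate against tests, with no human in the loop:

```python
def verifiable_reward(candidate_src: str, tests: list[tuple[int, int]]) -> float:
    """Reward = fraction of unit tests the candidate's solve() passes."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # run the sampled "solution" code
        f = namespace["solve"]
    except Exception:
        return 0.0  # unrunnable or missing function -> zero reward
    passed = 0
    for x, expected in tests:
        try:
            if f(x) == expected:
                passed += 1
        except Exception:
            pass  # runtime error on this test case counts as a failure
    return passed / len(tests)

# Toy environment: the task is "square a number"; two sampled candidates.
tests = [(2, 4), (3, 9), (-1, 1)]
good = "def solve(x):\n    return x * x\n"
bad = "def solve(x):\n    return x + x\n"

print(verifiable_reward(good, tests))  # 1.0 (all tests pass)
print(verifiable_reward(bad, tests))   # only solve(2) == 4 passes
```

Because the reward is computed, not judged, you can scale it with compute: sample more rollouts, grade them all automatically, and reinforce the high-reward ones.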

manscrober 6 days ago | parent | prev | next [-]

a) 2022 is not that long ago. b) This was an important first step toward usable AI, but not a scalable one. I'd say "RL training" is not the same as RLHF.

bigyabai 6 days ago | parent | prev [-]

The original ChatGPT was like 3 years after the first usable transformer models.

whimsicalism 6 days ago | parent | prev [-]

It is still an open question whether RL will scale (at least as easily) the same way as pretraining, or whether it is mainly effective at elicitation.