brcmthrowaway 4 hours ago

How was reinforcement learning used, and why was it such a game changer?

What happens to an LLM without reinforcement learning?

libraryofbabel 4 hours ago | parent | next [-]

The essence of it is that after the "read the whole internet and predict the next token" pre-training step (and the chat fine-tuning), SotA LLMs now have a training step where they solve huge numbers of tasks that have verifiable answers (especially programming and math). The model gets its very broad general knowledge and natural-language abilities from pre-training, and it gets good at solving actual problems (problems that can't be bullshitted or hallucinated through, because they have some verifiable right answer) from the RL step.

In ways that still aren't really understood, it develops internal models of mathematics and coding that allow it to generalize to problems it hasn't seen before. That is why LLMs got so much better at coding in 2025; the success of tools like Claude Code (to pick just one example) is built on it. Of course, LLMs still have a lot of limitations (the internal models are imperfect and nothing like how humans think), but RL has taken us pretty far.
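
To make the "verifiable answers" idea concrete, here is a toy sketch of the reward signal in RL from verifiable rewards. This is not any lab's actual pipeline; the task, the answer format, and the reward shaping are invented for illustration. The point is just that reward comes from checking the answer, not from imitating reference text:

  # Toy sketch of a verifiable reward (hypothetical names throughout).
  import re

  def extract_final_answer(completion: str):
      # Assume the model was prompted to finish with "Answer: <integer>".
      m = re.search(r"Answer:\s*(-?\d+)", completion)
      return m.group(1) if m else None

  def verifiable_reward(completion: str, ground_truth: str) -> float:
      # 1.0 only if the checkable answer is right; fluent-but-wrong text gets 0.0,
      # which is what makes this signal hard to bullshit through.
      return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

  task = {"prompt": "What is 17 * 24? End with 'Answer: N'.", "ground_truth": "408"}

  # In a real pipeline the policy LLM samples these; hard-coded here.
  completions = [
      "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68. Answer: 408",
      "17 * 24 is roughly 400. Answer: 400",
  ]
  print([verifiable_reward(c, task["ground_truth"]) for c in completions])  # [1.0, 0.0]

A policy-gradient method (PPO- or GRPO-style) then pushes up the probability of the completions that scored 1.0 and down the rest; repeated over huge numbers of such tasks, that is the RL step described above.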

Unfortunately, the really interesting details of this are mostly secret-sauce stuff locked up inside the big AI labs. But there are people who know far more than I do who post about it publicly, e.g. Andrej Karpathy discusses RL a bit in his 2025 LLMs Year in Review: https://karpathy.bearblog.dev/year-in-review-2025/

brcmthrowaway 2 hours ago | parent [-]

Do you have an answer to the second question? Is an LLM trained only on internet text, without the RL step, basically just GPT-3?

libraryofbabel 2 hours ago | parent [-]

I don't know - perhaps someone who's more of an expert or who's worked a lot with open source models that haven't been RL-ed can weigh in here!

But certainly without the RL step, the LLM would be much worse at coding and would hallucinate more.

malaya_zemlya 2 hours ago | parent | prev [-]

You can download a base model (aka foundation, aka pretrain-only) from huggingface and test it out. These were produced without any RL.

However, most modern LLMs, even base models, are not trained on raw internet text alone. Most of them were also fed a huge amount of synthetic data; you can often see the exact details in their model cards. As a result, if you sample from them, you will notice that they love to output text that looks like:

  6. **You will win millions playing bingo.**
     - **Sentiment Classification: Positive**
     - **Reasoning:** This statement is positive as it suggests a highly favorable outcome for the person playing bingo.
This is not your typical internet page.
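
For anyone who wants to test this out, here is a minimal sketch using the Hugging Face transformers library. The model name is just one example of a pretrain-only checkpoint; swap in whichever base model you download:

  # Sample a raw continuation from a base (pretrain-only) model.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "Qwen/Qwen2.5-0.5B"  # example base checkpoint: no instruction tuning or RL
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)

  prompt = "Classify the sentiment of the following statement:"
  inputs = tokenizer(prompt, return_tensors="pt")
  out = model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.8)
  print(tokenizer.decode(out[0], skip_special_tokens=True))

A base model just continues the text rather than following the instruction, but as noted above it may still produce assistant-flavored formatting if its pretraining mix included synthetic data.
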
octoberfranklin an hour ago | parent [-]

> You can often see the exact details in their model cards.

Bwahahahaaha. Lol.

/me falls off of chair laughing

Come on, I've never found "exact details" about anything in a model card, except maybe the number of weights.