peter_d_sherman 18 hours ago

>"Stage two was RLHF training on general chat prompts using a reward model to improve helpfulness. This worked. AlpacaEval scores jumped around 18.9 points on average compared to the fine-tuned checkpoints.

>Then something broke. The RLHF stage, while improving chat quality, caused math benchmark scores to drop. GSM8K and DeepMind-Math both regressed."

Observation: Math (which, when fully decomposed, reduces to logic) is at the core of how traditional, non-LLM computers and programming languages work. If an LLM gets math training wrong at any stage, for any reason, then in my opinion that should be treated as something to fix at a lower level, not a higher one; not at a later training stage...

I think it would be an interesting exercise to train an LLM that deals only in simple math, simple English, and the ability to compute simple equations (+, -, x, /)... like, what's the absolute minimum, in terms of text and layers, necessary to train a model like that?
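
Something like this toy generator is roughly what I have in mind for the corpus; everything here (names, phrasing, number ranges) is just my guess at what the smallest useful training text might look like:

```python
# Hypothetical sketch of a minimal "text-to-math" corpus generator.
import random

OPS = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
    "divided by": lambda a, b: a // b,  # integer-only to keep answers simple
}

def make_example(max_n: int = 99) -> str:
    """Return one training line, e.g. 'What is 7 plus 5? 12'."""
    op_name, op = random.choice(list(OPS.items()))
    a, b = random.randint(0, max_n), random.randint(1, max_n)
    if op_name == "divided by":
        a = (a // b) * b  # force exact division so every answer is whole
    return f"What is {a} {op_name} {b}? {op(a, b)}"

if __name__ == "__main__":
    random.seed(0)
    for _ in range(5):
        print(make_example())
```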

I think some interesting understandings could potentially be had from experimentation like that...

I myself would love a pure (simplest, smallest possible) Text-to-Math-only LLM (TTMLLM? TTMSLM?), along with all of the necessary corpora (which would ideally be as small as possible) and the instructions needed to train such an LLM...
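
To make the "absolute minimum in terms of text and layers" question concrete, here's a back-of-the-envelope PyTorch sketch of how tiny such a model could start out. The class name, vocabulary, and every hyperparameter are guesses on my part, not a claim about what would actually work:

```python
# A character-level causal transformer over a tiny closed vocabulary
# (digits, operators, '=', '?', space). Hyperparameters are guesses.
import torch
import torch.nn as nn

VOCAB = "0123456789+-x/=? "
stoi = {ch: i for i, ch in enumerate(VOCAB)}

class TinyMathLM(nn.Module):
    def __init__(self, d_model=64, n_heads=2, n_layers=2, max_len=32):
        super().__init__()
        self.tok = nn.Embedding(len(VOCAB), d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, len(VOCAB))

    def forward(self, idx):
        b, t = idx.shape
        x = self.tok(idx) + self.pos(torch.arange(t, device=idx.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(idx.device)
        x = self.blocks(x, mask=mask)  # causal mask -> next-character prediction
        return self.head(x)

if __name__ == "__main__":
    model = TinyMathLM()
    print("parameters:", sum(p.numel() for p in model.parameters()))  # ~10^5, not 10^9
    ids = torch.tensor([[stoi[c] for c in "12+34="]])
    print(model(ids).shape)  # (1, 6, len(VOCAB))
```

Even a sketch like this makes the question measurable: how few layers, how small an embedding, and how few lines of training text before it stops getting (+, -, x, /) right?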