| ▲ | Retr0id 3 hours ago | |
> RLVR is weirder, and I suspect it's why we see "It's not X, it's Y" so often. This feels like an easy enough hypothesis to verify, for anyone in the business of training LLMs - does the not-X-but-Y rate increase after RLVR? | ||
| ▲ | andy99 3 hours ago | |
It’s unlikely that RLVR is the cause. LLMs are way more mad-libs / templates than we like to admit. That’s (ironically) not a judgement about their capability, it’s primarily just an observation. And that template-following is also what plain old SFT, which I believe is the primary culprit, ends up imparting. | ||