| ▲ | miki123211 2 hours ago |
This is one area where reinforcement learning can help. The way to think of RL (both RLVR and RLHF) is through the "elicitation hypothesis"[1]. In pretraining, models acquire their capabilities by consuming large amounts of web text. Those capabilities include producing both low- and high-quality outputs, since both are present in the pretraining corpora. In post-training, RL doesn't teach them new skills (see, e.g., the "Limits of RLVR" paper[2]). Instead, it "teaches" the models to produce the more desirable, higher-quality outputs while suppressing the undesirable, low-quality ones.

I'm pretty sure you could design an RL task that specifically teaches models to use modern idioms, either as an explicit dataset of chosen/rejected completions (where the chosen completion uses the new way and the rejected one the old), or as a verifiable task where the reward goes down as the number of linter errors goes up. I wouldn't be surprised if frontier labs already have datasets like this for some of the major languages and packages.

[1] https://www.interconnects.ai/p/elicitation-theory-of-post-tr...
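To make the second option concrete, here is a minimal sketch of such a linter-based verifiable reward, assuming Python as the target language and pylint as the linter; the function name, the linear scoring curve, and the choice of linter are illustrative assumptions, not anything a frontier lab has confirmed using:

    # Rough sketch of the "verifiable task" variant: write the model's
    # completion to a file, run a linter over it, and turn the finding
    # count into a reward an RLVR-style loop could maximize.
    import json
    import os
    import subprocess
    import tempfile

    def linter_reward(completion: str, max_findings: int = 10) -> float:
        """Reward in [0, 1]: a clean file scores 1.0, dropping as findings accumulate."""
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "snippet.py")
            with open(path, "w", encoding="utf-8") as f:
                f.write(completion)
            # pylint's JSON output is a list with one entry per diagnostic.
            result = subprocess.run(
                ["pylint", "--output-format=json", path],
                capture_output=True, text=True,
            )
            findings = json.loads(result.stdout or "[]")
        # Linear penalty; a real setup might weight idiom- or deprecation-specific
        # rules more heavily than style nits.
        return max(0.0, 1.0 - len(findings) / max_findings)

The same shape works for any language with a machine-readable linter (golangci-lint for Go, clippy for Rust, etc.); the interesting design question is which rules you let drive the reward.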
| ▲ | munk-a 2 hours ago |
I believe you absolutely could... as the model owner. The question is whether Go project owners can convince all the model trainers to invest in RL to fix their models, and the follow-up question is whether the single maintainer of some critical but obscure open-source project could do the same once they realize a model is badly mistrained on their project.

On Stack Overflow, data is trivial to edit, and the org (previously, at least) was open to requests from maintainers to update accepted answers with more correct information. Editing a database is trivial and cheap; editing a model is possible (less easy, but doable), expensive, and a potential risk to the model owner.