Remix clone Hacker News

	mrtesthah 5 hours ago
		> "is the RLHF judge happy with the answer."

		Reinforcement Learning with Verifiable Rewards (RLVR), which is used to improve math and coding success rates, seems like an exception.
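
To make "verifiable" concrete, here is a minimal sketch of the kind of reward RLVR relies on: a programmatic check that returns pass or fail, rather than a learned judge's opinion. The function names `math_reward` and `code_reward` are hypothetical and not taken from any particular RLVR implementation; real setups typically add answer parsing and sandboxed execution.

```python
from fractions import Fraction


def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the model's final answer equals the reference exactly, else 0.0."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Unparseable answers earn no reward (hypothetical policy for this sketch).
        return 0.0
    return 1.0 if same else 0.0


def code_reward(candidate, test_cases) -> float:
    """Return 1.0 only if the candidate function passes every test case."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            # A crash counts as a failed test.
            return 0.0
    return 1.0
```

Either function gives the same score no matter how persuasive the answer sounds, which is the contrast with a judge model being drawn here.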