Remix clone Hacker News

new | show | ask | jobs Github

	▲	sgsjchs an hour ago
		The trick is to provide dense rewards, i.e. not only once full goal is reached, but a little bit for every random flailing of the agent in the approximately correct direction.
	▲	Jaxan 10 minutes ago \| parent \| next [-]
		How do you know the correct direction? Isn’t the point of learning that the right path is unknown to start with?
	▲	thegeomaster 35 minutes ago \| parent \| prev [-]
		Article talks about all of this and references DeepSeek R1 paper[0], section 4.2 (first bullet point on PRM) on why this is much trickier to do than it appears. [0]: https://arxiv.org/abs/2501.12948