tjungblut 4 days ago

If you are curious, like me, about how the actual reinforcement learning happens: it uses verl [1] underneath. The paper "HybridFlow: A Flexible and Efficient RLHF Framework" [2] explains it really well.

[1] https://github.com/volcengine/verl

[2] https://arxiv.org/abs/2409.19256v2