If you are curious, like me, about how the actual reinforcement learning happens: it uses verl [1] underneath. The paper "HybridFlow: A Flexible and Efficient RLHF Framework" [2] explains it really well.
[1] https://github.com/volcengine/verl
[2] https://arxiv.org/abs/2409.19256v2