Because they are doing it to compute quality metrics, not to implement RLHF. It’s not training data.
Every decision they take based on evals influences the model.
/"directly"/