Because they are doing it to compute quality metrics not to implement RLHF. It’s not training data.

Every decision they take based on evals influences the model.

	▲	creddit 3 days ago \| parent [-]
		/"directly"/