kimjune01 5 hours ago

Although Arena is adversarial and resistant to Goodharting, it's not immune. Models that train on Arena converge on helpfulness, not necessarily truthfulness.