Majromax 3 days ago

What you propose is a harder AI safety scenario.

You don't need a 'vastly more competent AI overseeing its own training' to elicit this potential problem, just a malicious AI researcher looking for (e.g.) a model that's racist but has no interpretable activation patterns that identifiably correspond to racism.

The work in this Show HN suggests that this kind of adversarial training might just barely be possible for a sufficiently funded individual, and it seems like novel results would be very interesting.