Remix.run Logo
cj 5 hours ago

One opinion you can form in under an hour is... why are they using GPT-4o to rate the bias of new models?

> assess harmful stereotypes by grading differences in how a model responds

> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings

Are we seriously using old models to rate new models?

hex4def6 5 hours ago | parent | next [-]

If you're benchmarking something, old & well-characterized / understood often beats new & un-characterized.

Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.

titanomachy 5 hours ago | parent | prev [-]

Why not? If they’ve shown that 4o is calibrated to human responses, and they haven’t shown that yet for 5.4…