glerk 6 hours ago

This looks really great, more thoughtful than any benchmark that I've seen until now!

I'm curious whether you're only interested in scoring frontier models, or whether you would also accept submissions from custom harnesses. I am working on multi-model harnesses and would love to test them against your benchmark. Do you plan on releasing the tasks publicly?

swyx 44 minutes ago

> Do you plan on releasing the tasks publicly?

yep