Retr0id 6 hours ago

> As proof, ABP with opus 4.6 as the driver scores 90.5% on the Online Mind2Web benchmark

And what does Opus score with "regular" browser harnesses?

9wzYQbTYsAIc 5 hours ago

90% easy or 90% average?

theredsix 4 hours ago

90% average, with 85.51% on hard!

9wzYQbTYsAIc 4 hours ago

Nice! Will take a look at this for my homelab. I was debating using crawl.cloudflare.com to try it out, as browser rendering was my next stretch goal.

esafak 5 hours ago

https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderb...

Retr0id 5 hours ago

Hm, I can't see Opus 4.6 on there.

theredsix 4 hours ago

I tweeted at OSUNLP, and they're backed up on eval validation. In the meantime, here's the benchmark repo with the saved runs and instructions on how to run it locally: https://github.com/theredsix/abp-online-mind2web-results