| ▲ | waNpyt-menrew 3 hours ago | |
Larger model, better benchmarks. Bigger bomb, more yield. Are there any benchmarks where we constrain something like thinking time or power use? Even if this were released, there's no way to know if it’s the same quant. | ||
| ▲ | omcnoe an hour ago | |
Yes, e.g. the BrowseComp benchmark on page 192. Mythos preview has higher accuracy with fewer tokens used than any previous Claude model. Though the fact that this incredibly strong result was only presented for BrowseComp (a kind of weird benchmark about searching for hard-to-find information on the internet) and not for the other benchmarks suggests that it likely doesn't hold for them. | ||