nopinsight | 3 days ago
Research by METR suggests that frontier LLMs can complete software tasks of exponentially increasing length (measured by how long the tasks take human engineers), with the horizon doubling roughly every 7 months. o3 is above the trend line. https://x.com/METR_Evals/status/1912594122176958939

The AlexNet paper, which kickstarted the deep learning era in 2012, was ahead of the second-best entry by about 11 percentage points; many published AI papers at the time advanced SOTA by just a couple of points. o3 high is about 9 points ahead of o1 high on livebench.ai, and there are also quite a few testimonials about the difference. Yes, AlexNet made major strides in other respects as well, but it has been just 7 months since o1-preview, the first publicly available reasoning model and a seminal advance beyond previous LLMs.

It seems some people have become desensitized to how rapidly things are moving in AI, despite its largely unprecedented pace of progress.

Ref: https://proceedings.neurips.cc/paper_files/paper/2012/file/c...
kadushka | 3 days ago | parent
ImageNet: AlexNet improved the error rate by 100*11/25 = 44%. From o1 to o3, the error rate went from 28 to 19, so 100*9/28 ≈ 32%. But these are fairly meaningless comparisons, because it's typically harder to improve on already good results.
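The arithmetic in the comment above (relative error reduction, i.e. what fraction of the old error rate the improvement eliminates) can be sketched as follows; the function name is hypothetical and the error rates are the approximate figures cited in the thread, not independently verified numbers.

```python
def relative_error_reduction(old_err: float, new_err: float) -> float:
    """Percentage of the old error rate eliminated by the improvement."""
    return 100 * (old_err - new_err) / old_err

# AlexNet vs. the second-best ImageNet entry: ~25% top-5 error
# down by an ~11-point lead to ~14% (figures as used in the comment)
alexnet_gain = relative_error_reduction(25, 14)

# o1 -> o3 error rates on livebench.ai as cited in the comment
o3_gain = relative_error_reduction(28, 19)

print(round(alexnet_gain), round(o3_gain))  # 44 32
```

This makes the commenter's point concrete: a 9-point gain from a 28% base is a smaller *relative* reduction than an 11-point gain from a 25% base, though neither number says much on its own.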