consumer451 | 5 days ago
It's related to the history of Simon Willison[0] having used this as a benchmark on many models.[1] I believe this model's output is noticeably superior... but yeah, people do tend to get hyperbolic when new stuff happens in their domain of interest.

[0] https://news.ycombinator.com/user?id=simonw

[1] https://www.google.com/search?q=simon+willison+pelican+ridin...
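If you want to try it yourself, the benchmark is basically a single prompt. A minimal sketch using the OpenAI Python client (the model name and output path are placeholders, and this isn't Simon's actual harness):

    from openai import OpenAI

    # Assumes OPENAI_API_KEY is set in the environment; "gpt-4o" is a placeholder model name.
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user",
             "content": "Generate an SVG of a pelican riding a bicycle"},
        ],
    )

    # Note: many models wrap the SVG in a markdown code fence, so the output may need stripping.
    svg = response.choices[0].message.content
    with open("pelican.svg", "w") as f:
        f.write(svg)  # open the file in a browser to judge the result

The prompt is the whole test; the comparison across models is just eyeballing which output looks most like a pelican riding a bicycle.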
littlestymaar | 5 days ago
> I believe this model's output is noticeably superior

Sure, but at the same time Qwen3-30B-A3-2507 is also doing much better than most older models, even the bigger and otherwise more capable ones, so I don't know how much of this is actual progress and how much is a new round of benchmaxxing.
ruszki | 4 days ago
And it's a much better-known benchmark nowadays, so data scientists can overfit their models to it even more, especially since LLMs are already famous for overfitting. So I wouldn't trust any results on this specific test anymore.