Remix clone Hacker News

	▲	andai 7 hours ago
		The only benchmark that matters is your actual task. I've had models that benched poorly but performed great, and I constantly see models near the top of AA that are terrible. There doesn't seem to be much overlap between benchmarks and real-world usage. (Let alone common sense!) For what they're worth, though, these harder benchmarks match my experience more closely: https://deepswe.datacurve.ai/ and https://cognition.ai/blog/frontier-code There, "top" models drop way down in score when given longer tasks. That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.) By the time I'm done testing all the Chinese models, they'll be obsolete :)