mapontosevenths 4 hours ago

There's a guy on YouTube named Bijan Bowen who tests all the models (open and frontier) on a series of one-shot and few-shot programming exercises, and has been doing so for a long while now. You can pretty much watch him compare the results for any two models you're likely to be interested in.

I'm not affiliated; I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me, and his examples pretty closely match the results I see in real life.

lambda 4 hours ago

OK, it looks like he did a browser OS test with both Claude 4 Opus and Qwen 3.6 35B-A3B.

Claude 4 Opus: https://youtu.be/J7omabtqnBM?t=193

Qwen 3.6 35B-A3B: https://youtu.be/gVU-DQeqkI0?t=215

Qwen 3.6 produced far more working functionality than Claude 4 Opus did.

Obviously, this is just one test of a single one-shot prompt for a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.