Flaws in this test setup:
- A zero-shot prompt, run once (in total)
- No planning run (which improves output)
- Different coding harnesses & system prompts
- Unknown provider for GLM (there are 15 different GLM-5.2 providers with varying quality & latency)
- No documentation of thinking effort level
- No vision model supplement (you can provide a subagent with a vision model)
You can't take this comparison seriously. There were many uncontrolled variables, no control condition, and no repeat runs. It's about as useful a comparison as picking a random tweet that mentions both models' names.
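For what it's worth, here is a rough sketch of what a more controlled comparison could look like: pin everything except the model and provider, record both explicitly, and repeat each run several times so you can report a mean and a spread instead of a single sample. Everything named here is an assumption for illustration. `RunConfig`, `run_once`, and the placeholder model and provider names are hypothetical, and `run_once` returns a fake deterministic score rather than calling any real harness or API.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Everything that should be written down for each model under test."""
    model: str
    provider: str          # recorded explicitly, since providers vary
    harness: str
    system_prompt: str
    thinking_effort: str


def run_once(config: RunConfig, prompt: str, seed: int) -> float:
    """Hypothetical stand-in for one real run; returns a fake score in [0, 1).

    Replace this with a call into your own harness plus a scoring step.
    """
    rng = random.Random(f"{config.model}:{prompt}:{seed}")
    return rng.random()


def assert_controlled(configs: list[RunConfig]) -> None:
    """Fail if anything other than model/provider differs between configs."""
    shared = {(c.harness, c.system_prompt, c.thinking_effort) for c in configs}
    if len(shared) > 1:
        raise ValueError("harness, system prompt, or thinking effort differs")


def compare(configs: list[RunConfig], prompt: str, trials: int = 5):
    """Run each config `trials` times; return {model: (mean, stdev)}."""
    assert_controlled(configs)
    results = {}
    for config in configs:
        scores = [run_once(config, prompt, seed) for seed in range(trials)]
        results[config.model] = (statistics.mean(scores), statistics.stdev(scores))
    return results


if __name__ == "__main__":
    configs = [
        RunConfig("model-a", "provider-x", "same-harness", "same prompt", "high"),
        RunConfig("model-b", "provider-y", "same-harness", "same prompt", "high"),
    ]
    for model, (mean, spread) in compare(configs, "build the app").items():
        print(f"{model}: mean={mean:.2f} stdev={spread:.2f}")
```

The point is not the scoring, which is faked here, but the shape: the same harness, system prompt, and effort level for both models, the provider written down, and more than one trial per model.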