Flaws in this test setup:

  - A single zero-shot prompt, run once in total
  - No planning pass before coding (planning usually improves output)
  - Different coding harnesses & system prompts
  - Unknown provider for GLM (there are 15 different GLM-5.2 providers with varying quality & latency)
  - No documentation of thinking effort level
  - No vision model supplement (you can provide a subagent with a vision model)

You can't take this comparison seriously. Too many variables changed at once, with no control and no repeated runs. It's about as useful a comparison as picking a random tweet that mentions both models' names.
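
For what a more controlled setup could look like, here is a minimal sketch. It assumes every model is run through the same harness and system prompt, with the provider and thinking effort written down, and repeated several times so you can see the spread. All names here (`RunConfig`, `compare`, `fake_run`, the model and provider strings) are hypothetical, and `fake_run` just returns seeded noise in place of a real task run and scoring step.

```python
import random
import statistics
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RunConfig:
    # Everything held constant or documented across models (hypothetical fields).
    harness: str
    system_prompt: str
    provider_by_model: Dict[str, str]
    thinking_effort: str
    planning_pass: bool
    n_trials: int


def compare(models: List[str],
            run_once: Callable[[str, RunConfig, int], float],
            config: RunConfig) -> Dict[str, Dict[str, float]]:
    """Run every model n_trials times under the same config and summarize scores."""
    summary: Dict[str, Dict[str, float]] = {}
    for model in models:
        scores = [run_once(model, config, trial) for trial in range(config.n_trials)]
        summary[model] = {
            "mean": statistics.mean(scores),
            # stdev needs at least two samples; a single run has no measurable spread.
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "n": len(scores),
        }
    return summary


def fake_run(model: str, config: RunConfig, trial: int) -> float:
    # Placeholder for "run the task and score the output"; NOT real results.
    rng = random.Random(f"{model}-{trial}")
    return rng.random()


config = RunConfig(
    harness="same-harness",
    system_prompt="same system prompt",
    provider_by_model={"model-a": "provider-x", "model-b": "provider-y"},
    thinking_effort="high",
    planning_pass=True,
    n_trials=5,
)
results = compare(["model-a", "model-b"], fake_run, config)
for name, stats in results.items():
    print(name, stats)
```

With one run per model, the standard deviation is undefined, so there is no way to tell a real difference from noise. Even five runs under a pinned config would say more than a single zero-shot attempt.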