jkwang a day ago

GLM-5.2 is quietly becoming the most interesting open model release this year. The coding benchmarks are surprisingly close to frontier models at a fraction of the inference cost.

em500 a day ago | parent | next [-]

We had the great small Qwen 3.6 in early April, which many could actually run on their laptops. Then something similar from Google a few weeks later (Gemma4, better at prose, worse at code). Then the super cheap, large Deepseek V4 a few weeks after that. Then antirez's DS4 build, which made it actually runnable on MacBooks and Mac Studios. And now the "near-frontier / near-Opus" GLM 5.2.

For people who follow open LLMs, none of these were quiet, and each was the most interesting open model release for a few days or weeks. In a month or two, it will be some other model again. I do appreciate the real, rapid improvements in open models. But there's also a ton of hype and fast fashion around all of this.

CuriouslyC a day ago | parent [-]

The difference here is that those small models are impressive, but not super useful. Deepseek 4 is impressively cheap for the intelligence, but not reliable enough to daily drive unless your time has low value.

GLM passes a meaningful threshold of reliability/utility that puts it in a different category for real work. Just like Opus really took off after passing a threshold with 4.5. It's the first open model to do that.

kgeist 20 hours ago | parent | next [-]

Qwen3.6-27b is surprisingly good at tasks that involve modifying an existing repo by analogy with the existing code. For example, you have an existing CRUD app and want to add a new domain model and expose it via the API. Qwen3.6 analyzes how things are done in the project and usually makes it work flawlessly in one shot, and the code is more or less what you expected. Qwen3.6 only struggles with non-trivial code or when you bootstrap a project from scratch (due to the lack of world knowledge; it's a small model after all). But how often do you write non-trivial code or projects from scratch?

I once gave Sonnet 4.6 and Qwen 3.6 the same real-world task to compare: "extend the existing code with this new requirement". Qwen3.6-27b perfectly followed the existing conventions, while Sonnet 4.6 invented its own conventions that were rejected during CR by another dev (i.e. he basically chose Qwen3.6's output in a blind test). Qwen3.6-27b, run locally, also managed to finish faster on that task (mostly because Sonnet 4.6 made tool calling errors and removed some code by accident, so it spent additional time reverting its errors, and got somewhat confused in the process).

We already have production code running live that was written entirely by Qwen3.6-27b, although we plan to move to self-hosting GLM5.2 because it's more versatile.

hnfong a day ago | parent | prev [-]

Qwen models are super useful for those running locally.

And there are valid reasons to run locally, even if performance (quality and speed) isn't the best.

epolanski a day ago | parent | prev [-]

To me DS 4 is still the most interesting due to its much lower cost. Also, DS 4 training isn't done yet.

In my personal Opus vs DS 4 Pro benchmarks (16 different real-life work tasks), DS 4 performed as well as Opus 4.8 high overall, but with a few drawbacks:

- of the 16 tasks, one needed several prompts to steer it back on topic

- its review capabilities seem much worse

- DS4 had the cleanly better solution in 3 cases out of 16, with Opus "only" doing cleanly better 2 times out of 16. Still, I want to emphasize that it's the worst-case scenarios that imho matter the most, not the best ones, and on that front Opus outperformed.

That being said, I spent less than $2 on the API over 4 days of work, which is more or less what I would've spent with Anthropic's API on less than one task.