alex7o 6 hours ago

Ok, I find it funny that people compare models and say Opus 4.7 is SOTA and much better etc., but I have used GLM 5.1 (I assume this comes from them training on both Opus and Codex) for things Opus couldn't do, and I have seen it produce better code. I haven't tried the Qwen Max series, but I have seen the local 122B model do smarter, more correct things based on docs than Opus. So yes, benchmarks are one thing, but reality is what the models actually do, and you should learn the real strengths that each model possesses. It is a tool in the end; you shouldn't say a hammer is better than a wrench even though both would be able to drive a nail into a piece of wood.

jxmesth 4 hours ago | parent | next [-]

The only reason I'm stuck with Claude and ChatGPT is their tool calling. They also have some pretty useful features like Skills. I've tried using Qwen and DeepSeek, but they can't even output documents. How are you all handling documents and spreadsheets with these tools? I'd love to switch, tbh.

embedding-shape 4 hours ago | parent | next [-]

> I've tried using qwen and deepseek but they can't even output documents

What agent harness did you use? Usually, "write_file", "shell_exec" or similar are two of the first tools you add to an agent harness, after read_file/list_files. If it doesn't have those tools, I'm not sure you could even call it an agent harness in the first place.
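
To illustrate (the tool names here are illustrative; every harness picks its own), the core of such a harness is just a handful of functions plus a dispatch table the model's tool calls get routed through. A minimal Python sketch:

```python
import pathlib
import subprocess

# Minimal file/shell tools an agent harness typically exposes. The model
# emits a JSON tool call; the harness dispatches to one of these and
# feeds the returned string back into the conversation.

def write_file(path: str, content: str) -> str:
    pathlib.Path(path).write_text(content)
    return f"wrote {len(content)} chars to {path}"

def read_file(path: str) -> str:
    return pathlib.Path(path).read_text()

def list_files(path: str = ".") -> str:
    return "\n".join(sorted(p.name for p in pathlib.Path(path).iterdir()))

def shell_exec(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {
    "write_file": write_file,
    "read_file": read_file,
    "list_files": list_files,
    "shell_exec": shell_exec,
}

def dispatch(name: str, **kwargs) -> str:
    # The parsed tool call from the model ends up here.
    return TOOLS[name](**kwargs)
```

With write_file and shell_exec in place, "output a document" reduces to the model writing the file and optionally running a converter on it.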

jxmesth 4 hours ago | parent [-]

Sorry for the confusion, I was actually talking about their web-based chat. Since most of my work is governance and docs, I just use their web chats, and they just refuse to output proper documents like Claude or ChatGPT do.

embedding-shape 4 hours ago | parent | next [-]

Aha... Well, I let Codex (Claude Code would work too) manage and troubleshoot .xlsx files, and it seems to handle them just fine: it tends to un-archive them and browse the resulting XML files without issues. I've seen it do similar things with .app and .docx files too, so maybe give that a try with other harnesses/models as well, they might get it :)
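
For context, .xlsx and .docx files are just ZIP archives of XML parts, which is why shell or file access is all a model needs to "un-archive and browse" them. A quick sketch, building a toy xlsx-shaped archive in memory and listing its parts (the part names follow the standard OOXML layout):

```python
import io
import zipfile

# Build a minimal xlsx-like ZIP in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

# "Un-archive and browse" it, exactly as an agent with shell access would
# do with `unzip` on a real .xlsx file.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    parts = zf.namelist()

print(parts)
```

A real workbook has more parts (shared strings, styles, relationships), but the structure is the same, so reading or patching the XML directly is usually enough for simple edits.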

noduerme 2 hours ago | parent | prev [-]

You're not giving an AI command line access to your work computer? How do you expect to keep up? /s

dymk 2 hours ago | parent [-]

You give it command line access in a VM...

koen_hendriks 25 minutes ago | parent [-]

You mean a VM like the one that contains a 0day that can escape the sandbox that gets found every year at pwn2own?

ecocentrik 4 hours ago | parent | prev | next [-]

When was the last time you used Qwen models? Their 3.5 and 3.6 models are excellent with tool calling.

jxmesth 4 hours ago | parent [-]

I gave it a try a few weeks ago, tbh; I'll give it another shot, though. I mainly use their web chats since that's easier to use, and previously Qwen, DeepSeek, and Kimi were all unable to output proper .docx files or use skills.

ecocentrik 4 hours ago | parent [-]

Try loading the models up in a coding harness like Claude Code. There are a few docx skills listed on Vercel's skill index.

https://skills.sh/tfriedel/claude-office-skills/docx

sscaryterry 3 hours ago | parent | prev | next [-]

You can use GLM-5.1 with Claude Code directly. I use ccs, with GLM-5.1 set up as the plan model, but it goes via API key.

jwitthuhn 4 hours ago | parent | prev | next [-]

I've been using qwen-code (the software, not to be confused with Qwen Code the service or Qwen Coder the model), which is a fork of gemini-cli, and the tool use with Qwen models at least has been great.

estimator7292 2 hours ago | parent | prev [-]

You can use both codex and Claude CLI with local models. I used codex with Gemma4 and it did pretty well. I did get one weird session where the model got confused and couldn't decide which tools actually existed in its inventory, but usually it could use tools just fine.

ezekiel68 4 hours ago | parent | prev | next [-]

Qwen3-Coder produced much better Rust code (code that utilized Rust's x86-64 vector extensions) a few months ago than Claude Opus or Google Gemini could. I was calling it from harnesses such as the Zed editor and the Trae CLI.

I was very impressed.

justincormack 3 hours ago | parent [-]

Codex is pretty good at Rust with x86 and arm intrinsics too, it replaced a bunch of hand written C/assembly code I was using. I will try Qwen and Kimi on this kind of task too.

ternaryoperator 5 hours ago | parent | prev | next [-]

The models test roughly equal on benchmarks, with generally small differences in their scores. So, it’s reasonable to choose the model based on other criteria. In my case, I’d switch to any vendor that had a decent plugin for JetBrains.

sirnicolaz 3 hours ago | parent | prev | next [-]

Consider that SWE benchmarking is mainly done with Python code. That tells you something.

Moosdijk 5 hours ago | parent | prev | next [-]

I wonder why glm is viewed so positively.

Every time I try to build something with it, the output is worse than other models I use (Gemini, Claude), it takes longer to reach an answer and plenty of times it gets stuck in a loop.

pkulak 4 hours ago | parent | next [-]

I've been running Opus and GLM side by side for a couple weeks now, and I've been impressed with GLM. I will absolutely agree that it's slow, but if you let it cook, it can be really impressive and absolutely on the level of Opus. Keep in mind, I don't really use AI to build entire services; I'm mostly using it to make small changes or help me find bugs, so the slowness doesn't bother me. Maybe if I set it to make a whole web app and it took 2 days, that would be different.

The big kicker for GLM for me is I can use it in Pi, or whatever harness I like. Even if it was _slightly_ below Opus, and even though it's slower, I prefer it. Maybe Mythos will change everything, but who knows.

tasuki 3 hours ago | parent [-]

> The big kicker for GLM for me is I can use it in Pi, or whatever harness I like.

Yes, but... isn't the same true for Opus and all the other models too?

slopinthebag 3 hours ago | parent [-]

Opus is about 7 times more expensive than GLM with API pricing. And since you can only use the Opus subscription plan in CC, you're essentially locked into API pricing for Pi and any other harness.

So you're either paying thousands of dollars for Opus in Pi, or $30/month for GLM in Pi. If the results are mostly equivalent, that's an easy choice for most of us.

tasuki 2 hours ago | parent [-]

Perhaps I'm being extremely daft: If the API is 7 times more expensive, then why is it $1000 vs $30? Or is there a GLM subscription one can use with Pi? Certainly not available in my (arguably outdated) Pi.

RussianCow 2 hours ago | parent | next [-]

I'm not the OP, but it's the latter. I'm currently using the "Lite" GLM subscription with OpenCode, for example. I'm not using it very heavily, but I haven't come close to hitting the limits, whereas I burned through my weekly limits with Claude very regularly.

girvo an hour ago | parent | prev [-]

You can use GLM’s coding plan in Pi; just use the Anthropic-compatible API instead of the OpenAI-compatible one they give you.

probst an hour ago | parent [-]

Or tell Pi to add support for the coding plan directly. That gave me GLM-5.1 support in no time, along with support for showing the remaining quota, etc.

It also compresses the context at around 100k tokens.

In case anyone is interested: https://github.com/sebastian/pi-extensions/tree/main/.pi/ext...

Mashimo 4 hours ago | parent | prev | next [-]

I have used GLM 4.7, 5, and 5.1 for about 3 months now via the OpenCode harness, and I don't remember it ever getting stuck in a loop.

You have to keep it below ~100,000 tokens, though, or it gets funny in the head.

I only use it for hobby projects, mind you. I paid 3 EUR per month, but that plan is no longer available :( Not sure what I will choose at the end of the month. Maybe OpenCode Go.

spaceman_2020 2 hours ago | parent | prev | next [-]

I think it offers a very good tradeoff of cost vs competency

4.7 is better, but it's also wildly expensive.

Akira1364 4 hours ago | parent | prev | next [-]

IDK about GLM, but GPT 5.4 Extra High has been great when I've used it in the VS Code Copilot extension. I see no actual reason Opus should consume 3x more quota than it, the way it currently does.

slopinthebag 4 hours ago | parent | prev [-]

You're probably just holding it wrong.

cornedor 5 hours ago | parent | prev | next [-]

I tried GLM and Qwen last week for a day. Some issues they could solve, while some tasks that looked relatively easy on the surface they just could not solve after a few tries, yet Opus one-shotted them this morning with the same prompt. It's a single example of course, but I really wanted to give them a fair try. All they had to do was create a sortable list in the Magento admin. On the other hand, GLM did one-shot a PhpStorm plugin.

dev_l1x_be 4 hours ago | parent [-]

Do you use Opus through the API or with subscription? Did you use OpenCode or Code?

cornedor 3 hours ago | parent [-]

Opus through Claude Code, the Chinese models through OpenCode Go, which seems like a great package for testing them out.

odie5533 2 hours ago | parent | prev | next [-]

If you showed me code from GLM 5.1, Opus 4.6, and Kimi K2.6, my ranking for best model would be highly random.

dev_l1x_be 4 hours ago | parent | prev | next [-]

Benchmarking is grossly misleading. Claude's subscription with Code would not score this high on the benchmarks, because of how they lobotomized agentic coding.

solomatov 4 hours ago | parent | prev | next [-]

>but I have seen the local 122b model do smarter more correct things based on docs than opus

Could you please share more about this?

alex7o 38 minutes ago | parent [-]

Maybe that was a bit misleading. I have used it in two places.

One is for local OpenCode coding and config of stuff; the other is for agent browser use, and for both it did better than Opus 4.6 for the thing I was testing at the time. The problem with Opus, at the moment I tried it, was overthinking and sometimes moving itself in the wrong direction (not that Qwen doesn't overthink sometimes). However, sometimes less is more; maybe turning thinking down on Opus would have helped me. Some people have said it's better to turn it off entirely once you start to implement code, since the model already knows what it needs to do and doesn't need more distraction.

Another example is my Ghostty config: I learned from Qwen that it has theme support, whereas Opus would always just put the theme in the main file.

FlyingSnake 5 hours ago | parent | prev | next [-]

I tried GLM 5.1 last week after reading about it here. It was slow as molasses for routine tasks and I had to switch back to Claude. It also ran out of its 5-hour credit limit faster than Claude.

bensyverson 5 hours ago | parent | next [-]

If you view the "thinking" traces you can see why; it will go back and forth on potential solutions, writing full implementations in the thinking block then debating them, constantly circling back to points it raised earlier, and starting every other paragraph with "Actually…" or "But wait!"

nothinkjustai 5 hours ago | parent | next [-]

I see this with Opus too.

girvo an hour ago | parent [-]

Indeed. And that’s with Anthropic hiding reasoning traces, unlike the others in these comparisons.

FlyingSnake 5 hours ago | parent | prev [-]

> "Actually…" or "But wait!"

You’re absolutely right!

Jokes aside, I did notice GLM doing these back-and-forth loops.

tonyarkles 4 hours ago | parent [-]

I was watching Qwen3.6-35B-A3B (locally) doing the same dance yesterday. It eventually finished and had a reasonable answer, but it sure went back and forth on a bunch of things I had explicitly said not to do before coming to a conclusion. At least said conclusion was not any of the things I'd said not to do.

Lerc 3 hours ago | parent [-]

That is essentially what reasoning reinforcement training does: it gets the model to say things that are more likely to result in the correct final answer. Everything it does in between doesn't necessarily need to be a valid argument toward that answer. You can think of it as filling the context with whatever is needed to make the right answer come out next. Valid arguments obviously help, but so might statements of incorrect things that are not obviously untrue to the model until it sees them written out. The "What's the Magic Word" paper shows how far that could go. If the policy model managed to learn enough magic words, it would be theoretically possible to end up with an LLM that spouts utter gibberish until delivering the correct answer seemingly out of the blue.

tonyarkles 3 hours ago | parent [-]

That's pretty cool, thanks for the extra context! (pardon the... not even pun I guess)

Also, thanks for pointing me at that specific paper; I spend a lot more of my life closer to classical control theory than ML theory, so it's always neat to see the intersection of them. My unsubstantiated hypothesis is that controls & ML are going to start getting looked at more holistically, and not in the way I normally see it (which is "why worry about classical control theory, just solve the problem with RL"). Control theory is largely about steering dynamic systems along stable trajectories through state space... which is largely what iterative "fill in the next word" LLM models are doing. The intersection, I hope, will be interesting and add significant efficiency.

nothinkjustai 5 hours ago | parent | prev [-]

Z.ai’s cloud offering is poor, try it with a different provider.

OtomotO 6 hours ago | parent | prev [-]

Many people have turned away from religion (which I can get behind), but have never removed the dogmatic thinking that lay at its root.

As so many things these days: It's a cult.

I've used Claude for many months now. Since February I see a stark decline in the work I do with it.

I've also tried to use it for GPU programming, which it absolutely sucks at, with Sonnet, Opus 4.5, and 4.6.

But if you share that sentiment, it's always a "You're just holding it wrong" or "The next model will surely solve this"

For me it's just a tool, so I shrug.

balls187 6 hours ago | parent | next [-]

> I've used Claude for many months now. Since February I see a stark decline in the work I do with it.

I find myself repeating the following pattern: I use an AI model to assist me with work, and after some time, I notice the quality doesn't justify the time investment. I decide to try a similar task with another provider. I try a few more tests, then decide to switch over for full time work, and it feels like it's awesome and doing a good job. A few months later, it feels like the model got worse.

runarberg 5 hours ago | parent | next [-]

I wonder about this. I see two obvious possibilities (if we ignore bias):

1. The models are purposefully nerfed, before the release of the next model, similar to how Apple allegedly nerfed their older phones when the next model was out.

2. You are relying more and more on the models and using your talent less and less. What you are observing is the ratio of your work vs. the model's leaning more and more toward the model's. When a new model is released, it produces better-quality code than before, so the work improves with it, but your talent keeps deteriorating at a constant rate.

ehnto 5 hours ago | parent | next [-]

I definitely find your last point is true for me. The more work I am doing with AI the more I am expecting it to do, similar to how you can expect more over time from a junior you are delegating to and training. However the model isn't learning or improving the same way, so your trust is quickly broken.

As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.

tonyarkles 4 hours ago | parent | next [-]

> However the model isn't learning or improving the same way, so your trust is quickly broken.

One other failure mode that I've seen in my own work while I've been learning: the things that you put into AGENTS.md/CLAUDE.md/local "memories" can improve performance or degrade performance, depending on the instructions. And unless you're actively quantitatively reviewing and considering when performance is improving or degrading, you probably won't pick up that two sentences that you added to CLAUDE.md two weeks ago are why things seem to have suddenly gotten worse.

> similar to how you can expect more over time from a junior you are delegating to and training

That's the really interesting bit. Both Claude and Codex have learned some of my preferences by me explicitly saying things like "Do not use emojis to indicate task completion in our plan files, stick to ASCII text only". But when you accidentally "teach" them something that has a negative impact on performance, they're not very likely to push back, unlike a junior engineer who will either ignore your dumb instruction or hopefully bring it up.

> As you note, the developer's input is still driving the model quite a bit so if the developer is contributing less and less as they trust more, the results would get worse.

That is definitely a thing too. There have been a few times that I have "let my guard down" so to speak and haven't deeply considered the implications of every commit. Usually this hasn't been a big deal, but there have been a few really ugly architectural decisions that have made it through the gate and had to get cleaned up later. It's largely complacency, like you point out, as well as burnout trying to keep up with reviewing and really contemplating/grokking the large volume of code output that's possible with these tools.

svnt 5 hours ago | parent | prev [-]

Your version of the last point is a bit softer I think — parent was putting it down to “loss of talent” but yours captures the gaps vs natural human interaction patterns which seems more likely, especially on such short timescales.

runarberg 5 hours ago | parent [-]

I confusingly say both. First I say that the ratio of work coming from the model is increasing, and then when clarifying I say "your talent keeps deteriorating". You correctly point out that these are distinct, and maybe this distinction is important, although I personally don't think so. The resulting code would be the same either way.

Personally, I can see the case for both interpretations being true at the same time, and maybe that is precisely why I conflated them so eagerly in my initial post.

rescbr 2 hours ago | parent | prev | next [-]

I don’t think the providers intentionally nerf the models to make the new one look better. It’s a matter of them being stingy with infrastructure, either by choice to increase profit and/or sheer lack of resources to keep n+1 models deployed in parallel without deprecating older ones when a new one is released.

I’d prefer providers to simply deprecate stuff faster, but then that would break other people’s existing workflows.

flux3125 4 hours ago | parent | prev [-]

Point 2 is so true, I definitely find myself spending more time reading code vs writing it. LLMs can teach you a lot, but it's never the same as actually sitting down and doing it yourself.

e12e 5 hours ago | parent | prev [-]

I think it might have to do with how models work, and fundamental limits with them (yes, they're stochastic parrots, yes they confabulate).

Newer (past two years?) models have improved "in detail" - or as pragmatic tools - but they still don't deserve the anthropomorphism we subject them to because they appear to communicate like us (and therefore appear to think and reason, like us).

But the "holes" are painted over in contemporary models - via training, system prompts and various clever (useful!) techniques.

But I think this leads us to have great difficulty spotting the weak spots in a new, or slightly different model - but as we get to know each particular tool - each model - we get better at spotting the holes on that model.

Maybe it's poorly chosen variable names. A tendency to write plausible-looking, plausibly named e2e tests that turn out to not quite test what they appear to test at first glance. Maybe there's missing locking of resources, or missing use of transactions, in sequential code that appears sound but ends up storing invalid data when one or several steps fail...

In happy cases, current LLMs function like well-intentioned junior coders enthusiastically delivering features and fixing bugs.

But in the other cases, they are like pathologically lying sociopaths telling you anything you want to hear, just so you keep paying them money.

When you catch them lying, it feels a bit like a betrayal. But the parrot is just tapping the bell, so you'll keep feeding it peanuts.

taurath 6 hours ago | parent | prev | next [-]

I agree - the problem is it’s hard to see how people who say they’re using it effectively actually are using it, what they’re outputting, and making any sort of comparison on quality or maintainability or coherence.

In the same way, it’s hard to see how people who say they’re struggling are actually using it.

There’s truth somewhere in between “it’s the answer to everything” and “skill issue”. We know it’s overhyped. We know that it’s still useful to some extent, in many domains.

balls187 5 hours ago | parent [-]

Well summarized.

We're also seeing that the people up top are using this to cull the herd.

psychoslave 6 hours ago | parent | prev | next [-]

What is it that is dogma-free? Even if one goes hardcore Pyrrhonist, doubting that there is anything currently doing the doubting as this statement is processed, that is somehow perfectly sound.

At some point there is a need to have faith in some ground stable enough to be able to walk on.

Wolfbeta 4 hours ago | parent [-]

Who controls that need for you?

ecshafer 6 hours ago | parent | prev | next [-]

All people think dogmatically. The only difference is what the ontological commitments and metaphysical foundations are. Take out God and people will fit politics, sports teams, tools, whatever, in there. It's inescapable.

smallmancontrov 4 hours ago | parent | next [-]

All people think dogmatically, but religion does not prevent people from acting dogmatically in politics, sports, etc. It just doesn't. It never did.

Under normal circumstances I'd consider this a nit and decline to pick it, but the number of evangelists out there arguing the equivalent of "cure your alcohol addiction with crystal meth!" is too damn high.

bensyverson 6 hours ago | parent | prev | next [-]

Allow me to introduce you to Buddhism

ecshafer 5 hours ago | parent | next [-]

Elaborate. Buddhism is going to have the same epistemological issues as anything else, since it's a human-consciousness issue.

bensyverson 3 hours ago | parent | next [-]

> since its a human consciousness issue

I'd encourage you to check it out for yourself. It's certainly possible to be a dogmatic Buddhist, but one of the foundational beliefs of Buddhism is that the type of dogmatic attachment you're describing is avoidable. It's not easy, but that's why you meditate.

tauroid 4 hours ago | parent | prev [-]

https://en.wikipedia.org/wiki/Prat%C4%ABtyasamutp%C4%81da

svnt 5 hours ago | parent | prev [-]

Which one?

bensyverson 5 hours ago | parent [-]

Zen

svnt 4 hours ago | parent [-]

The Western Zen? In my experience it is downgraded from being a religion to being a system of practice which relieves it of the broader Mahayana cosmology. But I would suggest the dogma is less obvious but still there, often just somewhere else, such as in its own limitations, or in a philosophical container at a higher level such as scientism.

bensyverson 3 hours ago | parent [-]

All Zen is about releasing those attachments. Granted it's pretty hard, because if you succeed, you're enlightened.

East, West, Religion, Practice… From a Zen perspective, you're just troubling your mind with binaries and conflict.

svnt 2 hours ago | parent [-]

Ah and there is the dogma -- the otherness of the enlightened.

The binaries still functionally exist. I see a lot of value in reflective practices. At the same time it seems unlikely to me that the point of existing is to not trouble your mind.

bensyverson an hour ago | parent [-]

There's a saying in Zen: if you meet the buddha on the road, kill him. The point being, the very exaltation of enlightenment is an impediment.

If Buddhism can be said to have a goal, it is to reduce suffering (including your own), so troubling your own mind is indeed something it can help with. The point of existence would be something interesting to meditate on. If you discover it, let us all know!

OtomotO 5 hours ago | parent | prev [-]

Dogmatism is a spectrum and for too many people it's on the animal side of the scale.

taneq 5 hours ago | parent | prev | next [-]

I wonder to what degree it depends on how easy you find coding in general. I find for the early steps genAI is great to get the ball rolling, but rapidly it becomes more work to explain what it did wrong and how to fix it (and repeat until it does so) than to just fix the code myself.

redsocksfan45 4 hours ago | parent | prev [-]

[dead]