| ▲ | Tiberium 9 hours ago |
| It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a simple (relatively) test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file. I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task. Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient. Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think. |
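For anyone who wants to sanity-check those numbers, here is a rough sketch in Python. It only uses the figures quoted in the comment above; the function names are made up for illustration, and the break-even calculation assumes cost is dominated by output tokens (input tokens and caching are ignored).

```python
# Average output tokens per task, as quoted in the comment above.
AVG_OUTPUT_TOKENS = {
    "GPT 5.5 high": 10_000,
    "GPT 5.5 xhigh": 16_000,
    "Fable 5": 33_000,
    "Opus 4.8": 41_000,
    "GLM 5.2": 42_000,
}


def tokens_per_second(tokens: int, minutes: float) -> float:
    """Average generation speed for a run of `tokens` tokens over `minutes`."""
    return tokens / (minutes * 60)


def break_even_price_ratio(model: str, baseline: str) -> float:
    """Per-token price ratio (model / baseline) below which `model`
    is cheaper per request than `baseline`, counting output tokens only."""
    return AVG_OUTPUT_TOKENS[baseline] / AVG_OUTPUT_TOKENS[model]


# 45k reasoning tokens in 15 minutes
print(tokens_per_second(45_000, 15))  # 50.0

# GLM 5.2 vs GPT 5.5 xhigh
print(round(break_even_price_ratio("GLM 5.2", "GPT 5.5 xhigh"), 3))  # 0.381
```

So under that assumption GLM 5.2 only wins on request cost if its per-token price is below roughly 38% of GPT 5.5 xhigh's, and the 15 minute wait works out to about 50 tokens per second.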
|
| ▲ | benjiro29 8 hours ago | parent | next [-] |
| GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The thinking chain is so similar, and so is the amount of token usage on the output. If you want reasonable token usage, you need to run GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks), and it cuts token usage by 2 to 2.5x. GLM 5.2 Max is really something you only need for complex tasks. In essence, GLM 5.2 is Opus 4.8's little brother, at a way, WAY cheaper price. There has been really no training on Opus models going on, really, none I tell you! /sarcasm |
| |
| ▲ | matheusmoreira 2 minutes ago | parent | next [-] | | > GLM 5.2 Max = Opus 4.8 Max in thinking behavior This is insane! I can't wait until technology progresses to the point we can run these things on consumer hardware! | |
| ▲ | vitalyan123 8 hours ago | parent | prev | next [-] | | distillation of thinking models is not particularly effective - both "Open"AI and Misanthropic don't show you the real chain of thought, only its severely downscaled version. both do everything in their power to combat such outrageous copyright infringement, so the bulk of unethically scraped data the Chinese have is from several generations ago. | | |
| ▲ | nyrikki 44 minutes ago | parent | next [-] | | It is quite likely that the intermediate tokens don’t have ‘semantic import’ [0]. There are methods like Habitual Reasoning Distillation or Inverted Reasoning Traces [1] that can help. While there are reasons to hide the intermediate tokens from an IP protection standpoint, there is also a need to hide more effective and efficient generation that doesn’t fit the R1 claims of an aha moment, which have been debunked but remain a consumer expectation. While hidden intermediate tokens do increase the difficulty, they are not a firm barrier in themselves, especially as they are billed, which gives information about their length. [0] https://arxiv.org/abs/2504.09762v4 [1] https://arxiv.org/abs/2603.07267 | |
| ▲ | Bolwin 32 minutes ago | parent | prev | next [-] | | For Claude models at least, you can tell them to just manually think in the output and it works fine. I do it regularly because for creative writing and summarization, they seem to believe they don't need to think at all, and get way worse results. | |
| ▲ | duskdozer 8 hours ago | parent | prev | next [-] | | >such outrageous copyright infringement Sarcasm, considering the source of their own training data? | | | |
| ▲ | overfeed 4 hours ago | parent | prev | next [-] | | FYI: model outputs are not protected by copyright. | |
| ▲ | ComputerGuru 5 hours ago | parent | prev | next [-] | | Supposedly there are “jailbreaks” that expose considerably more of the thinking traces. | |
| ▲ | mannanj 6 hours ago | parent | prev [-] | | The companies that did copyright infringement and unethically scraped data think that copyright infringement and unethically scraping data is wrong and needs to be stopped. Though only in particular situations, like when it’s done to them and not when they do it. Cause they have the power and are morally right and know better than you. And if you question this at all, well you’re a threat to American values and a supporter of the Chinese and leading to the breakdown of Democracy. This isn’t a type of reasoning argument or manipulation tactic used by the rich throughout history to trick the naive and gullible masses or anything like that. Trust me, I’m rich and I’m morally right. /sarcasm |
| |
| ▲ | maxdo 4 hours ago | parent | prev | next [-] | | looking at the score this is rather a gemini 3.5 flash competitor, yes, for cheaper, but distance to opus and fable is as big as their price diff. | |
| ▲ | FooBarWidget 3 hours ago | parent | prev [-] | | With such ridiculously long thinking traces I'm surprised max outperforms high. After all, performance falls off a hill after a certain amount of context, and long thinking traces can fill that up really quickly. |
|
|
| ▲ | alexjplant 2 hours ago | parent | prev | next [-] |
| > It seems to really be a nice step-up and is getting quite close to the frontier. IMHO it's already surpassed them. I vastly prefer my personal GLM and OpenCode setup to the Claude Code and Opus one that I have to use at work. The former makes way fewer StackOverflow brogrammer-tier mistakes and is considerably better at following instructions. The harness UX is also vastly superior as it doesn't ignore, randomly change, or incorrectly report settings. Maybe it's the harness and I'd have even greater success with OpenCode and Anthropic, but I think it's safe to say that Anthropic's moat is evaporating. |
|
| ▲ | vorticalbox 9 hours ago | parent | prev | next [-] |
| This is a problem I find with Opus: it will spend so long thinking, then going “but wait, what if”, to the point where I stop it and simply tell it to “start writing code, you can work it out as you go along”. Seems writer's block also affects LLMs. |
| |
| ▲ | robertkarl 6 hours ago | parent | next [-] | | https://arxiv.org/abs/2606.00206 In this paper they nerf an LLM's ability to emit waffling thinking tokens like "wait", "but", "alternatively", and the models (they're old, small models in the paper) terminate reasoning faster and perform better. I bet Anthropic is tuning this on their backend. | | |
| ▲ | meatmanek 2 hours ago | parent [-] | | This is super cool. Do you know if any of the inference backends (llama.cpp, vllm, etc) support this technique? |
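Not an answer for any specific backend, but the general idea of suppressing tokens at sampling time can be sketched in a few lines of Python. This is a toy illustration, not the paper's method: the vocabulary and logits are made up, and real tokenizers often split a word like "alternatively" into several tokens, so a real implementation would have to ban token IDs rather than words.

```python
import math


def suppress_tokens(logits: list[float], banned_ids: set[int]) -> list[float]:
    """Return a copy of `logits` with banned token IDs set to -inf,
    so they can never be chosen by greedy decoding or sampling."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]


def greedy_pick(logits: list[float]) -> int:
    """Index of the highest logit (greedy decoding for one step)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical toy vocabulary and logits for a single decoding step.
VOCAB = ["wait", "but", "alternatively", "so", "therefore"]
logits = [3.0, 2.5, 1.0, 2.0, 0.5]
banned = {VOCAB.index(w) for w in ("wait", "but", "alternatively")}

print(VOCAB[greedy_pick(logits)])                           # wait
print(VOCAB[greedy_pick(suppress_tokens(logits, banned))])  # so
```

Whether a given inference backend exposes a hook like this (often called a logit bias or logits processor) is something to check in its own documentation.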
| |
| ▲ | giancarlostoro 7 hours ago | parent | prev | next [-] | | I usually have Claude build a plan first, then I put it into an XML file it updates with phases, usually we talk about some of those tasks, and then once it's good and I like it, I have Claude implement the plan. Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise. | | |
| ▲ | xstas1 7 hours ago | parent [-] | | XML?? | | |
| ▲ | giancarlostoro 7 hours ago | parent | next [-] | | Apparently because of how Claude is trained, even the system level prompts go through as XML, it works better with XML "prompting" so I figured I could have it write plans in XML. I need to update my ticketing tool to output XML maybe by default. https://www.reddit.com/r/ClaudeAI/comments/1psxuv7/anthropic... | | |
| ▲ | saltsucker 6 hours ago | parent [-] | | Comments later in the thread say markdown works just as well and that it’s more important to organize your plan into sections. Also just think about it, why would a model trained on the world’s corpus of text (that isn't formatted in XML) perform better with XML? It would be a better study if that post tested markdown, org, xml, json, etc. 10 times to see if there is a difference | | |
| |
| ▲ | root-parent 6 hours ago | parent | prev [-] | | XML stands for Xtra ML.... | | |
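For the curious, a plan file with phases like the one described above could look something like this. It is a minimal sketch using Python's standard library; the element and attribute names (`plan`, `phase`, `task`, `status`) and the example tasks are invented for illustration, not any format Claude requires.

```python
import xml.etree.ElementTree as ET


def build_plan(title: str, phases: list[tuple[str, list[str]]]) -> ET.Element:
    """Build a <plan> element with one <phase> per entry, each holding <task> items."""
    plan = ET.Element("plan", {"title": title})
    for number, (name, tasks) in enumerate(phases, start=1):
        phase = ET.SubElement(
            plan, "phase", {"id": str(number), "name": name, "status": "pending"}
        )
        for task in tasks:
            ET.SubElement(phase, "task").text = task
    return plan


def mark_done(plan: ET.Element, phase_id: int) -> None:
    """Flip a phase's status to 'done' once it has been implemented."""
    phase = plan.find(f"./phase[@id='{phase_id}']")
    if phase is None:
        raise KeyError(f"no phase with id {phase_id}")
    phase.set("status", "done")


# Hypothetical plan contents.
plan = build_plan(
    "math evaluator",
    [
        ("tokenizer", ["define token types", "write the lexer"]),
        ("parser", ["handle precedence", "handle parentheses"]),
    ],
)
mark_done(plan, 1)

ET.indent(plan)  # pretty-printing, requires Python 3.9+
print(ET.tostring(plan, encoding="unicode"))
```

The same structure maps directly onto markdown headings and checklists, which fits the point above that the sectioning probably matters more than the syntax.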
|
| |
| ▲ | mikeocool 8 hours ago | parent | prev | next [-] | | Seriously. Whenever I read the thinking output I get mad and turn down effort to medium or low. Just output the code and we’ll work through it! I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway. | | |
| ▲ | SubiculumCode 3 hours ago | parent [-] | | A lot of times this is how humans work. Just start 'putting words on paper', 'think by doing', etc. sometimes it's more efficient to see why something won't work after writing a bit of it, and sometimes you get lucky and it works right off the bat |
| |
| ▲ | drob518 3 hours ago | parent | prev | next [-] | | Qwen is notorious for this, too. It’ll sometimes spin in a long loop of “But wait…” paragraphs. | |
| ▲ | epolanski 8 hours ago | parent | prev | next [-] | | Fable was 20 times worse on that. It's clear it was the vibe coding model: like no other model before, it fully turned you into its assistant instead of the other way around. | | |
| ▲ | RyanHamilton 8 hours ago | parent | next [-] | | Could it be possible, these firms are optimizing for two things: a) Better performance. b) Gathering data from you to further improve performance later. I've also found the huge amount of planning rather than iteration frustrating. I've felt like I'm teaching a junior! | | |
| ▲ | epolanski 8 hours ago | parent | next [-] | | I think they simply optimize around E2E benchmarks, none of those benchmarks is designed as multi turn assistance to the user, but going from a prompt straight to the final solution. | | |
| ▲ | celrod 2 hours ago | parent [-] | | Exactly. How can "we" develop and encourage benchmarks for multi-turn user assistance?
That is what I want.
I feel like the models and harnesses push much too hard against this workflow -- that they push you towards letting go and vibe coding, with only your discipline (and desire for a quality and maintainable product) holding it back. |
| |
| ▲ | happyPersonR 6 hours ago | parent | prev [-] | | more thinking == more tokens === more money LOLL | | |
| ▲ | overfeed 4 hours ago | parent | next [-] | | Is there a cost benchmark out there? I wonder how frontier models are doing over time for cost per problem solved. | |
| ▲ | drob518 3 hours ago | parent | prev [-] | | I think they are optimizing for one-shot performance because that will drive usage. They can’t afford to look bad in the benchmarks. And if that means consuming an order of magnitude more tokens, well, that’s good for business, too. |
|
| |
| |
| ▲ | thinkingtoilet 8 hours ago | parent | prev [-] | | I've been having success with Opus but you REALLY have to tame it. Long prompts that list what files to look at, relationships between entities, etc... I went from regularly hitting my daily limit to almost never hitting it. Oh, and also I was being lazy with small changes and stopping that helped a lot too. As you said, it gets in these loops where it's just churning and if you don't stop it it can go on for way too long. |
|
|
| ▲ | h14h 7 hours ago | parent | prev | next [-] |
| Hopefully the recent work Moonshot did with Kimi K2.7 Code trickles in to the other open-model labs. Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there. |
|
| ▲ | bertili 9 hours ago | parent | prev | next [-] |
| This is GLM 5.2 Max. GLM 5.2 High uses less than half [1] the tokens. [1] https://z.ai/blog/glm-5.2 |
| |
| ▲ | Tiberium 9 hours ago | parent [-] | | Yes, but the Artificial Analysis result is also from GLM 5.2 (max), not high. | | |
| ▲ | andai 9 hours ago | parent [-] | | They have this with a lot of models, measuring only the max setting, while the one you'd actually want to use for most tasks is much lower. | | |
| ▲ | epolanski 8 hours ago | parent [-] | | For the brief period we had Fable, I never had to use it above medium. Low nailed the overwhelming majority of mundane tasks on its own, medium was good for more complex stuff. |
|
|
|
|
| ▲ | robmccoll 7 hours ago | parent | prev | next [-] |
| That's interesting. I gave nearly the same task to Gemma4 31b as a test yesterday. Write a symbolic math engine in TypeScript that can perform evaluation and simple expression reductions over +-/*(). It performed the task correctly with minimal reasoning - far fewer reasoning tokens than output tokens. |
| |
| ▲ | gbingles 5 hours ago | parent [-] | | Tbh, so what? I googled "symbolic math engine in Typescript that can perform evaluation and simple expression reductions over +-/*()" and got what looks to be viable answers without using any AI model at all. Reciting well established things from memory isn't terribly interesting. Show it a novel codebase and have it implement something within it. | | |
| ▲ | SubiculumCode 3 hours ago | parent | next [-] | | TBH, while your point is a fair one, your attitude is off-putting and needlessly condescending. | |
| ▲ | drob518 3 hours ago | parent | prev [-] | | So, a natural question would be why a model would ever get it wrong? |
|
|
|
|
| ▲ | rdsubhas 6 hours ago | parent | prev | next [-] |
| As per stats in other comments, it is frontier, not close to frontier. |
|
|
| ▲ | cmrdporcupine 8 hours ago | parent | prev | next [-] |
| > Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think. GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly. And this was high, not max. |
|
| ▲ | esafak 5 hours ago | parent | prev [-] |
| I agree. I've noticed that it is quite smart but it has a tendency to doubt itself and overthink. I monitor its internal dialogue and prod it when it does this. They need to optimize the chain of thought early stopping. |