| ▲ | misnome 8 hours ago |
| I've been playing with 3.5:122b on a GH200 the past few days for rust/react/ts, and while it's clearly sub-Sonnet, with tight descriptions it can get small-medium tasks done OK - as well as Sonnet if the scope is small. The main quirk I've found is that it has a tendency to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked, and I find it has stripped all the preliminary support infrastructure for the new feature out of the code. |
|
| ▲ | sheepscreek 6 hours ago | parent | next [-] |
| That sounds awfully similar to what Opus 4.6 does on my tasks sometimes: > Blah blah blah (second-guesses its own reasoning half a dozen times, then goes) "Actually, it would be simpler to just..." Specifically on Antigravity, I've noticed it doing that to "save time" and stay within some artificial deadline. It might have something to do with the system messages and the reinforcement/realignment messages that are interwoven into the context (but never displayed to end users) to keep the agents on task. |
| |
| ▲ | jtonz 3 hours ago | parent | next [-] | | As someone who started using Co-work, I feel like I am going insane with the frequency with which I have to keep telling it to stay on task. If you ask it to do something laborious, like reviewing a bunch of websites for specific content, it will constantly give up, giving you information on how you can continue the process yourself to save time. It's maddening. | | |
| ▲ | zzrrt 2 hours ago | parent | next [-] | | That’s pretty funny when compared with the rhetoric like “AI doesn’t get tired like humans.” No, it doesn’t, but it roleplays like it does. I guess there is too much reference to human concerns like fatigue and saving effort in the training. | | |
| ▲ | martin-t 2 hours ago | parent [-] | | This is what happens when a bunch of billionaires convince people autocomplete is AI. Don't get me wrong, it's very good autocomplete and if you run it in a loop with good tooling around it, you can get interesting, even useful results. But by its nature it is still autocomplete and it always just predicts text. Specifically, text which is usually about humans and/or by humans. | | |
| ▲ | root_axis 38 minutes ago | parent [-] | | Yep. The veil of coherence extends convincingly far by means of absurd statistical power, but the artifacts of next-token prediction become far more obvious when you're running models that can work on commodity hardware.
|
| |
| ▲ | bandrami an hour ago | parent | prev | next [-] | | It really is like having an intern, then | |
| ▲ | throwup238 2 hours ago | parent | prev [-] | | In my experience all of the models do that. It's one of the most infuriating things about using them, especially when I spend hours putting together a massive spec/implementation plan and then have to sit there babysitting it, going "are you sure phase 1 is done?" and "continue to phase 2." I tend to work on things where there is a massive amount of code to write, but once the architecture is laid down it's just mechanical work, so this behavior is particularly frustrating. |
| |
| ▲ | wood_spirit 6 hours ago | parent | prev [-] | | Yeah, that happened to me with Claude Code Opus 4.6 1M for the first time today. I had to check that the model hadn't changed. It was weird. I was imagining that maybe Anthropic has a way of deciding how much resource a user actually gets, and they had suddenly downgraded me or something. | | |
| ▲ | e1g 5 hours ago | parent [-] | | Claude Code recently downgraded the default thinking level to “medium”, so it’s worth checking your settings. | | |
|
|
|
| ▲ | shaan7 6 hours ago | parent | prev | next [-] |
| > that it would be "simpler" to just... not do what I asked
|
| That sounds too close to what I feel on some days xD |
|
| ▲ | storus 5 hours ago | parent | prev | next [-] |
| > to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked
|
| That's likely coming from the 3:1 ratio of linear to quadratic attention usage. The latest DeepSeek also suffers from it, something the original R1 never exhibited. |
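For readers unfamiliar with the idea: a "3:1 ratio of linear to quadratic attention" usually means most layers use a cheap linear-attention variant and only every fourth layer attends over the full context. The sketch below is purely illustrative (the function name, layer count, and exact interleaving are made up, not the actual model config); it just shows what such a layer schedule looks like.

```python
# Toy illustration of a 3:1 linear-to-quadratic attention interleave.
# Everything here (names, counts) is hypothetical, not a real model config.

def layer_schedule(num_layers: int, ratio: int = 3) -> list[str]:
    """Every (ratio + 1)-th layer uses full (quadratic) attention;
    the rest use linear attention, which compresses history into a
    fixed-size state instead of attending over every past token."""
    return [
        "full" if (i + 1) % (ratio + 1) == 0 else "linear"
        for i in range(num_layers)
    ]

schedule = layer_schedule(8)
# ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
```

Since only a quarter of the layers see the full context verbatim, one plausible mechanism for the behavior above is that long, detailed instructions get lossily compressed by the linear layers and effectively "forgotten" mid-task.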
|
| ▲ | reactordev 8 hours ago | parent | prev | next [-] |
| Turn down the temperature and you’ll see less “simpler” short cuts. |
| |
| ▲ | smokel 6 hours ago | parent [-] | | For the uninitiated: Interestingly, it is not advisable to take this to the extreme and set temperature to 0. That would seem logical, as the results are then completely deterministic, but it turns out that a suboptimal token may result in a better answer in the long run. Also, allowing for a little bit of noise gives the model room to talk itself out of a suboptimal path. | | |
| ▲ | LoganDark 6 hours ago | parent [-] | | I like to think of this as tempering the output space. With a temperature of zero there is only one possible output, and it may be completely wrong. With even a low temperature, you drastically increase the chances that the output space contains a correct answer, since it contains multiple candidate responses rather than only one. I wonder if determinism will be less harmful to diffusion models, because they perform multiple iterations over the response rather than having only a single shot at each position with no lookahead. I'm looking forward to finding out, and have been playing with a diffusion model locally for a few days. | | |
| ▲ | reactordev 5 hours ago | parent [-] | | Yup. I think of it as: how far off the rails do you want to explore? For creative things or exploratory reasoning, a temperature of 0.8 leads us down all sorts of rabbit holes. However, when coding and needing something precise, a temperature of 0.2 is what I use. If I don't like the output, I'll rephrase or add context. |
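To make the subthread above concrete: temperature divides the logits before the softmax, so temperature 0 collapses to greedy argmax while a small positive value keeps near-miss tokens in play. This is a minimal self-contained sketch of that standard mechanism (the token names and logit values are invented for illustration):

```python
import math
import random

def sample_with_temperature(logits: dict[str, float], temperature: float) -> str:
    """Sample one token from logits scaled by temperature."""
    if temperature == 0:
        # Greedy decoding: deterministically pick the highest-logit token.
        return max(logits, key=logits.get)
    # Divide logits by temperature, then apply a numerically stable softmax.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Hypothetical next-token logits at some decoding step.
logits = {"refactor": 2.0, "delete": 1.9, "comment": 0.5}
greedy = sample_with_temperature(logits, 0)  # always "refactor"
```

At temperature 0 the near-tied "delete" token can never be chosen, even if it would lead to a better completion downstream; at 0.2 it still gets an occasional chance, which is the "room to talk itself out of a suboptimal path" described above.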
|
|
|
|
| ▲ | slices 3 hours ago | parent | prev | next [-] |
| I've seen behavior like that when the model wasn't being served with a sufficiently sized context window. |
|
| ▲ | Aurornis 5 hours ago | parent | prev [-] |
| > The main quirk I've found is that it has a tendency to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked
|
| This is my experience with the Qwen3-Next and Qwen3.5 models, too. I can prompt with strict instructions saying "** DO NOT..." and it follows them for a few iterations. Then it has a realization that it would be simpler to just do the thing I told it not to, which leads it to the dead end I was trying to avoid. |