ericol 3 hours ago

I did some work yesterday with Opus and found it amazing.

Today we are almost on non-speaking terms. I'm asking it to do some simple stuff and he's making incredibly stupid mistakes:

    This is the third time that I have to ask you to remove the issue that was there for more than 20 hours. What is going on here?
and at the same time the compacting is firing like crazy (which adds ~4-minute delays every 1-15 minutes).

  | # | Time     | Gap before | Session span | API calls |
  |---|----------|-----------|--------------|-----------|
  | 1 | 15:51:13 | 8s        | <1m          | 1         |
  | 2 | 15:54:35 | 48s       | 37m          | 51        |
  | 3 | 16:33:33 | 2s        | 19m          | 42        |
  | 4 | 16:53:44 | 1s        | 9m           | 30        |
  | 5 | 17:04:37 | 1s        | 17m          | 30        |
  # — sequential compaction event number, ordered by time.

  Time — timestamp of the first API call in the resumed session, i.e. when the new context (carrying the compaction summary) was first sent to the
  model.

  Gap before — time between the last API call of the prior session and the first call of this one. Includes any compaction processing time plus user
   think time between the two sessions.

  Session span — how long this compaction-resumed session ran, from its first API call to its last before the next compaction (or end of session).

  API calls — total number of API requests made during this resumed session. Each tool use, each reply, each intermediate step = one request.
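For anyone who wants to reproduce this kind of table from their own logs, here's a minimal sketch of how those four metrics fall out of per-session API-call timestamps. The session data below is made up for illustration, not my actual log:

```python
from datetime import datetime

# Hypothetical per-session API-call timestamps (HH:MM:SS), one list per
# compaction-resumed session. Real numbers would come from your own logs.
sessions = [
    ["15:51:13"],
    ["15:54:35", "16:33:31"],
    ["16:33:33", "16:52:43"],
]

def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%H:%M:%S")

rows = []
prev_last = None  # last API call of the previous session
for i, calls in enumerate(sessions, start=1):
    first, last = parse(calls[0]), parse(calls[-1])
    # Gap before: last call of prior session -> first call of this one
    gap_s = (first - prev_last).total_seconds() if prev_last else None
    # Session span: first call -> last call within this session
    span_s = (last - first).total_seconds()
    rows.append((i, calls[0], gap_s, span_s, len(calls)))
    prev_last = last

for n, t, gap_s, span_s, n_calls in rows:
    gap = f"{gap_s:.0f}s" if gap_s is not None else "-"
    print(f"{n} | {t} | {gap} | {span_s / 60:.0f}m | {n_calls}")
```

Note that "Gap before" folds together compaction processing time and user think time; the logs alone can't separate the two.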

Bottom line, I will probably stay on Sonnet until they fix all these issues.
aulin 3 hours ago | parent | next [-]

They won't. These are not "issues", it's them trying to push the models to burn less compute. It will only get worse.

criemen 3 hours ago | parent | next [-]

> it's them trying to push the models to burn less compute

I'm curious, how does using more tokens save compute?

b65e8bee43c2ed0 2 hours ago | parent | next [-]

productivity (tokens per second per hardware unit) increases at the cost of output quality, while the price stays the same.

both Anthropic and OpenAI quantize their models a few weeks after release. they'd never admit it out loud, but it's more or less common knowledge now. no one has enough compute.

sthimons 2 hours ago | parent | next [-]

Pretty bold claim - you have a source for that?

Rapzid 2 hours ago | parent [-]

There is no evidence TMK that model accuracy changes due to release cycles or capacity issues. Only latency. Both Anthropic and OpenAI have stated they don't do any inference-compute shenanigans due to load, or post-release model optimization.

Tons of conspiracy theories and accusations.

I've never seen any compelling studies (or even raw data) to back any of it up.

cebert 2 hours ago | parent | prev [-]

Do you have a source for that claim?

b65e8bee43c2ed0 2 hours ago | parent [-]

my source is that people have been noticing this since the GPT-4 days.

https://arxiv.org/pdf/2307.09009

but of course, this isn't a written statement by a corporate spokesperson. I don't think breweries make such statements when they water down their beer either.

shortstuffsushi 3 hours ago | parent | prev | next [-]

I think the idea is that each action uses more tokens, which means users hit their limit sooner and are consequently unable to burn more compute.

ryanschaefer 3 hours ago | parent [-]

What?

bloppe 3 hours ago | parent | prev [-]

It could be the adaptive reasoning

rustyhancock 2 hours ago | parent | prev [-]

If you've not seen Common People Black Mirror episode I strongly recommend it.

The only misprediction it makes is that AI is creating the brain dead user base...

You have to hook your customers before you reel them in!

https://www.netflix.com/gb/title/70264888?s=a&trkid=13747225...

whalesalad 3 hours ago | parent | prev | next [-]

I am having a shit experience lately. Opus 4.7, max effort.

> You're right, that was a shit explanation. Let me go look at what V1 MTBL actually is before I try again.

> Got it — I read the V1 code this time instead of guessing. Turns out my first take was wrong in an important way. Let me redo this in English.

:facepalm:

tremon 3 hours ago | parent | next [-]

> I read the V1 code this time instead of guessing

Does the LLM even keep a (self-accessible) record of previous internal actions to make this assertion believable, or is this yet another confabulation?

johnmaguire 2 hours ago | parent [-]

Yes, the LLM is able to see the entire prior chat history including tool use. This type of interaction occurs when the LLM fails to read the file, but acts as though it had.

al_borland 3 hours ago | parent | prev | next [-]

This seems like the experience I've had with every model I've tried over the last several years. It seems like an inherent limitation of the technology, despite the hyperbolic claims of those with a financial stake in all of this paying off.

smt88 3 hours ago | parent [-]

Opus 4.6 pre-nerf was incredible, almost magical. It changed my understanding of how good models could be. But that's the only model that ever made me feel that way.

al_borland 2 hours ago | parent | next [-]

That was better, but still not to the point that I just let it go on my repo.

whalesalad 3 hours ago | parent | prev [-]

Yes! I genuinely got a LOT of shit done with Opus 4.6 "pre nerf" with regular old out-of-the-box config, no crazy skills or hacks or memory tweaks or anything. The downfall is palpable. Textbook rugpull.

ericol 3 hours ago | parent | prev | next [-]

Matches what I am experiencing. It makes incredibly stupid mistakes.

The weird stuff is yesterday I asked it to test and report back on a 30+ commit branch for a PR and it did that flawlessly.

ed_elliott_asc 2 hours ago | parent | prev | next [-]

If it isn’t working for you, why don’t you choose an older model? 4.6

alphabettsy 3 hours ago | parent | prev [-]

The docs suggest not using max effort in most cases to avoid overthinking :shrug:

whalesalad 3 hours ago | parent [-]

They've jumped the shark. I truly can't comprehend why all of these changes were necessary. They had a literal money-printing machine that actually got real shit done, really well. Now it's a gamble every time, and I am pulling back hard from the Anthropic ecosystem.

geraldwhen 2 hours ago | parent [-]

It seems clear that it was a money spending machine, not a money printing machine.

cadamsdotcom 2 hours ago | parent | prev [-]

> he’s making .. mistakes

Claude and other LLMs do not have a gender; they are not a “he”. Your LLM is a pile of weights, prompts, and a harness; anthropomorphising like this is getting in the way.

You’re experiencing what happens when you sample repeatedly from a distribution. Given enough samples, the probability of eventually hitting a bad session approaches 100%.

Just clear the context, roll back, and go again. This is part of the job.
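To make the sampling point concrete: if each independent session has some small chance p of going bad, the chance of at least one bad session in n sessions is 1 - (1 - p)^n, which climbs toward 1 fast. The 5% failure rate below is purely illustrative:

```python
def p_at_least_one_bad(p: float, n: int) -> float:
    """Probability of at least one bad session in n independent
    sessions, given a per-session failure rate p."""
    return 1 - (1 - p) ** n

# With an illustrative 5% per-session failure rate:
for n in (1, 10, 50, 200):
    print(n, round(p_at_least_one_bad(0.05, n), 3))
```

Even at 5% per session, fifty sessions put you above a 90% chance of having seen at least one bad one, so "Opus was great yesterday, terrible today" is exactly what repeated sampling predicts.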

yokoprime 2 hours ago | parent [-]

Why be so upset at someone using pronouns for an LLM?