kalkin 4 hours ago

AFAICT this uses a token-counting API to count the tokens in the same prompt under both tokenizers, so it's measuring the tokenizer change in isolation. Smarter models also sometimes produce shorter outputs, and therefore fewer output tokens. That doesn't mean Opus 4.7 necessarily nets out cheaper; it might still be more expensive. But this comparison alone isn't very useful.
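
To make that concrete, the comparison amounts to something like this sketch. The counts are made up, standing in for whatever a token-counting API would report for each model's tokenizer:

```python
# Hypothetical: the same prompt counted under two tokenizers.
old_count = 1000   # made-up count under the 4.6 tokenizer
new_count = 1300   # made-up count under the 4.7 tokenizer

increase = (new_count - old_count) / old_count
print(f"input tokens up {increase:.0%} for the same prompt")
# Isolates the tokenizer change, but says nothing about how many
# output tokens (or turns) each model needs to finish a task.
```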

h14h 4 hours ago | parent | next [-]

For some real data, Artificial Analysis reported that 4.6 (max) and 4.7 (max) used 160M tokens and 100M tokens to complete their benchmark suite, respectively:

https://artificialanalysis.ai/?intelligence-efficiency=intel...

Looking at their cost breakdown, while input cost rose by $800, output cost dropped by $1400. Granted, whether the output savings offset the input increase will be very use-case dependent, and I imagine the delta is a lot closer at lower effort levels.
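
As back-of-envelope arithmetic on those reported figures:

```python
# Deltas for 4.7 (max) vs 4.6 (max) on the Artificial Analysis suite,
# per the cost breakdown cited above (rounded dollar figures).
input_delta = +800    # input cost rose by ~$800
output_delta = -1400  # output cost dropped by ~$1400

net = input_delta + output_delta
print(f"net cost change: {net:+d} USD")  # negative means 4.7 came out cheaper
```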

theptip 2 hours ago | parent [-]

This is the right way of thinking end-to-end.

Tokenizer changes are one piece to understand for sure, but as you say, you need to evaluate $/task not $/token or #tokens/task alone.
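
A toy illustration of why $/task is the right unit (all numbers and both "models" are hypothetical):

```python
def cost_per_task(in_tok, out_tok, p_in, p_out):
    """Dollars to finish one task, given token usage and $/Mtok prices."""
    return (in_tok * p_in + out_tok * p_out) / 1_000_000

# Model A: cheaper per-prompt input, but needs more output to finish.
a = cost_per_task(in_tok=200_000, out_tok=50_000, p_in=3.0, p_out=15.0)
# Model B: 30% more input tokens per prompt, but finishes with less output.
b = cost_per_task(in_tok=260_000, out_tok=20_000, p_in=3.0, p_out=15.0)
print(f"A: ${a:.2f}/task, B: ${b:.2f}/task")
```

Here B costs more per prompt on input alone but still comes out cheaper end-to-end, which is exactly the effect a tokens-in-the-prompt comparison can't see.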

SkyPuncher 4 hours ago | parent | prev | next [-]

Yes. I actually noticed my token usage go down on 4.6 when I started switching every session to max effort. I got work done faster, with fewer steps, because the thinking corrected itself before it started cycling.

I’ve noticed 4.7 cycling a lot more on basic tasks, though it also seems a bit better at holding long-running context.

manmal 4 hours ago | parent | prev | next [-]

Why is it not useful? Input token pricing is unchanged for 4.7, and the same prompt now costs roughly 30% more in input tokens.

dktp 4 hours ago | parent | next [-]

The idea is that smarter models might use fewer turns to accomplish the same task - reducing the overall token usage

Though, from my limited testing, the new model is far more token hungry overall

manmal 4 hours ago | parent [-]

Well, you'll need the same prompt for input tokens?

httgbgg 3 hours ago | parent [-]

Only the first one. Ideally now there is no second prompt.

manmal 3 hours ago | parent [-]

Are you aware that every tool call produces output which also counts as input to the LLM?
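
A toy sketch of that effect (all token counts hypothetical): each tool result is appended to the conversation and re-sent on every subsequent turn, so it gets billed as input repeatedly.

```python
system_prompt = 2_000              # tokens re-sent on every turn
tool_results = [500, 800, 1_200]  # tokens returned by three tool calls

context = system_prompt
billed_input = 0
for result in tool_results:
    billed_input += context   # this turn's request re-sends the whole context
    context += result         # the tool's output joins the context for next turn
billed_input += context       # final turn after the last tool result

print(f"input tokens billed across turns: {billed_input}")
```

So a longer agentic session pays for its tool output many times over as input, not just once as output.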

kalkin 4 hours ago | parent | prev [-]

That's valid, but it's also worth knowing it's only one part of the puzzle. The submission title doesn't say "input".

the_gipsy 3 hours ago | parent | prev [-]

With AI, it seems like there's never a comparison that's actually useful.

theptip 2 hours ago | parent | next [-]

You can build evals. Look at Harbor or Inspect. It’s just more work than most are interested in doing right now.

jascha_eng 3 hours ago | parent | prev [-]

Yup, it's all vibes. And Anthropic is still winning on those in my book.