steve_adams_86 6 hours ago

It appears to be working for me, but... Maybe it's silently degrading? It's hard to say.

Retr0id 6 hours ago | parent | next [-]

The fact that it's hard to say is funny, in contrast with the fanfare surrounding the launch of Fable.

greenavocado 6 hours ago | parent [-]

Fable is currently way below many other models in the rankings due to some sort of internal throttling https://aistupidlevel.info/

GPT-5.4 is currently the strongest model (this changes hourly)

Methodology: https://aistupidlevel.info/faq#methodology

Retr0id 5 hours ago | parent | next [-]

Well, that's certainly some web design.

DetroitThrow 6 hours ago | parent | prev [-]

Methodology leaves a lot to be desired in terms of understanding the tasks you've used. Being detailed about why they're more meaningful tests than the long horizon and coding tests used by other rankings is important.

False positives and poorly defined tasks/acceptance criteria have let some models have insanely inflated scores on bad benchmarks.

And sure, you can say they're not disclosed to prevent gaming, but if you're the only one who can review them then it might as well be a random number generator display with an unreadable UI.

greenavocado 5 hours ago | parent [-]

You're not wrong, but the scores track with my experience switching between the proposed top variants. So there's my unscientific "evidence."

nrmitchi 5 hours ago | parent | prev | next [-]

I don't know how fast they reacted, but shortly after their documented time I started getting Opus availability errors from Fable requests, which seemed odd.

I'd also think that they would transparently degrade, just to prevent production outages for clients that are requesting Fable explicitly.

steve_adams_86 5 hours ago | parent [-]

I mean it's hard to say on such short notice because they can swap out models without any notice. In terms of performance, I'm not asking it to do anything crazy so I think results would be similar across both models.

It did just use a small harness to run docker compose with different envs and other settings to validate a very small change, so... Feels like Fable

nrmitchi 5 hours ago | parent [-]

No, I mean I was using Fable (or trying to) and got an API error: "Error: claude-opus-4-8[1m] is temporarily unavailable"

re-thc 6 hours ago | parent | prev | next [-]

> Maybe it's silently degrading? It's hard to say.

Opus 4.8 spams a lot more text. It'd be obvious.

blueaquilae 5 hours ago | parent | prev [-]

But token price is still fable level?