This matches my experience. Burned $2K to see how it will perform on frontend tasks and backend tasks.

Frontend did a significantly better job than Opus on toy-scale wireframe projects by using gimmicks like fluid dynamics. Then when given medium to big tasks like multi-page web app where layouts and aesthetics must be decided by model itself, results by Fable and Opus scored indistinguishable score from human judges.

Backend, gave tasks related to setting up a data flow that involves Postgres, R2, Kubernetes, gVisor, so on. The noticeable gap was, Opus did better than Sonnet, but Fable actually returned a result that fails and confidently stated it ran X, Y, Z tests to ensure it works and got these results. Very surprising, given neither Opus nor Sonnet suffered such problem.

Longest frontend task was ~2H. Backend, 8H.

Though none of the tasks were related to developing LLMs, (just production grade secure system that could've been developed 20 years ago, no LLMs involved), it is possible Claude Fable downgraded itself or spitted out fake results. There'd be no way of knowing since Anthropic silently degrades model quality based on undisclosed internal criteria which claims to be about LLMs.

We decided Fable is unpredictable and cannot be trusted to the degree that Opus and Sonnet can be trusted for any projects beyond toy-scale quick wireframes, but Fable can be the best tool for quick UI UX wireframing for non-technical roles.

▲ jasondigitized an hour ago | parent | next [-]

A single 8h task? I'm sorry, but that's just asking for trouble.

▲

queuebert 14 minutes ago | parent | next [-]

I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.

▲

maxall4 9 minutes ago | parent | prev [-]

Indeed, according to METR, Mythos only achieved an 80% success rate with 3 hour tasks. https://metr.org/time-horizons/

	▲	jwood27 2 minutes ago \| parent [-]
		Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.

▲ espeed 2 hours ago | parent | prev | next [-]

Run /model after your task to see. Mine keeps downgrading to Opus 4.8, which is a problem because Opus 4.8 keeps no-oping critical security code.

▲ tekacs an hour ago | parent | next [-]

What you're describing only applies to security or biotech downgrades. A downgrade related to the model believing that you're doing something related to model development is invisible and silent and internal.

▲

steveklabnik an hour ago | parent [-]

Anthropic has reversed that decision. (But that just happened so it might have been true during the article's testing.)

	▲	tekacs an hour ago \| parent [-]
		I was just coming here to post this reply to myself! You're absolutely right! :) Honestly so glad to see the reversal.

▲ comboy 2 hours ago | parent | prev [-]

There is in /config "Switch models when a message is flagged" now which can be set to false, but I had no chance to see what happens then, does it just stop or what.

	▲	espeed 2 minutes ago \| parent \| next [-]
		This: Session paused `Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback with /feedback or learn more 1. Switch to Opus 4.8 2. Edit prompt and retry with Fable 5`
	▲	an hour ago \| parent \| prev [-]
		[deleted]

▲ weatherlight 2 hours ago | parent | prev | next [-]

I had almost the opposite experience.

I'm building a compiler for a language without a tracing GC, so a big chunk of the work is around memory management: functional in-place update, reuse analysis, and a Perceus-style reference-counting strategy similar to what Koka uses. The hard part was that my use case wasn't exactly covered by the Koka/Perceus paper. The prior art got me maybe 75% of the way there, but the remaining 25% was a cluster of bugs with very similar shapes and no obvious published solution.

With Opus, I kept getting stuck in this loop where it would fix one case, but break another case elsewhere in codegen. We ended up with something like 16 failed experiments just for one bug class. The workflow was: run an experiment, identify the shape of the bug, propose a fix, check whether it emitted the correct Zig, then see if the fix broke any previous memory-management cases. It was useful, but it kept choking on the parts where there wasn't clean prior art to lean on.

Fable was a different story for me. It one-shotted the Class A bug cluster, and then basically said "by the way, your previous attempts have these structural problems." More importantly, it identified the other related bug classes and came up with workable strategies for applying the Perceus-style memory management in those shapes too.

That's obviously anecdotal, and I'm not claiming Fable is universally better. But in my case, this was not a toy frontend wireframe. It was compiler work involving ownership, reuse, RC/drop behavior, and Zig codegen. The thing that surprised me was that Fable seemed better precisely where the problem wasn't just "reproduce known prior art", but required filling in a missing piece.

Also worth noting: I'm not using the API. I'm using the Max plan, so maybe there are product-path differences here. But I definitely did not have the "unpredictable beyond toy-scale" experience. For this particular compiler/memory-management problem, it probably saved me a ridiculous amount of time and money.

▲

comboy 2 hours ago | parent | next [-]

'by the way, your previous attempts have these structural problems."

Just to be clear, it did not have access to any previous work that opus did? Because they are pretty good at digging out relevant tmp files and making use of whatever is out there.

With my fable adventures I caught it hallucinating something and stating it as a fact in CLI twice. And it was something that I did not see opus do in such way, opus obviously many times stated some things that it did not verify but guessed, but fable said something like "the probe showed that ..." - but there was no probe, it was not about some past events it was about what it was doing right now. "I overstated"...

But boy does it know Chinese, so much better than any other english model, gemini used to be the king but fable clearly was trained on a decent amount of it. It has a deep cultural understanding.

	▲	weatherlight 26 minutes ago \| parent [-]
		Yes, iit had access. Thats actually the point. I maintain a failure registry in the repo. Every failed attempt gets documented with the exact mechanism, the test that regressed, the revert SHA, and an instruction to start from that frontier. Fable read all of it. But so did Opus. Each of the 16 Opus failures ran in the same harness with the same accumulating registry. By attempt 15, it had disproofs 1–14 in context. By the end, Opus had basically the same corpus that Fable started with, and it still kept failing, sometimes by re-deriving an already-disproved approach in a slightly different shape. So “it leveraged the previous work” doesn’t really separate them. Both had the leverage. Only one converted it. What changed wasn’t more context. It was that Fable rejected a premise inside the context. The registry’s standing framing was: “this needs whole-program borrow inference, which conflicts with per-module incrementality” (architecturally blocked.) Fable ran around 5 fresh attempts in-session, hit the same wall, and then noticed the framing was a red herring: the borrow analysis already runs module-wide, and for a single-module program, the module is the whole program. Opus read that same framing for months and treated it as a constraint. Fable falsified it. its the same repo, same rules, same disproof history, same workflow. The model was the only variable that changed, and the outcome flipped. Is it possible that attempt 17 by Opus could have figured it out? sure. but there's 16 previous attempts that say otherwise. As fars as anecdotes go, that’s about as controlled as it gets.

▲

miroljub 31 minutes ago | parent | prev | next [-]

Zig is one of the worst targets for LLM generated code. It's nice that Fable has better support for Zig than Opus, but this anecdote is not representative as a general use case.

	▲	weatherlight 20 minutes ago \| parent [-]
		Slight misunderstanding. The LLM didn't generate Zig. My compiler does. The model's work was in the Rust compiler internals, specifically the borrow-inference and refcount-insertion passes (Perceus-style ownership analysis). Zig is just the compiler's codegen target, the same way another compiler might emit LLVM IR or C. The only Zig written by hand is the runtime: allocator code, RC primitives, list/string operations, etc. It's pure Zig, no libc, but it's small, stable, and was mostly untouched during this work. The model only touched Zig indirectly, by reading the compiler's generated output to verify whether a fix worked. For example: checking that a drop was emitted before a parameter-slot reassignment. That's reading machine-generated code for correctness, not "the LLM writes Zig." Both models handled that part fine. The 16 failures vs. 1 success were all in the ownership analysis, and that code is Rust.

▲

cmenge 2 hours ago | parent | prev [-]

Similar. I gave it a really hard task, basically messy code in a complex domain that was bug-ridden from a mess previously created half manually and half by Opus. It cleaned things up beautifully, both the backend and the frontend.

Maybe the prompt was particularly well-suited for the model (I instructed it to put on a mathematician's hat, look at the mathematical substructure of the problem, identify invariants and general laws and verify them, then plan how to remediate).

It wrote a ca. 800 line in-depth analysis (at times spawning over 130 research agents...) with remediation plans, prioritized them and then implemented them. One issue was that this document was frankly over my head. Both the language it used and the mathematical parts were very terse, and in parts it felt like a post-C2-vocab exercise. The prose was much harder to understand than the code snippets / data models. As a non-native speaker, it lost me on the prose part, and had to ask it for a less elaborate version to actually understand it.

It burned the session limit four times, but it turned a huge mess of proof-of-concepts with patchy glueing into a coherent, stable application.

I'm also on the Max plan using Claude Code, and I have the feeling that the harness is much more important than the consensus expectation.

▲ 2 hours ago | parent | prev | next [-]

[deleted]

▲ alasano 2 hours ago | parent | prev [-]

There's an often hard to express subjective experience you get with a new model, especially if you spend a lot of time trying out different ones.

I believe the people who feel like Fable is a big improvement, for me it's just much more reasonable and grounded.

It makes me realize how much of a try hard over optimizing planner GPT 5.5 can be. I've been fighting it often to simplify plans.

But no matter the model you can't trust them to actually deliver on very long tasks while maintaining quality. At least not without external orchestration and review.