| ▲ | data-ottawa 5 days ago |
| I don’t think you’re crazy; something is off in their models. As an example, I’ve been using an MCP tool to provide table schemas to Claude for months. There was a point in early August where it stopped recognizing the tool unless it was explicitly mentioned. Maybe that’s related to their degraded quality issue. This morning, after pulling the correct schema info, Sonnet started hallucinating columns (from Shopify’s API docs) and added them to my query. That’s a use case I’ve been handling daily for months, and in the last few weeks it has gone from consistent and low-supervision to flaky and low quality. I don’t know what’s going on; Sonnet has definitely felt worse, and the timeline matches their status page incident, but it’s definitely not resolved. Opus 4.1 also feels flaky, and it seems less consistent about recalling earlier prompt details than 4.0. I personally am frustrated that there’s no refund or anything after a month of degraded performance, and they’ve had a lot of downtime. |
|
| ▲ | reissbaker 5 days ago | parent | next [-] |
| FWIW I strongly recommend using some of the recent, good Chinese OSS models. I love GLM-4.5, and Kimi K2 0905 is quite good as well. |
| |
| ▲ | jimbo808 5 days ago | parent [-] | | I'd like to give these a try - what's your way of using them? I mostly use Claude because of Claude Code. Not sure what agentic coding tools people are using these days with OSS models. I'm not a big fan of manually uploading files into a web UI. | | |
| ▲ | reissbaker 5 days ago | parent | next [-] | | The most private way is to use them on your own machine; a Mac Studio maxed out to 512GB RAM can run GLM-4.5 at FP8 with fairly long context, for example. If you don't have the hardware to run it locally, let me shill my own company for a minute: Synthetic [1] has a $20/month subscription to most of the good open-weight coding LLMs, with higher rate limits than Claude's $20/month sub. And our $60/month sub has higher rate limits than the $200/month maxed-out version of the Claude Max plan. You can still use Claude Code by using LiteLLM or similar tools that convert Anthropic-style API requests to OpenAI-style API requests; once you have one of those running locally, you override the ANTHROPIC_BASE_URL env var to point to your locally-running proxy. We'll also be shipping an Anthropic-compatible API this week to work with Claude Code directly. Some other good agentic tools you could use instead include Cline, Roo Code, KiloCode, OpenCode, or Octofriend (the last of which we maintain). 1: https://synthetic.new | | |
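To make the ANTHROPIC_BASE_URL override above concrete, here is a minimal sketch, assuming a LiteLLM-style Anthropic-to-OpenAI translation proxy is already running locally; the proxy port and token values are placeholders, not anything Synthetic-specific.

```python
# Launch Claude Code against a locally running translation proxy (e.g. LiteLLM).
# The proxy URL and token below are placeholders; use whatever your proxy expects.
import os
import subprocess

env = dict(os.environ)
env["ANTHROPIC_BASE_URL"] = "http://localhost:4000"        # your local proxy
env["ANTHROPIC_AUTH_TOKEN"] = "proxy-api-key-placeholder"  # key the proxy expects

# Claude Code sends its Anthropic-style requests to the proxy, which translates
# them for the OpenAI-compatible backend serving GLM-4.5, Kimi K2, etc.
subprocess.run(["claude"], env=env)
```

The same override works whether the proxy is run from a script like this or simply exported in your shell before launching Claude Code.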
| ▲ | sheepscreek 5 days ago | parent | next [-] | | Very impressed with what you're doing. It's not immediately clear how the prompts and the data are used on the site. Your terms mention a 14-day API retention, but it's not clear whether that applies to Octo/the CLI agent and other forms of subscription usage (not through the UI). If you can find a way to secure the requests even during the 14-day period, or anonymize them while allowing the developers to do their job, you can have my money today. I think privacy/data security is the #1 concern for me, especially if the agents will be supporting me in all kinds of personal tasks. | | |
| ▲ | reissbaker 4 days ago | parent [-] | | FWIW the 14 day retention is just to cover accidental log statements being deployed — we don't intentionally store API request prompts or completions after processing at all. We'll probably change our stated policy to no-store since in practice that's what we do (and we get this feedback a lot!) |
| |
| ▲ | IgorPartola 5 days ago | parent | prev | next [-] | | Is there a possibility of my work leaking to others? Does your staff have the ability to view prompts and responses? Is tenancy shared with other users, or with entities other than your company? This looks really promising, since I have also been having all sorts of issues with Claude. | | |
| ▲ | reissbaker 4 days ago | parent [-] | | We never train on your prompts or completions, and for the API we don't store longer than 14 days (in fact, we don't ever intentionally store API prompts or completions at all; the 14-day policy was originally just to cover accidental log statements being deployed, and we'll probably change it to no-store since it's confusing to say 14 days when we actually don't intentionally store). For the web UI we do have to store, since otherwise we couldn't show you your message history. In terms of tenancy: we have our own dedicated VMs for our Kubernetes cluster via Azure, although I suspect a VM is not equivalent to an entire hardware node. We use Supabase for our Postgres DB, and Redis for ephemeral data; while we don't share access to those with any other company, we don't create a new DB for every user of our service, so there is user multitenancy there. Similarly, the same GPUs may serve many customers; otherwise we'd need to charge enormous amounts for inference. But the requests themselves aren't intermingled; i.e. if you make a request, it doesn't affect someone else's. |
| |
| ▲ | AlecSchueler 5 days ago | parent | prev [-] | | How do you store/view the data I send you? | | |
| ▲ | reissbaker 4 days ago | parent [-] | | For API prompts or completions, we don't store them after we return the completion to your prompt (our privacy policy allows us to store for a maximum of 14 days, just to cover accidental log statement deploys). For the web UI we store them in Postgres, since the web UI lets you view your message history and we wouldn't be able to serve that to you without storing it. | | |
|
| |
| ▲ | billyjobob 5 days ago | parent | prev | next [-] | | Both of those models have Anthropic API-compatible endpoints, so you just set an environment variable pointing to them before you run Claude Code. | |
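A rough sketch of that direct-endpoint route (no proxy needed), assuming the provider exposes an Anthropic-compatible API; the base URL and model id below are illustrative placeholders, so check the provider's docs for the real values.

```python
# Point Claude Code straight at a provider's Anthropic-compatible endpoint.
# URL, key, and model id are placeholders for illustration only.
import os
import subprocess

env = dict(os.environ)
env["ANTHROPIC_BASE_URL"] = "https://api.example-provider.com/anthropic"  # placeholder
env["ANTHROPIC_AUTH_TOKEN"] = "provider-api-key"                          # placeholder
env["ANTHROPIC_MODEL"] = "glm-4.5"  # assumes the provider accepts this model id

subprocess.run(["claude"], env=env)  # Claude Code now talks to the provider directly
```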
| ▲ | 5 days ago | parent | prev [-] | | [deleted] |
|
|
|
| ▲ | 8note 5 days ago | parent | prev | next [-] |
| I've been thinking it's that my company's MCP has blown up in context size, but using Claude without Claude Code I get context window overflows constantly now. Another possibility could be a system prompt change that made it too long? |
| |
| ▲ | data-ottawa 5 days ago | parent [-] | | I think that’s because of the Artifacts feature and how it works. For me, after a few revisions it uses a ton of tokens. As a baseline from a real conversation, 270 lines of SQL is ~2,500 tokens. Every language will be different; that’s just what I happen to have open. When Claude edits an artifact it seems to keep the revisions in the chat context, plus it’s doing multiple changes per revision. After 10 iterations on a 1k LOC artifact (~10k tokens) you’re at ~100k tokens. claude.ai has a 200k token window according to their docs (not sure if that’s accurate though). Depending on how Claude is doing those in-place edits, that could be the whole budget right there. |
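A quick back-of-envelope using the figures from that comment (roughly 2,500 tokens per 270 lines of SQL, a 1k-line artifact, and each revision assumed to re-enter the context in full); the exact ratios will vary by language and by how Claude actually performs its in-place edits.

```python
# Back-of-envelope token budget for repeated artifact revisions, using the
# rough figures from the comment above (ratios are language-dependent).
tokens_per_line = 2500 / 270               # ~9.3 tokens per line of SQL
artifact_tokens = 1_000 * tokens_per_line  # ~9,300 tokens for a 1k LOC artifact
revisions = 10                             # each revision assumed to stay in context
context_used = revisions * artifact_tokens
context_window = 200_000                   # claude.ai's documented context window

print(f"~{context_used:,.0f} of {context_window:,} tokens "
      f"({context_used / context_window:.0%} of the window)")
# -> roughly 93,000 tokens, i.e. close to half the window before counting
#    the system prompt, MCP tool schemas, or the conversation itself.
```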
|
|
| ▲ | dingnuts 5 days ago | parent | prev [-] |
| I have read so many anecdotes about so many models that "were great" and aren't now. I actually think this is psychological bias. It got a few things right early on, and that's what you remember. As time passes, the errors add up, until the memory doesn't match reality. The "new shiny" feeling goes away, and you perceive it for what it really is: a kind of shitty slot machine. > personally am frustrated that there's no refund or anything after a month of degraded performance lol, LMAO. A company operates a shitty slot machine at a loss and you're surprised they have "issues" that reduce your usage? I'm not paying for any of this shit until these companies figure out how to align incentives. If they make more by applying limits, or by charging me when the machine makes errors, that's good for them and bad for me! Why should I continue to pay to pull the slot machine lever? It's a waste of time and money. I'll be richer and more productive if I just write the code myself, and the result will be better too. |
| |
| ▲ | mordymoop 5 days ago | parent | next [-] | | I think you’re onto something, but it works the opposite way too. When you first start using a new model you are more forgiving, because almost by definition you were using a worse model before. You give it the sorts of problems the old model couldn’t do, and the new model can do them; you see only success, and the places where it fails, well, you can’t have it all. Then after using the new model for a few months you get used to it, you feel like you know what it should be able to do, and when it can’t do that, you’re annoyed. You feel like it got worse. But what happened is your expectations crept up. You’re now constantly riding it at 95% of its capabilities and hitting more edge cases where it messes up. You think you’re doing everything consistently, but you’re not; you’ve dramatically dialed up your expectations and demands relative to what you were doing months ago. I don’t mean “you,” I mean the royal “you”; this is what we all do. If you think your expectations haven’t risen, go back and look at your commits from six months ago and tell me I’m wrong. | |
| ▲ | adonese 5 days ago | parent | prev | next [-] | | Claude has been constantly terrible for the last couple of weeks. You must have seen this, but just in case: https://x.com/claudeai/status/1965208247302029728 | |
| ▲ | lacy_tinpot 5 days ago | parent | prev | next [-] | | Except this is a verifiable thing that actually is acknowledged and even tracked by people. | | | |
| ▲ | holoduke 5 days ago | parent | prev | next [-] | | You are saying that you write mock data and boilerplate code all yourself? I seriously don't believe that. LLMs are already much, much faster at these tasks. There is no going back there. | |
| ▲ | reactordev 5 days ago | parent | prev [-] | | This is equivalent to people reminiscing about WoW or EverQuest, saying gaming peaked back then… I think you’re right. I think it’s complete bias, with a little bit of “it does more tasks now,” so it might behave a bit differently to the same prompt. I also think you’re right that there’s an incentive to dumb it down so you pull the lever more. Just 2 more $1 spins and maybe you’ll hit the jackpot. Really it’s the enshittification of the SOTA for profits and glory. |
|