| ▲ | cainxinth 3 days ago |
| They certainly need the money. The Pro service has been running in limited mode all week due to being over capacity. It defaults to “concise” mode during high load, but Pro users can switch it back to “Full Response.” Even when you do that, I can tell the quality drops, and it fails with error messages more often. They don’t have enough compute to go around. |
|
| ▲ | jmathai 3 days ago | parent | next [-] |
| I’ve been using the API for a few weeks and routinely get 529 "overloaded" responses. I wasn’t sure if that’s always been the case, but it certainly makes it unsuitable for production workloads because the outages can last hours at a time. Hopefully they can add the capacity needed, because it’s a lot better than GPT-4o for my intended use case. |
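A minimal sketch of riding out those 529s with client-side retries, assuming the official Anthropic Python SDK (the model id, retry count, and delays are placeholders):

```python
import time

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def create_with_backoff(prompt: str, max_attempts: int = 6) -> str:
    """Retry only on 529 (overloaded), with exponential backoff between attempts."""
    delay = 2.0
    for attempt in range(max_attempts):
        try:
            msg = client.messages.create(
                model="claude-3-5-sonnet-20241022",  # placeholder model id
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return msg.content[0].text
        except anthropic.APIStatusError as e:
            if e.status_code != 529 or attempt == max_attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("unreachable")
```

The SDK already does a couple of retries of its own; something like this only helps when the overload window outlasts those built-in retries.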
| |
| ▲ | rmbyrro 3 days ago | parent | next [-] | | Sonnet is better than 4o for virtually all use cases. The only reason I still use OpenAI's API and chatbot service is o1-preview. o1 is like magic. Everything Sonnet and 4o do poorly, o1 solves like it's a piece of cake. Architecting, bug fixing, planning, refactoring: o1 has never let me down on any 'hard' task. A nice combo is having o1 guide Sonnet. I ask o1 to come up with a solution and explanation, then simply feed its response into Sonnet to execute. That running on Aider really feels like futuristic stuff. | | |
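For anyone curious what the "o1 plans, Sonnet executes" combo looks like outside of Aider, here is a rough sketch assuming the OpenAI and Anthropic Python SDKs (model ids, prompts, and the example task are placeholders):

```python
from openai import OpenAI
import anthropic

planner = OpenAI()                # reads OPENAI_API_KEY
executor = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

task = "Refactor the payment module so retries are idempotent."

# Step 1: ask o1 for a plan and an explanation of the approach.
plan = planner.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": f"Propose a step-by-step solution:\n\n{task}"}],
).choices[0].message.content

# Step 2: hand the plan to Sonnet and have it produce the actual code.
code = executor.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model id
    max_tokens=4096,
    messages=[{"role": "user", "content": f"Implement this plan, code only:\n\n{plan}"}],
).content[0].text

print(code)
```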
| ▲ | gcanko 3 days ago | parent | next [-] | | Exactly my experience as well. Sonnet can help me in 90% of cases, but there are some specific edge cases where it struggles that o1 can solve in an instant. I kinda hate having to pay for both of them. | | |
| ▲ | andresgottlieb 3 days ago | parent | next [-] | | You should check out Librechat. You can connect different models to it and, instead of paying for both subscriptions, just buy credits for each API. | | |
| ▲ | cruffle_duffle 3 days ago | parent | next [-] | | > just buy credits for each API I’ve always considered doing that but do you come out ahead cost wise? | | |
| ▲ | esperent 3 days ago | parent [-] | | I've been using Claude 3.5 over API for about 4 months on $100 of credit. I use it fairly extensively, on mobile and my laptop, and I expected to run out of credit ages ago. However, I am careful to keep chats fairly short as it's long chats that eat up the credit. So I'd say it depends. For my use case it's about even but the API provides better functionality. |
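The "long chats eat credit" effect is mostly because every turn re-sends the entire history as input tokens. A back-of-envelope sketch (the per-token prices are assumptions, roughly Claude 3.5 Sonnet's published API pricing at the time; check current numbers):

```python
# Assumed prices in dollars per million tokens (input, output).
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 3.00, 15.00

def chat_cost(turns: int, user_tokens: int = 300, reply_tokens: int = 600) -> float:
    """Cost of a chat where each turn re-sends the full history as input."""
    cost, history = 0.0, 0
    for _ in range(turns):
        history += user_tokens
        cost += history / 1e6 * INPUT_PER_MTOK        # full history billed as input
        cost += reply_tokens / 1e6 * OUTPUT_PER_MTOK
        history += reply_tokens                        # reply joins the next turn's input
    return cost

print(f"5 turns:  ${chat_cost(5):.2f}")   # pennies
print(f"50 turns: ${chat_cost(50):.2f}")  # input cost grows roughly quadratically with turns
```

Which is why keeping chats short (or restarting them) stretches a block of credit a long way.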
| |
| ▲ | joseda-hg 3 days ago | parent | prev [-] | | How does the cost compare? |
| |
| ▲ | rjh29 3 days ago | parent | prev [-] | | I use Tabnine, it lets you switch models. |
| |
| ▲ | hirvi74 3 days ago | parent | prev [-] | | I alluded to this in another comment, but I have found 4o to be better than Sonnet in Swift, Obj-C, and AppleScript. In my experience, Claude is worse than useless with those three languages when compared to GPT. For everything else, I'd say the differences haven't been too extreme. Though o1-preview absolutely smokes both in my experience too, it isn't hard for me to hit its rate limit either. | | |
| ▲ | versteegen 3 days ago | parent | next [-] | | Interesting. I haven't compared with 4o or GPT-4, but I found DeepSeek 2.5 seems to be better than Claude 3.5 Sonnet (new) at Julia. Although I've seen both Claude and DeepSeek make the exact same sequence of errors (when asked about a certain bug and then given the same reply to their identical mistakes), which shows they don't fully understand the syntax for passing keyword arguments to Julia functions... wow. It was not some kind of tricky case or even relevant to the bug. They must share the same bad training data. Oops, that's a digression. Actually they're both great in general. | | |
| ▲ | hirvi74 a day ago | parent [-] | | I can see what you mean by LLMs making the same mistakes. I had that experience with both GPT and Claude, as well. However, I found that GPT was better able to correct its mistakes while Claude essentially just doubles down and keeps regurgitating permutations of the same mistakes. I can't tell you how many times I have had Claude spit out something like, "Use the Foobar.ToString() method to convert the value to a string." To which I reply, something like, "Foobar does not have a method 'ToString()'." Then Claude will say something like, "You are right to point out that Foobar does not have a .ToString method! Try Foobar.ConvertToString()" At that point, my frustration levels start to rapidly increase. Have you had experiences like that with Claude or DeepSeek? The main difference with GPT is that GPT tends to find me the right answer after a bit of back-and-forth (or at least point me in a better direction). |
| |
| ▲ | rafaelmn 3 days ago | parent | prev [-] | | Having used o1 and Claude through Copilot in VS Code, Claude is more accurate and faster. A good example is the "fix test" feature: it's almost always wrong with o1, while Claude is 50/50 I'd say - enough to be worth trying. Tried on TypeScript/Node and Python/Django codebases. None of them are smart enough to figure out integration test failures with edge cases. |
|
| |
| ▲ | AlexAndScripts 3 days ago | parent | prev [-] | | Amazon Bedrock supports Claude 3.5, and you can use inference profiles to split it across multiple regions. It's also the same price. For my use case I use a hybrid of the two, simulating standard rate limits and doing backoff on 529s. It's pretty reliable that way. Just beware that the European AWS regions have been overloaded for about a month. I had to switch to the American ones. |
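A hedged sketch of that hybrid, assuming the Anthropic Python SDK plus boto3's Bedrock Converse API (the model and inference-profile ids are placeholders; use whatever your account exposes):

```python
import boto3
import anthropic

direct = anthropic.Anthropic()
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def ask(prompt: str) -> str:
    """Prefer the Anthropic API; fall back to Bedrock when it returns 529 (overloaded)."""
    try:
        msg = direct.messages.create(
            model="claude-3-5-sonnet-20241022",  # placeholder model id
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    except anthropic.APIStatusError as e:
        if e.status_code != 529:
            raise
        resp = bedrock.converse(
            modelId="us.anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder inference profile
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 1024},
        )
        return resp["output"]["message"]["content"][0]["text"]
```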
|
|
| ▲ | shmatt 3 days ago | parent | prev | next [-] |
| In the beginning I was agitated by Concise and would switch it back manually. But then I actually tried it: I asked for SQL and it gave me back SQL and 1-2 sentences at most. Regular mode gives SQL and entire paragraphs before and after it. Not even helpful paragraphs, just rambling about nothing and suggesting what my next prompt should be. Now I love Concise mode: it doesn't skimp on the meat, just the fluff. My problem now is that Concise only shows up during high load. Right now I can't choose it even if I wanted to. |
| |
| ▲ | cruffle_duffle 3 days ago | parent | next [-] | | Totally agree. I wish there was a similar option on ChatGPT. These things are seemingly trained to absolutely love blathering on. And all that blathering eats into their precious context window with tons of repetition and little new information. | | |
| ▲ | therein 3 days ago | parent [-] | | Oh you are asking for a 2 line change? Here is the whole file we have been working on with a preamble and closing remarks, enjoy checking to see if I actually made the change I am referring to in my closing remarks and my condolences if our files have diverged. | | |
| ▲ | cruffle_duffle 3 days ago | parent [-] | | You know, the craziest thing I’ve seen ChatGPT do is claim to have made a change to my Terraform code, acting all “ohh here are some changes to reflect all the things you commented on”, when all it did was change the comments. It’s very bizarre when it rewrites the exact same code a second or third time and for some reason decides to change the comments. The comments will have the same meaning but slightly different wording. I think this behavior is an interesting window into how large language models work. For whatever reason, despite unchanging repetition, the context window changed just enough that it output a statistically similar comment at that juncture. All the rest of the code it wrote out was statistically pointing the exact same way, but there was just enough variance in how to write the comment that it went down a different path in its neural network. And then when it was done with that path it went right back down the “straight line” for the code part. Pretty wild, these things are. | | |
| ▲ | pertymcpert 3 days ago | parent | next [-] | | I don't think the context window has to change for that to happen. LLMs don't just pick the most likely next token; the next token is sampled from the distribution of possible tokens, so on repeat runs you can get different results. | |
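A toy illustration of that point, unrelated to any particular model's internals: with sampling, the same scores can produce a differently worded comment on each run, while greedy decoding would always emit the same one.

```python
import numpy as np

rng = np.random.default_rng()
vocab = ["# Fetch the user record", "# Load the user record", "# Get the user row"]
logits = np.array([2.1, 2.0, 1.3])  # model scores for three near-equivalent comments

def sample(temperature: float = 0.8) -> str:
    """Sample from the softmax distribution instead of taking the argmax."""
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return rng.choice(vocab, p=probs)

print([sample() for _ in range(5)])   # wording varies from run to run
print(vocab[int(np.argmax(logits))])  # greedy decoding would always pick this one
```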
| ▲ | dimitri-vs 3 days ago | parent | prev [-] | | Probably an overcorrection from when people were complaining very vocally about ChatGPT being "lazy" and not providing all the code. FWIW, I've seen Claude do the same thing: when asked to debug something it obviously did not know how to fix, it would just repeatedly refactor the same sections of code and make changes to comments. | | |
| ▲ | cruffle_duffle 3 days ago | parent [-] | | I feel like “all the code” and “only the changes” need to be an actual per-chat option. Sometimes you want the changes, sometimes you want all the code, and it is annoying because it always seems to decide it’s gonna do the opposite of what you wanted… meaning another correction and thus wasted tokens and context. And even worse, it pollutes your scrollback with noise. |
|
|
|
| |
| ▲ | nmfisher 3 days ago | parent | prev [-] | | Agree, concise mode is much better for code. I don’t need you to restate the request or summarize what you did. Just give me the damn code. | | |
| ▲ | johnisgood 3 days ago | parent [-] | | An alternative to Concise mode is to add that (or those) sentence(s) yourself. I personally tell it not to give me the code at all at times, and at other times I want the code only, and so forth. You could also add these sentences as project instructions, for example. |
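For API use, the same idea can live in the system prompt, which is roughly what a Project's custom instructions do for the chat UI. A minimal sketch assuming the Anthropic Python SDK (model id and wording are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

msg = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model id
    max_tokens=1024,
    # Standing instruction playing the role of "Concise mode" / project instructions.
    system="Reply with code only. Do not restate the request or summarize what you did.",
    messages=[{"role": "user", "content": "Write a SQL query for monthly active users."}],
)
print(msg.content[0].text)
```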
|
|
|
| ▲ | el_benhameen 3 days ago | parent | prev | next [-] |
| Interesting. I also find it frustrating to be rate limited/have responses fail when I’m paying for the product, but I’ve actually found that the “concise” mode answers have less fluff and make for faster back and forth. I’ve once or twice looked for the concise mode selector when the load wasn’t high. |
| |
| ▲ | rvz 3 days ago | parent | next [-] | | All that money and talk of "scale", and yet not only is it slow, it costs billions a year to run at normal load and struggles at high load. This is essentially Google-level load and they can't handle it. | |
| ▲ | johnisgood 3 days ago | parent | prev [-] | | Agreed, I was surprised by it when I first subscribed to Pro and had a not-that-long chat with it. |
|
|
| ▲ | moffkalast 3 days ago | parent | prev | next [-] |
| Their shitty UI is also not doing them any infrastructure favors: during load it'll straight up write 90% of an answer, then suddenly cancel and delete the whole thing, so you have to start over and waste time generating the entire answer again instead of just continuing for a few more sentences. It's like a DDoS attack where everyone gets preempted and immediately starts refreshing. |
| |
| ▲ | wis 3 days ago | parent [-] | | Yes! It's infuriating when Claude stops generating mid-response and deletes the whole thread/conversation. Not only do you lose what it has generated so far, which would've been at least somewhat useful, but you also lose the prompt you wrote, which could've taken some effort. |
|
|
| ▲ | cma 3 days ago | parent | prev | next [-] |
| > But I can tell the quality drops even when you do that Dario said in a recent interview that they never switch to a lower-quality model (i.e., one with different parameters) during times of load. But he left room for interpretation on whether they could still use quantization or sparsity. Additionally, his answer wasn't clear enough to know whether they use a lower depth of beam search or other cheaper sampling techniques. He said the only time you might get a different model itself is when they are A/B testing just before a newly announced release. And I think he clarified this all applied to the web UI and not just the API. (edit: I'm rate limited on HN, here's the source in reply to the comment below: https://www.youtube.com/watch?v=ugvHCXCOmm4&t=42m19s ) |
| |
|
| ▲ | nowahe 3 days ago | parent | prev | next [-] |
| I've had it refuse to generate a long text response (I was trying to condense 300 kB of documentation down to 20-30 kB so I could put it in the project's context), and every time I asked it replied "How should I structure the results?", "Shall I go ahead with writing the artifacts now?", etc. I don't think it was even during the over-capacity event, and I'm a Pro user. |
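One workaround for that kind of job is to do the condensing over the API in chunks rather than asking for one giant reply. A hedged sketch, assuming the Anthropic Python SDK (model id, chunk size, filename, and prompts are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

def condense(doc: str, chunk_chars: int = 15_000) -> str:
    """Condense a large document chunk by chunk, then join the summaries."""
    chunks = [doc[i:i + chunk_chars] for i in range(0, len(doc), chunk_chars)]
    parts = []
    for chunk in chunks:
        msg = client.messages.create(
            model="claude-3-5-sonnet-20241022",  # placeholder model id
            max_tokens=2048,
            system="Condense this documentation. Keep APIs, parameters and caveats; drop the prose.",
            messages=[{"role": "user", "content": chunk}],
        )
        parts.append(msg.content[0].text)
    return "\n\n".join(parts)

with open("docs.md") as f:  # placeholder filename
    print(condense(f.read()))
```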
| |
| ▲ | Filligree 3 days ago | parent [-] | | Hate to be that guy, but did you tell it up front not to ask? And, of course, in a long-running conversation it's important not to leave such questions in the context. | | |
| ▲ | nowahe 3 days ago | parent [-] | | The weird thing is that when I told it to distill it into a much smaller message, it had no problem outputting it without any follow-up questions. But when I edited my message to ask it to generate a larger response, I got stuck in a loop of it asking if I was really sure or telling me `I apologize, but I noticed this request would result in a very large response.` It strikes me as odd, because quite a few times it has generated a response over multiple messages (since it was hitting its max message length) without any second-guessing or issue. |
|
|
|
| ▲ | neya 3 days ago | parent | prev | next [-] |
| I am a paying customer with credits and the API endpoints rate-limited me to the point where it's actually unusable as a coding assistant. I use a VS Code extension and it just bailed out in the middle of a migration. I had to revert everything it changed and that was not a pleasant experience, sadly. |
| |
| ▲ | square_usual 3 days ago | parent | next [-] | | When working with AI coding tools, "commit early, commit often" becomes essential advice. I like that aider makes every change its own commit. I can always manicure the commit history later; I'd rather not lose anything when the AI can make destructive changes to code. | | | |
| ▲ | teaearlgraycold 3 days ago | parent | prev | next [-] | | Why not just continue the migration manually? | |
| ▲ | htrp 3 days ago | parent | prev | next [-] | | Control your own inference endpoints. | | |
| ▲ | its_down_again 3 days ago | parent [-] | | Could you explain more on how to do this? e.g if I am using the Claude API in my service, how would you suggest I go about setting up and controlling my own inference endpoint? | | |
| |
| ▲ | datavirtue 3 days ago | parent | prev [-] | | You aren't running against a local LLM? | | |
| ▲ | TeMPOraL 3 days ago | parent | next [-] | | That's like asking if they aren't paying the neighborhood drunk with wine bottles for doing house remodeling, instead of hiring a renovation crew. | | |
| ▲ | rybosome 3 days ago | parent | next [-] | | That’s funny, but open weight, local models are pretty usable depending on the task. | | |
| ▲ | TeMPOraL 3 days ago | parent [-] | | You're right, but that's also subject to compute costs and time value of money. The calculus is different for companies trying to exploit language models in some way, and different for individuals like me who have to feed the family before splurging for a new GPU, or setting up servers in the cloud, when I can get better value by paying OpenAI or Claude a few dollars and use their SOTA models until those dollars run out. FWIW, I am a strong supporter of local models, and play with them often. It's just that for practical use, the models I can run locally (RTX 4070 TI) mostly suck, and the models I could run in the cloud don't seem worth the effort (and cost). | | |
| ▲ | alwayslikethis 3 days ago | parent | next [-] | | For the money for a 4070ti, you could have bought a 3090, which although less efficient, can run bigger models like Qwen2.5 32b coder. Apparently it performs quite well for code | |
| ▲ | rjh29 3 days ago | parent | prev [-] | | I guess the cost model doesn't work because you're buying a GPU that you use for about 0.1% of the day. |
|
| |
| ▲ | neumann 3 days ago | parent | prev [-] | | That's what my grandma did in the village in Hungary. But with schnapps. And the drunk was also the professional renovation crew. |
| |
| ▲ | rty32 3 days ago | parent | prev [-] | | Not everyone has a 4090 or M4 Max at home. |
|
|
|
| ▲ | 0xDEAFBEAD 3 days ago | parent | prev | next [-] |
| More evidence that people should use wrappers like OpenRouter and litellm by default? (Makes it easy to change your choice of LLMs, if one is experiencing problems) |
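A minimal sketch of that idea with litellm (model strings are assumptions; OpenRouter works similarly by pointing an OpenAI-compatible client at its base URL):

```python
from litellm import completion

def ask(prompt: str) -> str:
    """Try Claude first; fall back to another provider if the call errors out."""
    messages = [{"role": "user", "content": prompt}]
    try:
        resp = completion(model="anthropic/claude-3-5-sonnet-20241022", messages=messages)
    except Exception:
        # Claude overloaded or erroring: same messages, different provider.
        resp = completion(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

print(ask("Summarize the trade-offs of exponential backoff in two sentences."))
```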
|
| ▲ | llm_trw 3 days ago | parent | prev | next [-] |
| Neither does OAI. Their service has been struggling for more than a week now. I guess everyone is scrambling after the new Qwen models dropped and matched the current state of the art with open weights. |
|
| ▲ | sbuttgereit 3 days ago | parent | prev | next [-] |
| Hmmm... I wonder if this is why some of the results I've gotten over the past few days have been pretty bad. It's easy to chalk poor results up to LLM quality variance from prompt to prompt vs. something like this, where the quality is actively degraded without notification. I can't say this is in fact what I'm experiencing, but it was noticeable enough that I'm going to check. |
| |
| ▲ | jmathai 3 days ago | parent | next [-] | | Never occurred to me that the response changes based on load. I’ve definitely noticed it seems smarter at times. Makes evaluating results nearly impossible. | | |
| ▲ | kridsdale1 3 days ago | parent [-] | | My human responses degrade when I’m heavily loaded and low on resources, too. | | |
| ▲ | TeMPOraL 3 days ago | parent [-] | | Unrelated. Inference doesn't run in sync with the wall clock; it takes whatever it takes. The issue is more like telling a room of support workers they are free to half-ass the work if there's too many calls, so they don't reject any until even half-assing doesn't lighten the load enough. |
|
| |
| ▲ | Seattle3503 3 days ago | parent | prev | next [-] | | This is one reason closed models suck. You can't tell if the bad responses are due to something you are doing, or if the company you are paying to generate the responses is cutting corners and looking for efficiencies, e.g. by reducing the number of bits used for the weights. It is a black box. | |
| ▲ | mirsadm 3 days ago | parent [-] | | To be fair even if you did know it would still behave the same way. | | |
| ▲ | TeMPOraL 3 days ago | parent [-] | | Still, knowing is what makes the difference between gaslighting and merely subpar/inconsistent service. |
|
| |
| ▲ | baxtr 3 days ago | parent | prev | next [-] | | Recently I started wondering about the quality of ChatGPT. On a couple of occasions I was like: "hmm, I’m not impressed at all by this answer, I'd better google it myself!" Maybe it’s the same effect over there as well. | |
| ▲ | dave84 3 days ago | parent [-] | | Recently I asked 4o to ‘try again’ when it failed to respond fully, and it started telling me about some song called Try Again. It seems to lose context a lot in conversations now. |
| |
| ▲ | 55555 3 days ago | parent | prev [-] | | Same experience here. |
|
|
| ▲ | 3 days ago | parent | prev [-] |
| [deleted] |