hirvi74 3 days ago

I alluded to this in another comment, but I have found 4o to be better than Sonnet in Swift, Obj-C, and AppleScript. In my experience, Claude is worse than useless with those three languages compared to GPT. For everything else, I'd say the differences haven't been too extreme. o1-preview absolutely smokes both in my experience, though it isn't hard for me to hit its rate limit either.

versteegen 3 days ago | parent | next [-]

Interesting. I haven't compared with 4o or GPT-4, but I found DeepSeek 2.5 seems to be better than Claude 3.5 Sonnet (new) at Julia. That said, I've seen both Claude and DeepSeek make the exact same sequence of errors (when asked about a certain bug and then given the same reply to their identical mistakes), which shows they don't fully understand the syntax for passing keyword arguments to Julia functions. It wasn't some tricky case, or even relevant to the bug, so they must share the same bad training data. But that's a digression; they're both great in general.
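
For anyone unfamiliar with the detail the models tripped over, here's a minimal sketch of Julia's keyword-argument syntax. The function and argument names are made up for illustration; the actual code from that bug isn't in this thread.

    # Keyword arguments are declared after a semicolon in the signature.
    function scale_values(xs::Vector{Float64}; factor::Float64 = 1.0, offset::Float64 = 0.0)
        return xs .* factor .+ offset
    end

    xs = [1.0, 2.0, 3.0]

    scale_values(xs; factor = 2.0)   # ok: keyword passed by name
    scale_values(xs, factor = 2.0)   # also ok: a comma works at the call site
    # scale_values(xs, 2.0)          # MethodError: keywords cannot be passed positionally

    # Keywords can also be splatted from a named tuple using `;`
    opts = (factor = 2.0, offset = 1.0)
    scale_values(xs; opts...)

The semicolon in the signature, and the fact that keyword arguments can't be passed positionally, are exactly the kind of small, consistent detail both models got wrong in the same way.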

hirvi74 a day ago | parent [-]

I can see what you mean about LLMs making the same mistakes; I've had that experience with both GPT and Claude as well.

However, I've found that GPT is better able to correct its mistakes, while Claude essentially just doubles down and keeps regurgitating permutations of the same mistake.

I can't tell you how many times I have had Claude spit out something like, "Use the Foobar.ToString() method to convert the value to a string." To which I reply with something like, "Foobar does not have a method 'ToString()'."

Then Claude will say something like, "You are right to point out that Foobar does not have a .ToString method! Try Foobar.ConvertToString()"

At that point, my frustration levels start to rapidly increase. Have you had experiences like that with Claude or DeepSeek? The main difference is that GPT tends to find the right answer after a bit of back-and-forth (or at least point me in a better direction).

rafaelmn 3 days ago | parent | prev [-]

Having used o1 and Claude through Copilot in VS Code, Claude is more accurate and faster. A good example is the "fix test" feature: it's almost always wrong with o1, while Claude is 50/50 I'd say, enough to be worth trying. Tried on TypeScript/Node and Python/Django codebases.

None of them are smart enough to figure out integration test failures with edge cases.