Remix.run Logo
sho 3 hours ago

I am no-where near as concerned by this as I was a year ago, when I was expecting the axe to fall at any moment before the Chinese labs achieved some sort of escape velocity. I now think it's too late, all the cats are out of all the bags, there's no moat except maybe a temporal one of a few months, the genie is out of the bottle.

There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough. Deepseek 4 and Kimi 2.5 are not quite Claude 4.5/GPT5.5 but there's no fundamental principle missing - they are strong evidence that there's no real advantage the "frontier" labs possess that isn't related to scale, which they will gain in time (if they even need to). The RL post-training techniques that work are widely known and easily copied. All Deepseek is really lacking is data, which they're getting - and the harder Anthropic/the USG makes it to access claude in china, the more of that precious data they'll get!

I used to sort of entertain the "fast take-off breakaway" scenario as being plausible but not really anymore. The only genuine moat the frontier labs have is their product take-up, which isn't nothing, far from it, but it's not some unbreakable technological wall. Too late guys - it might have been too late for quite some time.

gpt5 3 hours ago | parent | next [-]

I wish it was true. I would gladly use a GPT 5.2 high model equivalent for coding (6 months old) if it was offered cheaper by Deepseek or Kimi. And I'm sure that's an extremely prevalent opinion by the millions of Claude and Codex users who are bothered by the costs.

However, they just don't perform that well in practice. That's the real issue. You can actually see it when you move away from open benchmarks. Deep seek 3.2 is 4% on Arc-AGI 2 [1], while GPT 5.2 high is 52% and GPT 5.5 pro high is 84.6%. That's the real reason why nobody is using these models for serious work. It's incredibly frustrating.

In addition, I already feel the pain myself on the model restriction. I'll asking my codex 5.5 agent to crawl a website - BOOM, cybersecurity warning on my account. I'll ask it to fix SSH on my local network - another warning. I'm worried about the day my account would be randomly banned and I cannot create a new one. OpenAI already asks you to perform full identification in order to eliminate these warnings - probably exactly for that - so that if they ban you, it's permanent.

[1] https://arcprize.org/leaderboard

usernametaken29 20 minutes ago | parent | next [-]

I worked extensively on ARC AGI before and one thing is SURE as hell. OpenAI and Gemini in particular use this as marketing material. You can correlate the benchmark release with stock price increase. They feed synthetic datasets of ARC into their models to boost the numbers. There is no doubt in my mind Gemini is no better than DeepSeek other than being specifically fine tuned for ARC AGI. Heck, they even say so and they say they have paid annotations for ARC. Again, economic incentives. In terms of whether these models are actually better at the benchmarks, likely not. See ARC 3, where the gap is diminishingly small.

gpt5 7 minutes ago | parent [-]

ARC-AGI isn't perfect, but it helps demonstrates the gap. I'm sure all companies optimize their models for this benchmark given its dominance.

sho 2 hours ago | parent | prev | next [-]

I 100% agree with you, but I've been convinced over the last year that it's a time and scale issue, not anything fundamental.

The Chinese models right now are in a weird spot. Compared to the frontiers, both their pre and post training is woeful - tiny, resource constrained in every dimension including human, slow. I'd compare it to OpenAI 5 years ago except I think even then OpenAI had way more!

But they "cheat" quite a lot in distillation and very benchmark-focussed RL and that's where you get this superficial quality in the leaderboards that doesn't match up when you go off-script. Arc is a great example in that it really belies an "inferior soul" at the heart of it all.

What gives me great hope though is that those same scaling laws that Altman and others have been hyping forever will absolutely kick in for the Chinese labs just as they did for the US ones, and I don't think anything can stop that process now. So they will catch up. It won't be tomorrow, but it's not going to be 10 years either. 3-5 would be my reasonably educated guess.

And the final risk, that China itself might try to restrict availability of the tsunami of GPU or other AI hardware it will inevitably produce - well, I just can't really imagine a country that has been configuring itself for the last 40 years as a single purpose export machine deciding that actually, no, it doesn't want to export something.

About the model restrictions - absolutely. I've been trying to do security research on my own software and the frontier models immediately get suspicious. I've been playing with the local ones much more this year basically because of this. They have deficiencies, for sure - they feel very "hollow" compared to the major labs. But I've talked to a lot of people, and the consensus is pretty clear - just a matter of time.

flir 13 minutes ago | parent [-]

Just an observation: constraints often result in creative solutions. I wouldn't be surprised if a smaller lab makes a big breakthrough because they have to.

applfanboysbgon an hour ago | parent | prev | next [-]

> Deep seek 3.2 is 4% on Arc-AGI 2

Why are you bringing up an outdated Chinese model from 6 months ago to compare to a US model from 6 months ago? The outdated Chinese model will have performance from ~12 months ago, obviously. But today's Chinese model DeepSeek 4 has performance not far from the US model 6 months ago; 46% compared to 52% from 5.2.

gpt5 11 minutes ago | parent [-]

Because Deepseek 4.0 is not yet there, but the jump isn't expected to be large. Kimi 2.5 is there and is also scoring low.

pjerem a minute ago | parent [-]

Hum, I'm using it with my Ollama Cloud subscription since the last two weeks and I love it. Never reached the 5 hours usage limits of the $20 plan (on side projects) where I would reach it sometimes in ONE prompt with Opus.

ageitgey an hour ago | parent | prev | next [-]

Have you tried the latest DeepSeek v4 Pro inside of the Claude Code harness? It's not listed in that site.

It definitely 'feels like' it is as good as Claude for many regular web app coding tasks (though I don't have real benchmarks). And it is comically cheap.

I'm not suggesting it is better than the latest Claude or codex models, but it seems 'good enough' for a lot of use cases in my limited real world testing.

PAndreew 2 minutes ago | parent | next [-]

I'm starting to feel like a parrot, but people seem to forget that software engineering is actually a very narrow slice of the white collar pie. You don't need a mega-model which can reason about 100 000 lines of code when you want to create a nice PPT (which consumed literally hours of your life before) to impress your boss. SOTA models will probably be used for frontier research, complex coding tasks, large scale data analysis, etc. And the average Joe shall be able to buy a pre-configured box with a plug-and-play harness and run medium models air-gapped. Or use such models through cloud APIs dirt cheap if privacy is not a concern.

omnimus an hour ago | parent | prev [-]

Also so many developers i know use LLMs for one shoting isolated problems, explainers, discussions and planning. For these even Kimi is pretty great.

I don't think every dev will be comfortable just releasing claude on their project.

otabdeveloper4 2 hours ago | parent | prev [-]

And yet Claude six months ago was amazing and good enough for you.

This shows that AI cloud consumption is just a conspicuous consumption status symbol, nobody knows why they need cloud AI or what problem they are even solving.

hbarka 2 hours ago | parent | prev | next [-]

Harness engineering is a moat. There’s user loyalty and reliance on the chassis that Claude is on, for example, just like there’s more market share by MacOS+WindowsOS over Linux Open Source.

kasey_junk 12 minutes ago | parent | next [-]

I regularly switch between codex and Claude in the same sessions. I’d throw in other models if I could.

Data governance and enterprise sales is a moat. The harnesses aren’t.

ElFitz an hour ago | parent | prev [-]

I thought so too.

But 1) people use other models with that same harness. 2) I moved on from Claude Code and all the features I cared for up and running in less than a couple days. Without even looking for available plugins or extensions.

nojs 25 minutes ago | parent | prev | next [-]

What about access to GPUs and memory? This is becoming a pretty major bottleneck.

asdff 14 minutes ago | parent [-]

Everyone is expecting them to invade Taiwan, but why not merely extort Taiwan?

ElFitz an hour ago | parent | prev | next [-]

> The only genuine moat the frontier labs have is their product take-up

And even then, their is no stickiness. For most use cases there isn’t much value in one frontier model over the other.

Just have to look at the people flocking from one to the other for whatever reason.

baq 40 minutes ago | parent | next [-]

I’m flocking from GPT to opus every week for the past 3 months and always come back.

The point isn’t that gpt is better, it’s that it is so much better for my work it isn’t even sticky, it’s reinforced concrete. I use opus 1% of the time because it writes better and it’s sticky there.

Yes I’ll switch approximately immediately if opus or Gemini (which I use more than opus!) is better for what I do, but at this point frontier model tokens are not fungible.

ElFitz 28 minutes ago | parent [-]

There will always be dataset and training quirks, and the provider’s own biases and focus, granting one model an edge over the others in some specific domain.

baq 27 minutes ago | parent [-]

Yup and that’s where the moats are.

dotancohen an hour ago | parent | prev [-]

The large AI houses arguably ensure that model switching be a natural action for their clients, by switching the default model of their flagship offerings every few months. Such is the price of progress.

BrtByte 3 hours ago | parent | prev | next [-]

I agree the genie is out of the bottle technologically. I'm less convinced that means access stops being politically and economically important. The bottle may be gone but the best lamps are still expensive

trollbridge 3 hours ago | parent | next [-]

But a “good enough” lamp just got a lot cheaper. The cost of tokens on DeepSeek V4 Pro is so low I don’t even think about and currently am trying to figure out useful things for as many agents simultaneously running as I can. What would have cost $150 less than a year ago now costs 35¢.

Likewise Qwen 3.6 absolutely blows me away and that’s on a 35b 6-bit model on a local 5090. Same thing, busy trying to find stuff to do to keep it busy 24/7.

I can still find some niches for Opus 4.7 but being able to attack problems and not worry about consumption is a game changer.

jorvi 3 hours ago | parent | prev [-]

Virtually no one is going to pay for the best performing lamp if the next best lamp does 90% as good for an order of magnitude cheaper.

I will say, as pointed out by others, DeepSeek and other Chinese providers still lack a bit in the tooling that Claude has, but they'll get there.

baq 38 minutes ago | parent | next [-]

And yet it seems that 90% are happily paying for the marginal 10% capability and saturate datacenters.

lugu 6 minutes ago | parent [-]

That is called marketing.

BrtByte 3 hours ago | parent | prev [-]

If the second-best lamp is 90% as good and 10x cheaper, most people will use the second-best lamp...

34 minutes ago | parent | next [-]
[deleted]
avazhi 3 hours ago | parent | prev [-]

That’s what he said?

shevy-java 3 hours ago | parent | prev [-]

> There is no secret sauce the US labs have that the Chinese ones don't, or won't have soon enough

This is not just about mainland China though. The current US government is extremely selfish and self-centered. Other countries really need to consider for their own long-term situation here.