pvalue005 6 hours ago

I suspect this was released by Anthropic as a DDoS attack on other AI companies. I prompted 'how do we solve this challenge?' into Gemini CLI in a cloned repo and it's been running non-stop for 20 minutes :)

bjackman 4 hours ago | parent | next

Lately with Gemini CLI / Jules it doesn't seem like time spent is a good proxy for difficulty. It has a big problem with getting into loops of "I am preparing the response for the user. I am done. I will output the answer. I am confident. Etc etc".

I see this directly in Gemini CLI, where the harness detects loops and bails out of the reasoning. But I've also occasionally seen it take 15m+ to do trivial stuff, and I suspect that's a symptom of a similar issue.
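(For the curious, a minimal sketch of the kind of check a harness could run to catch this, assuming it just looks for a repeating tail in the model's output. The real Gemini CLI loop detector isn't public, so the name and threshold here are made up:)

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical loop check: true if the output ends in the same
     * chunk repeated MIN_REPEATS times, e.g. "I am done. I am done.
     * I am done." The actual Gemini CLI heuristic is unknown. */
    #define MIN_REPEATS 3

    static bool looks_like_loop(const char *out, size_t chunk_len) {
        size_t len = strlen(out);
        if (chunk_len == 0 || len < chunk_len * MIN_REPEATS)
            return false;
        const char *tail = out + len - chunk_len;  /* last chunk */
        for (size_t r = 1; r < MIN_REPEATS; r++)
            if (memcmp(tail - r * chunk_len, tail, chunk_len) != 0)
                return false;                      /* no repetition */
        return true;                               /* looks stuck */
    }

A caller would slide chunk_len over a few plausible window sizes; once this fires, the harness aborts the turn instead of burning more tokens.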

mixel an hour ago | parent | next

I saw this too. Sometimes it "thinks" inside the actual output, and it's much more likely to end up in the loop of "I am ready to answer" while it's doing that.

sva_ an hour ago | parent | prev

I feel like sometimes it just loops those messages when it isn't actually generating new tokens. But I might be wrong.

bjackman 39 minutes ago | parent

There are some other failure modes that all feel vaguely related, and they probably help with building a hypothesis about what's going wrong:

Sometimes Gemini tools will just randomly stop and pass the buck back to you. The last thing will be like "I will read the <blah> code to understand <blah>" and then it waits for another prompt. So I just type "continue" and it starts work again.

And sometimes it will spit out the internal CoT directly instead of the text that's actually supposed to be user-visible. So I'll see a bunch of paragraphs starting with "Wait, " as it works things out, and at the end it says "I understand the issue" or whatever, then waits for a prompt. I type "summarise" and it gives me the bit I actually wanted.

It feels like all these things are related and probably have to do with the higher-level orchestration of the product. Like I assume there are a whole bunch of models feeding data back and forth to each other to form the user-visible behaviour, and something is wrong at that level.

bird0861 5 hours ago | parent | prev

Which Gemini model did you use? My experience since the launch of G3 Pro has been that it absolutely sucks dog crap through a coffee straw.

pvalue005 4 hours ago | parent | next

/model: Auto (Gemini 3), i.e. "Let Gemini CLI decide the best model for the task": gemini-3-pro, gemini-3-flash

After ~40 minutes, it got to:

The final result is 2799 cycles, a 52x speedup over the baseline. I successfully implemented Register Residency, Loop Unrolling, and optimized Index Updates to achieve this, passing all correctness and baseline speedup tests. While I didn't beat the Opus benchmarks due to the complexity of Broadcast Optimization hazards, the performance gain is substantial.

It's impressive, as I definitely wouldn't be able to do what it did. I don't know most of the optimization techniques it listed there.
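(For reference, since the thread doesn't show the actual solution: "register residency" and "loop unrolling" are classic low-level tricks, and they compose. A generic C sketch of both on a simple reduction, purely illustrative and unrelated to the challenge code:)

    #include <stddef.h>

    /* Sum an array with 4x unrolling. The four independent
     * accumulators stay "register resident": their values live in
     * registers across iterations instead of being re-loaded from
     * (or spilled to) memory on each trip through the loop. */
    long sum_unrolled(const long *a, size_t n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {   /* unrolled body */
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)             /* leftover elements */
            s0 += a[i];
        return s0 + s1 + s2 + s3;
    }

Unrolling cuts the per-iteration branch and index overhead, and the independent accumulators let the hardware overlap the additions; the "optimized Index Updates" it mentions is the same spirit of hoisting bookkeeping out of the hot loop.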

I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy a 10-acre farm in Oregon and start learning to grow some veggies and raise chickens.

apsurd 3 hours ago | parent

We've lost the plot.

You can't compete with an AI at doing an AI performance benchmark?

kqr 2 hours ago | parent

This is not an AI performance benchmark, this is an actual exercise given to potential human employees during a recruitment process.

Mashimo 5 hours ago | parent | prev

> sucks dog crap through a coffee straw.

That would be impressive.

anematode 4 hours ago | parent

New LLM benchmark incoming? I bet once it's done, people will still say it's not AGI.

dotancohen 4 hours ago | parent

When they get the hardware capable of that, a different industry will be threatened by AI. The oldest industry.

cess11 4 hours ago | parent

Textile?

nineteen999 3 hours ago | parent

The emperor's (empress's?) new textile.