| ▲ | lairv 4 hours ago |
| Out of curiosity, I gave it the latest Project Euler problem, published on 11/16/2025 and very likely outside the training data. Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says the 3 fastest humans to solve this problem took 14min, 20min and 1h14min respectively. Even though I expect this sort of problem to be very much in the distribution of what the model has been RL-tuned to do, it's wild that frontier models can now solve in minutes what would take me days. |
|
| ▲ | thomasahle 3 hours ago | parent | next [-] |
| I also used Gemini 3 Pro Preview. It finished in 271s = 4m31s. Sadly, the answer was wrong. It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use web search. Still a useful tool, though. It definitely gets the majority of the insights. Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%... |
| |
| ▲ | JBiserkov 3 hours ago | parent [-] | | The requested prompt does not exist or you do not have access. If you believe the request is correct, make sure you have first allowed AI Studio access to your Google Drive, and then ask the owner to share the prompt with you. | | |
| ▲ | junon 2 hours ago | parent [-] | | I thought this was a joke at first. It actually needs drive access to run someone else's prompt. Wild. | | |
| ▲ | ashdksnndck 2 hours ago | parent | next [-] | | On iOS Safari, it just says “Allow access to Google Drive to load this Prompt”. When I run into that UI, my first instinct is that the poster of the link is trying to phish me: that they’ve composed some kind of script that wants to read my Google Drive so it can send info back to them. I’m only going to click “allow” if I trust the sender with my data. IMO, if that’s not what is happening, this is awful product design. | |
| ▲ | dormento 2 hours ago | parent | prev [-] | | Imagine the metrics though. "This quarter we've had a 12% increase in people using AI solutions in their Google Drive." |
|
| ▲ | qsort 3 hours ago | parent | prev | next [-] |
| To be fair, a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time. But seeing these results, I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess: effectively ground truth, and often coming up with solutions that it would be absolutely ridiculous to expect a human to find within a reasonable time limit. |
| |
| ▲ | vjerancrnjak 2 hours ago | parent | next [-] | | How are they faster? I don’t think any Elo report actually comes from participating in a live coding contest on previously unseen problems. | | |
| ▲ | qsort 2 hours ago | parent [-] | | My background is more in math competitions, but all of those things are essentially speed contests: the skill comes from solving hard problems within a strict time limit. If you gave people twice the time, they'd do better, but time is never going to be an issue for a computer. Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies; see my other comment in this thread. |
| |
| ▲ | nerdsniper 3 hours ago | parent | prev [-] | | I’d love it if anyone could provide examples of such AND(“ground truth”, “absolutely ridiculous”) solutions! Even if they took clever humans a long time to create. I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” and “absolutely ridiculous” to create within the usual time constraints. | |
| ▲ | qsort 2 hours ago | parent [-] | | I'm not explaining myself right. Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive, and it would be unrealistic to expect a human to find them in tournament conditions. Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed, game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond. I don't predict the future, and I'm very skeptical of anybody who claims to do so; correctly predicting the present is already hard enough. I'm just saying that, given the progress we've already made, I would find it plausible that a system like that could be built in a few years. The details of what it would look like are beyond my pay grade. --- [0] With caveats in endgames, closed positions and whatnot; I'm using it as an example. | |
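To make the "engine as ground truth" idea concrete, here is a minimal sketch of the centipawn-loss check analysts run over a game. It assumes python-chess is installed and a Stockfish binary named "stockfish" is on PATH; the game fragment and the 100-centipawn threshold are illustrative choices, not standards.

    import chess
    import chess.engine

    # Flag likely mistakes by comparing the engine's evaluation of the
    # best line with the evaluation after the move actually played.
    # Assumes python-chess and a Stockfish binary on PATH (illustrative).
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")

    board = chess.Board()
    moves = ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6"]  # illustrative fragment

    for san in moves:
        # Evaluation before the move, from the mover's perspective.
        before = engine.analyse(board, chess.engine.Limit(depth=18))
        best = before["score"].relative.score(mate_score=100000)

        board.push(board.parse_san(san))

        # After the move the score is from the opponent's side; negate it.
        after = engine.analyse(board, chess.engine.Limit(depth=18))
        played = -after["score"].relative.score(mate_score=100000)

        loss = best - played  # centipawns lost vs. the engine's best line
        if loss > 100:        # illustrative threshold for "mistake"
            print(f"{san}: likely a mistake ({loss} cp worse than best)")

    engine.quit()

The same loop, run over a full game, is essentially how the blunder and accuracy annotations on analysis sites are produced.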
| ▲ | pclmulqdq 2 hours ago | parent [-] | | Yeah, it is often pointed out as a brilliancy in game analysis when a GM makes a move that an engine says is bad and it turns out to be good. However, that only happens in very specific positions. | |
| ▲ | emodendroket 2 hours ago | parent [-] | | Does that happen because the player understands some tendency of their opponent that will cause them to not play optimally? Or is it genuinely some flaw in the machine’s analysis? | | |
| ▲ | thomasahle 31 minutes ago | parent | next [-] | | It's only the latter if it's a weak in-browser engine, and it's early enough in the game that the player had studied the position with a cloud engine. | |
| ▲ | pclmulqdq 2 hours ago | parent | prev [-] | | It can be either one. In closed positions, it is often the latter. |
|
| ▲ | rbjorklin 2 hours ago | parent | prev | next [-] |
| Your post made me curious to try a problem I have been coming back to ever since ChatGPT was first released (https://open.kattis.com/problems/low). I had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. This has been my personal litmus test of LLM programming capabilities, and a model has finally passed. |
|
| ▲ | sedatk 2 hours ago | parent | prev | next [-] |
| Just to clarify the context for future readers: the latest problem at the moment is #970: https://projecteuler.net/problem=970 |
|
| ▲ | thomasahle 3 hours ago | parent | prev | next [-] |
| I tried it with gpt-5.1 thinking, and it just searched and found a solution online :p |
| |
| ▲ | lairv 3 hours ago | parent [-] | | Is there a solution to this exact problem online, or just to related notions (the renewal equation, etc.)? Anyway, it seems like nothing beats training on the test set. |
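For readers wondering about the renewal equation mentioned above: it relates the expected number of renewals m(t) to the inter-arrival distribution via m(t) = F(t) + integral_0^t m(t - s) f(s) ds, where F and f are the CDF and density of the inter-arrival times. Below is a minimal numerical sketch in Python, using an assumed exponential distribution (not the one from the actual problem) so the output can be checked against the exact answer m(t) = lam * t.

    import math

    # Minimal numerical sketch of the renewal equation
    #     m(t) = F(t) + integral_0^t m(t - s) f(s) ds
    # discretized on a uniform grid with a simple Riemann sum.
    # The exponential inter-arrival distribution is an assumed example:
    # for rate lam the exact renewal function is m(t) = lam * t,
    # which makes the approximation easy to sanity-check.

    lam = 2.0          # assumed rate of the exponential distribution
    T, n = 5.0, 2000   # time horizon and number of grid steps
    h = T / n

    def F(t):          # CDF of inter-arrival times
        return 1.0 - math.exp(-lam * t)

    def f(t):          # density of inter-arrival times
        return lam * math.exp(-lam * t)

    m = [0.0] * (n + 1)  # m[i] approximates m(i * h); m(0) = 0
    for i in range(1, n + 1):
        conv = sum(m[i - j] * f(j * h) for j in range(1, i + 1))
        m[i] = F(i * h) + h * conv

    print(f"m({T}) ~ {m[n]:.4f}  (exact: {lam * T:.4f})")

The right-endpoint sum converges at rate O(h), so halving the step size should roughly halve the gap to the exact value.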
|
| ▲ | bumling 13 minutes ago | parent | prev | next [-] |
| I asked Grok to write a Python script to solve this and it did it in slightly under ten minutes, after one false start where I'd asked it using a mode that doesn't think deeply enough. Impressive. |
|
| ▲ | id 2 hours ago | parent | prev | next [-] |
| gpt-5.1 gave me the correct answer after 2m 17s. That includes retrieving the Euler website. I didn't even have to run the Python script; it did that too. |
|
| ▲ | irthomasthomas an hour ago | parent | prev | next [-] |
| Are you sure it did not retrieve the answer using websearch? |
|
| ▲ | j2kun 2 hours ago | parent | prev | next [-] |
| Did it search the web? |
|
| ▲ | orly01 4 hours ago | parent | prev [-] |
| Wow. Sounds pretty impressive. |