simonw 2 hours ago

Here are my notes and pelican benchmark, including a new, harder benchmark because the old one was getting too easy: https://simonwillison.net/2025/Nov/18/gemini-3/
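
The test itself is a single prompt, so it's easy to reproduce. A minimal sketch using my llm CLI with the llm-gemini plugin (the model ID is a placeholder for whatever the current preview is called):

    llm install llm-gemini
    llm keys set gemini   # paste your Gemini API key when prompted
    # model ID below is a placeholder for the current preview name
    llm -m gemini-3-pro-preview \
      "Generate an SVG of a pelican riding a bicycle" > pelican.svg
    # the raw response may wrap the SVG in prose or markdown fences,
    # so you may need to trim it before opening the file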

mtrovo an hour ago | parent | next [-]

It's interesting that you mentioned in a recent post that saturation on the pelican benchmark isn't a problem because it's easy to test for generalization. But looking at your updated benchmark results now, I'm not sure I agree. Have the main labs been climbing the pelican-on-a-bicycle hill in secret this whole time?

skylurk an hour ago | parent | prev | next [-]

They've been training for months to draw that pelican, just for you to move the goalposts.

Thrymr 39 minutes ago | parent | prev | next [-]

Considering how many other "pelican riding a bicycle" comments there are in this thread, it would be surprising if this were not already incorporated into the training data. If not now, soon.

libraryofbabel an hour ago | parent | prev [-]

I was interested (and slightly disappointed) to read that the knowledge cutoff for Gemini 3 is the same as for Gemini 2.5: January 2025. I wonder why they didn't train it on more recent data.

Is it possible they used the same pre-trained base model and just fine-tuned and RL-ed it better (which, of course, is where all the secret-sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what the identical training cutoff points to.
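
One crude way to probe this from the outside: ask about something that happened after January 2025 with no tools available. A minimal sketch with the google-genai Python client (the model ID is my guess at the preview name, so treat it as an assumption):

    # pip install google-genai
    from google import genai

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    # Super Bowl LIX was played in February 2025, after the stated
    # January 2025 cutoff, so a model with that cutoff should not
    # know the result without search grounding.
    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumed preview model ID
        contents="Without using any tools: who won Super Bowl LIX?",
    )
    print(response.text)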

simonw an hour ago | parent [-]

The model card says: https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...

> This model is not a modification or a fine-tune of a prior model.

I'm curious why they decided not to update the training data cutoff date too.

stocksinsmocks 36 minutes ago | parent [-]

Maybe that date is a rule of thumb for when AI-generated content became so widespread that it is likely to have contaminated any data collected after it. Given that people have spoofed authentic Reddit users with Markov chains, it probably doesn’t go back nearly far enough.
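
For anyone who hasn't seen how low that bar is: a word-level Markov chain is a dozen lines of Python. Given a pile of scraped comments (the corpus filename here is hypothetical), something like this already produces plausible-looking filler:

    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        # map each run of `order` words to the words observed after it
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, length=50):
        key = random.choice(list(chain))  # random starting state
        out = list(key)
        for _ in range(length):
            followers = chain.get(tuple(out[-len(key):]))
            if not followers:
                break  # dead end: no observed continuation
            out.append(random.choice(followers))
        return " ".join(out)

    corpus = open("scraped_comments.txt").read()  # hypothetical corpus
    print(generate(build_chain(corpus)))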