simonw 2 hours ago
Here are my notes and pelican benchmark, including a new, harder benchmark because the old one was getting too easy: https://simonwillison.net/2025/Nov/18/gemini-3/
mtrovo an hour ago
It's interesting that you mentioned in a recent post that saturation on the pelican benchmark isn't a problem because it's easy to test for generalization. But looking at your updated benchmark results now, I'm not sure I agree. Have the main labs been climbing the pelican-on-a-bicycle hill in secret this whole time?
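For concreteness, here is a minimal, hypothetical sketch of what "testing for generalization" could look like: permute the animal/vehicle pair and check whether output quality holds up on combinations less likely to be memorized. The animal and vehicle lists below are illustrative assumptions, not taken from simonw's actual harness.

    # Hypothetical sketch: probe benchmark generalization by swapping the
    # subject and vehicle. A model that has merely memorized "pelican on a
    # bicycle" should degrade on unseen pairs; a general one should not.
    import itertools

    ANIMALS = ["pelican", "otter", "capybara", "heron"]
    VEHICLES = ["bicycle", "unicycle", "skateboard", "tandem"]

    def variant_prompts():
        for animal, vehicle in itertools.product(ANIMALS, VEHICLES):
            yield f"Generate an SVG of a {animal} riding a {vehicle}"

    for prompt in variant_prompts():
        # Feed each prompt to the model under test and score the SVGs.
        print(prompt)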
skylurk an hour ago
They've been training for months to draw that pelican, just for you to move the goalposts.
Thrymr 39 minutes ago
Considering how many other "pelican riding a bicycle" comments there are in this thread, it would be surprising if this was not already incorporated in the training data. If not now, soon.
libraryofbabel an hour ago
I was interested (and slightly disappointed) to read that the knowledge cutoff for Gemini 3 is the same as for Gemini 2.5: January 2025. I wonder why they didn't train it on more recent data. Is it possible they used the same pre-trained base model and just fine-tuned and RL-ed it better (which, of course, is where all the secret-sauce training magic is these days anyhow)? That would be odd, especially for a major version bump, but it's sort of what having the same training cutoff points to?
| |||||||||||||||||