sigmar 5 hours ago

blog post is up: https://blog.google/innovation-and-ai/models-and-research/ge...

edit: biggest benchmark changes from 3 pro:

arc-agi-2 score went from 31.1% -> 77.1%

apex-agents score went from 18.4% -> 33.5%
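To put the two jumps on the same footing, here is a quick sketch that computes both the absolute gain (in percentage points) and the relative multiplier from the figures quoted above. The `improvement` helper and the `scores` dict are names made up for illustration; only the four numbers come from the post.

```python
# Scores quoted above, in percent: (3 pro, 3.1 pro).
scores = {
    "arc-agi-2": (31.1, 77.1),
    "apex-agents": (18.4, 33.5),
}


def improvement(before, after):
    """Return (absolute gain in percentage points, relative multiplier)."""
    return after - before, after / before


for name, (before, after) in scores.items():
    gain, ratio = improvement(before, after)
    print(f"{name}: +{gain:.1f} points, {ratio:.2f}x")
```

By this arithmetic arc-agi-2 comes out at roughly 2.5x and apex-agents at roughly 1.8x, so the two gains differ noticeably in relative terms as well as in points.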

ripbozo 4 hours ago | parent | next [-]

Does the arc-agi-2 score more than doubling in a .1 release indicate benchmark-maxing? Though I don't know what arc-agi-2 actually tests.

maxall4 4 hours ago | parent | next [-]

Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvement on other benchmarks is not of the same order.

energy123 3 hours ago | parent | prev | next [-]

Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.

tasuki 2 hours ago | parent | next [-]

Didn't the same Francois Chollet claim that this was the Real Test of Intelligence? If they target it, perhaps they target... real intelligence?

CamperBob2 3 hours ago | parent | prev [-]

I don't know what he could mean by that, as the whole idea behind ARC-AGI is to "target the benchmark." Got any links that explain further?

layer8 2 hours ago | parent [-]

The fact that ARC-AGI has public and semi-private in addition to private datasets might explain it: https://arcprize.org/arc-agi/2/#dataset-structure

blinding-streak 4 hours ago | parent | prev | next [-]

I assume all the frontier models are benchmaxxing, so it would make sense

boplicity 4 hours ago | parent | prev [-]

Benchmark maxing could be interpreted as benchmarks actually being a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.

sho_hn 5 hours ago | parent | prev [-]

The touted SVG improvements make me excited for animated pelicans.

takoid 4 hours ago | parent | next [-]

I just gave it a shot and this is what I got: https://codepen.io/takoid/pen/wBWLOKj

The model thought for over 5 minutes to produce this. It's not quite photorealistic (some parts are definitely "off"), but it is a significant leap in complexity.

tasuki 41 minutes ago | parent | next [-]

That's a good pelican. What I like the most is that the SVG is nice and readable. If only Inkscape could output nice SVG like this!

makeavish 4 hours ago | parent | prev | next [-]

Looks great!

benatkin 4 hours ago | parent | prev [-]

Here's what I got from Gemini Pro on gemini.google.com; it thought for under a minute. Might you have been using AI Studio? https://jsbin.com/zopekaquga/edit?html,output

It does say 3.1 in the Pro dropdown box in the message sending component.

james2doyle 4 hours ago | parent | prev | next [-]

The blog post includes a video showcasing the improvements. Looks really impressive: https://blog.google/innovation-and-ai/models-and-research/ge...

aoeusnth1 4 hours ago | parent | prev | next [-]

I imagine they're also benchgooning on SVG generation

rdtsc 2 hours ago | parent | prev | next [-]

My perennial joke is that as soon as that got on the HN front page, Google went and hired some interns, and they spend 100% of their time on pelicans.

vunderba 3 hours ago | parent | prev | next [-]

SVG is an underrated use case for LLMs because it gives you the scalability of vector graphics along with CSS-style interactivity (hover effects, animations, transitions, etc.).
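As a rough illustration of the vector-plus-CSS point, here is a minimal sketch that builds a standalone SVG with an embedded `<style>` block, so a circle changes color on hover with a transition. The function name `hover_circle_svg` and its parameters are hypothetical, chosen just for this example. Python is only used to assemble the string and check that it is well-formed XML.

```python
import xml.etree.ElementTree as ET


def hover_circle_svg(size=120, color="steelblue", hover_color="tomato"):
    """Build a standalone SVG string with an embedded CSS hover transition."""
    half = size // 2
    radius = size // 3
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" '
        f'height="{size}" viewBox="0 0 {size} {size}">\n'
        "  <style>\n"
        f"    .dot {{ fill: {color}; transition: fill 0.3s ease; }}\n"
        f"    .dot:hover {{ fill: {hover_color}; }}\n"
        "  </style>\n"
        f'  <circle class="dot" cx="{half}" cy="{half}" r="{radius}" />\n'
        "</svg>"
    )


svg = hover_circle_svg()
ET.fromstring(svg)  # raises ParseError if the markup is not well-formed
print(svg)
```

Because the graphic is defined by a `viewBox`, the same markup scales to any display size, and the hover behavior lives in plain CSS rather than in script.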

DonHopkins 2 hours ago | parent | prev [-]

How about STL files for 3D-printing pelicans?