| |
| ▲ | saberience 8 hours ago | parent | next [-] | | Because no one cares about optimizing for this: it's a stupid benchmark. It doesn't mean anything. No frontier lab is trying hard to improve the way its model produces SVG files. I would also add that the frontier labs are spending all their post-training effort on the shit that actually makes them money, i.e. writing code and improving tool calling. The pelican-on-a-bicycle thing is funny, yes, but it doesn't really translate into more revenue for AI labs, so there's a reason it isn't radically improving over time. | | |
| ▲ | simonw 8 hours ago | parent | next [-] | | +1 to "it's a stupid benchmark". | | | |
| ▲ | obidee2 7 hours ago | parent | prev | next [-] | | Why stupid? Vector images are widely used and extremely useful, both directly and for rendering raster images at different scales. It’s also highly connected with spatial and geometric reasoning and precision, which would open up a whole new class of problems these models could tackle. Sure, it’s secondary to raster image analysis and generation, but curious why it would be stupid to pursue? | |
| ▲ | storystarling 7 hours ago | parent | prev | next [-] | | I suspect there is actually quite a bit of money on the table here. For those of us running print-on-demand workflows, the current raster-to-vector pipeline is incredibly brittle and expensive to maintain. Reliable native SVG generation would solve a massive architectural headache for physical product creation. | |
| ▲ | lofaszvanitt 7 hours ago | parent | prev [-] | | It shows that these are nowhere near anything resembling human intelligence. You wouldn't have to optimize for anything if it were a general intelligence of sorts. | | |
| ▲ | CamperBob2 7 hours ago | parent [-] | | Here's a pencil and paper. Let's see your SVG pelican. | | |
| ▲ | vladms 7 hours ago | parent | next [-] | | So you think that if we gave a pencil and paper to the model, it would do better? I don't think SVG is the problem. It just shows that models are fragile (nothing new): even if they can (probably) make a good PNG of a pelican on a bike, and they can (probably) make some good SVG, they do not "transfer" things because they do not "understand" them. I do expect models to fail randomly on tasks that are not "average and common", so for me personally the benchmark is not very useful (and that does not mean they can't work, just that I would not bet on it). If there are people who think "the LLM outputted an SVG for my request, so it can output an SVG for any image", there might be some value in it. | |
| ▲ | zebomon 7 hours ago | parent | prev [-] | | This exactly. I don't understand the argument, which seems to be that if it were real intelligence, it would never have to learn anything. It's machine learning, not machine magic. | | |
| ▲ | CamperBob2 7 hours ago | parent [-] | | One aspect worth considering: a human who knows HTML and graphics coding but has never heard of SVG could be expected to perform such a task (eventually) if given the chance to learn SVG from the spec. Current-gen LLMs might be able to do that with in-context learning, but if limited to pretraining alone, or even pretraining followed by post-training, would one book be enough to impart genuine SVG composition and interpretation skills to the model weights themselves? My understanding is that the answer would be no: a single copy of the SVG spec would not be anywhere near enough to make the resulting base model any good at SVG authorship. Quite a few other examples and references would be needed in pretraining, post-training, or both. So one measure of AGI -- necessary but not sufficient on its own -- might be the ability to gain knowledge and skills with no more exposure to training material than a human student would be given. We shouldn't have to feed it terabytes of highly redundant training material, as we do now, and spend hundreds of GWh to make it stick. Of course that could change by 5 PM today, the way things are going...
|
|
|
| |
| ▲ | NitpickLawyer 8 hours ago | parent | prev | next [-] | | It would be trivial to detect such gaming, tho. That's the beauty of the test, and that's why they're probably not doing it. If a model draws "perfect" (whatever that means) pelicans on a bike, you start testing for owls riding a lawnmower, or crows riding a unicycle, or x _verb_ on y ... | | |
| ▲ | Sharlin 7 hours ago | parent [-] | | It could still be special-case RLHF trained, just not up to perfection. |
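A minimal sketch of the randomized variation NitpickLawyer describes, for anyone who wants to try it. The subject/action/vehicle word lists are made-up placeholders, not anyone's actual test suite; the point is only that the prompt space is too large to special-case in post-training:

    import itertools
    import random

    # Hypothetical word lists: placeholders, not a real benchmark.
    subjects = ["pelican", "owl", "crow", "walrus", "hedgehog"]
    actions = ["riding", "balancing on", "pedaling"]
    vehicles = ["bicycle", "lawnmower", "unicycle", "skateboard"]

    def sample_prompts(n: int, seed: int = 0) -> list[str]:
        """Draw n distinct '<subject> <action> a <vehicle>' SVG prompts."""
        rng = random.Random(seed)
        combos = list(itertools.product(subjects, actions, vehicles))
        picks = rng.sample(combos, k=min(n, len(combos)))
        return [f"Generate an SVG of a {s} {a} a {v}." for s, a, v in picks]

    for prompt in sample_prompts(5):
        print(prompt)

If a model scores much better on the canonical pelican prompt than on the randomized ones, that's a decent signal of special-casing.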
| |
| ▲ | derefr 7 hours ago | parent | prev [-] | | It’d be difficult to use in any automated process, as judging how good one of these renditions is remains very qualitative. You could try to rasterize the SVG and then use an image2text model to describe it, but I suspect it would just “see through” any flaws in the depiction and describe it as “a pelican on a bicycle” anyway. |
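For what it's worth, a minimal sketch of the pipeline derefr describes, using cairosvg for rasterization and an OpenAI-style vision endpoint as the judge. The model name and one-sentence prompt are placeholder choices, and as derefr notes, a lenient captioner may still report "a pelican on a bicycle" even when the drawing is badly mangled, so the caption alone is a weak automated score:

    import base64
    import cairosvg                  # rasterizes SVG to PNG
    from openai import OpenAI        # any vision-capable API would do

    def caption_svg(svg_source: str, model: str = "gpt-4o") -> str:
        """Rasterize an SVG string and ask a vision model what it depicts."""
        png_bytes = cairosvg.svg2png(bytestring=svg_source.encode("utf-8"))
        data_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()

        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }],
        )
        return response.choices[0].message.content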
|