segmondy 3 hours ago

For those claiming they rigged it: do you have any concrete evidence? What if the models have just gotten really good?

I just asked Gemini Pro to generate an SVG of an octopus dunking a basketball and it did a great job. Not even the Deep Think model. Then I tried "generate an svg of raccoon at a beach drinking a beer". You can go try this out yourself: ask it to generate anything you want in SVG. Use your imagination.

Rant: This is why AI is going to take over; folks are not even trying in the least.

JumpCrisscross 3 hours ago | parent | next [-]

> What if the models have just gotten really good?

Kagi Assistant remains my main way of interacting with AI. One of its benefits is you're encouraged to try different models.

The heterogeneity in competence, particularly per unit of time, is growing rapidly. If I'm extrapolating image-creation capabilities from Claude, I'm going to underestimate what Gemini can do without any fuckery. Likewise, if I'm using Grok all day, Gemini and Claude will seem unbelievably competent when it comes to deep research.

raincole 2 hours ago | parent | prev | next [-]

Every bit of improvement in AI ability has a corresponding denial phrase. Some people still think AI can't generate the correct number of fingers today.

halJordan an hour ago | parent [-]

I love to hate it when someone unironically thinks asking an LLM how many letters are in a word is a good test

Jerrrrrrrry 7 minutes ago | parent [-]

It is a good test now, for reasoning models.

It was a terrible test for pure tokenized models, because the logit carrying the carry digit during summation has a decent chance of getting lost.

SOTA models should reason their way to generating a function that returns the count of a given character, check it with tests, and use its output for the answer.
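
Something like this, as a rough illustration (a minimal Python sketch; the function name and test words are made up, not what any particular model actually emits):

    # Illustrative helper a reasoning model might write for itself:
    # count occurrences of a character, then sanity-check the function
    # against a few known cases before trusting its output.
    def count_char(word: str, char: str) -> int:
        """Return how many times `char` appears in `word` (case-insensitive)."""
        return word.lower().count(char.lower())

    # Quick tests the model could run before reporting an answer.
    assert count_char("strawberry", "r") == 3
    assert count_char("banana", "a") == 3
    assert count_char("letter", "z") == 0

    print(count_char("strawberry", "r"))  # 3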

WarmWash 3 hours ago | parent | prev | next [-]

Simon has a private set of SVG tests he uses as well. He said that the private ones were just as impressive.

irthomasthomas 2 hours ago | parent | prev | next [-]

Why frame it as rigging? I assume they would teach the models to improve on tasks the public find interesting. Then we just have to come up with more challenges for it.

krackers an hour ago | parent [-]

It's not rigging—it's just RL.

bayindirh 2 hours ago | parent | prev | next [-]

> For those claiming they rigged it.

I don't think they "rigged" it, but that part might have been given a bit more of a push, since the benchmark has been going for a very long time now.

Another benchmark is running at [0]. It's pretty interesting: a model that scores perfectly in one iteration "borks" in the next, for example.

> Rant: This is why AI is going to take over, folks are not even trying the least.

It might draw things alright, at least in some cases. I seldom use it, mostly when hours of my own research don't take me where I want to go, and guess what? AI can't get there either. It hallucinates things, makes up stuff, etc. For a couple of the things I asked, it managed to find a single reference, and it was exactly what I was looking for, so in my cases it works only rarely.

Rant: This is why people are delusional. They test the happy path and claim it knows all the paths, and then some.

[0]: https://clocks.brianmoore.com/

ej88 43 minutes ago | parent | prev | next [-]

"not enough people are emotionally prepared for if it’s not a bubble"

colecut 3 hours ago | parent | prev | next [-]

and it will be folks using AI taking over for at least a while...

Some people try, most people don't.

AI makes doing almost anything easier for the people who do.

Despite the prophesied near-term obliteration of white collar work, I've never felt luckier to work in software.

dw_arthur 2 hours ago | parent | prev [-]

Everyone should have their own private evals for models. If a model flat-out gets one of my questions wrong, I'll sometimes add it to my test-question bank.
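
A minimal sketch of what such a bank could look like, in Python. The file layout, the substring grading rule, and the `ask_model` hook are all assumptions for illustration, not any vendor's API; you plug in however you actually call a model:

    import json
    from pathlib import Path
    from typing import Callable

    # Hypothetical storage: a JSON list of {"question": ..., "expect": ...} items.
    BANK = Path("private_evals.json")

    def add_question(question: str, expect: str) -> None:
        """Add a question a model got wrong to the private bank."""
        items = json.loads(BANK.read_text()) if BANK.exists() else []
        items.append({"question": question, "expect": expect})
        BANK.write_text(json.dumps(items, indent=2))

    def run_evals(ask_model: Callable[[str], str]) -> None:
        """Re-run every banked question against a model and report pass/fail.

        `ask_model` is whatever function you use to query a model (hypothetical
        here); grading is a crude substring check, enough for a personal test.
        """
        items = json.loads(BANK.read_text())
        passed = 0
        for item in items:
            answer = ask_model(item["question"])
            ok = item["expect"].lower() in answer.lower()
            passed += ok
            print("PASS" if ok else "FAIL", "-", item["question"])
        print(f"{passed}/{len(items)} passed")

The substring check is deliberately dumb; the point is just that the bank grows out of real failures and gets replayed against every new model.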