simianwords 2 hours ago

You had to go all the way and find it in the benchmark results that specifically stress test this.

You could not come up with a single one yourself. And you also linked an example where it was not allowed to use tools, when I specifically said that it should be able to use tools. I'm not sure why you are presenting this as though it is a big gotcha.

I think my main point pretty much stands.

camgunz 2 hours ago | parent [-]

I found over 500 examples that fit your criteria. Embarrassing you were arguing in bad faith this whole time.

simianwords 2 hours ago | parent [-]

They all use the tool search, no? Please correct me if I'm wrong.

My criterion was using ChatGPT, which explicitly allows it.

https://arxiv.org/html/2511.13029v1 if you don't believe me.

BTW this was your original point

>Anyway, it's trivial to get pretty much any model to make things up. Don't we all know this? That's why I was surprised by your position; if we know anything about these things it's that they make things up.

And look at how much effort you have had to put in:

1. use the wrong model for the horns example

2. the game one also didn't work

3. now you are searching for examples in literal benchmarks and you are still not able to find any

How is this trivial in any interpretation of the word?

I think it would be perfectly reasonable to agree that it is not at all trivial to find counter examples for my challenge.

camgunz an hour ago | parent [-]

I've got about 20 minutes in this; mostly I've been reading wallstreetbets at the Shake Shack bar in the Boston airport. I'm happy to post this over and over again until you engage w/ it:

> I found over 500 examples that fit your criteria.