XCSme 2 hours ago

Gets 10/10 on my potato benchmarks: https://aibenchy.com/model/google-gemini-3-1-pro-preview-med...

XCSme 3 minutes ago | parent | next [-]

Added one more test, which, surprisingly, gemini flash 3 reasoning passes but gemini 3.1 pro does not.

XCSme 2 hours ago | parent | prev | next [-]

Now I need to write more tests.

It's a bit hard to trick reasoning models, because they explore many angles of a problem and might accidentally have an "a-ha" moment that puts them on the right path. It's a bit like random sampling: you run gradient descent from a bunch of random starting points and stumble upon the right result from one of them.
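
To make the analogy concrete, here is a rough sketch of random-restart gradient descent on a toy function. Everything in it (the function `f`, the step size, the number of starts, the seed) is made up for illustration and has nothing to do with how any model actually reasons. A single descent can get stuck in the worse local minimum, while several random starts usually include one that lands in the better one.

```python
import random

def f(x):
    # Toy non-convex function (an assumption for illustration):
    # two local minima, the lower one near x = -1.04, the other near x = 0.96.
    return x**4 - 2 * x**2 + 0.3 * x

def grad(x):
    # Derivative of f.
    return 4 * x**3 - 4 * x + 0.3

def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent from a single starting point.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def random_restarts(n_starts, seed=0):
    # Sample random starting points, descend from each, keep the best end point.
    rng = random.Random(seed)
    starts = [rng.uniform(-2, 2) for _ in range(n_starts)]
    ends = [descend(x) for x in starts]
    return min(ends, key=f)

if __name__ == "__main__":
    print("single start at 1.5:", descend(1.5))
    print("best of 8 random starts:", random_restarts(8))
```

A start at 1.5 slides into the shallower minimum on the right, but any start that happens to fall left of the small hump near zero ends up in the deeper one.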

thevinter 17 minutes ago | parent | prev [-]

Are you intentionally keeping the benchmarks private?

XCSme 7 minutes ago | parent [-]

Yes.

I am trying to work out the best way to share as much information as possible about how the AI models fail, without revealing anything that could help them overfit to those specific tests.

I am planning to add some extra LLM calls that summarize the failure reason without revealing the test.
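
One possible shape for that, as a sketch only and not the actual implementation: ask a second model for a summary, then refuse to publish it if it quotes the private test verbatim. The `summarize` callable is a hypothetical stand-in for whatever LLM call gets used, and the 4-word overlap check is an assumed, fairly crude leak detector.

```python
from typing import Callable

def leaked_fragments(test_prompt: str, summary: str, n: int = 4) -> set[str]:
    # Word n-grams of the private test that appear verbatim in the summary.
    words = test_prompt.lower().split()
    grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    text = " ".join(summary.lower().split())
    return {g for g in grams if g in text}

def safe_failure_summary(
    test_prompt: str,
    model_answer: str,
    summarize: Callable[[str], str],  # hypothetical wrapper around an LLM call
    n: int = 4,
) -> str:
    instruction = (
        "Describe in one sentence why the answer below is wrong. "
        "Do not quote or paraphrase the question.\n\n"
        f"Question: {test_prompt}\nAnswer: {model_answer}"
    )
    summary = summarize(instruction)
    # Publish the summary only if it does not copy text from the test.
    if leaked_fragments(test_prompt, summary, n):
        return "Summary withheld: it quoted the private test."
    return summary
```

Exact n-gram matching will not catch a close paraphrase, so this only guards against the most direct kind of leak.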