Added one more test, which, surprisingly, gemini flash 3 reasoning passes but gemini 3.1 pro does not.

XCSme 2 hours ago | parent | prev | next [-]

Now I need to write more tests.

It's a bit hard to trick reasoning models, because they explore many angles of a problem and might accidentally have an "a-ha" moment that puts them on the right path. It's a bit like sampling random starting points, running gradient descent from each, and stumbling upon the right result from one of them.
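
To make the analogy concrete, here is a rough sketch of random-restart gradient descent. The function `f`, the step size, and the sampling range are made up purely for illustration and say nothing about how any model actually reasons. A single descent can get stuck in a shallow local minimum, while several random starts make it likely that at least one lands in the deeper basin.

```python
import random

def f(x):
    # Illustrative 1-D function with two minima of different depth.
    return x**4 - 3 * x**2 + x

def grad(x):
    # Analytic derivative of f.
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=500):
    # Plain gradient descent from a single starting point.
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def multi_start(n_starts, seed=0):
    # Sample random starting points, descend from each, keep the lowest end point.
    rng = random.Random(seed)
    starts = [rng.uniform(-2.0, 2.0) for _ in range(n_starts)]
    ends = [descend(x0) for x0 in starts]
    return min(ends, key=f)
```

Starting at `descend(1.5)` settles in the shallower minimum on the right, while `multi_start(20)` ends up in the deeper one on the left, because at least one of the sampled starts falls into that basin.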

thevinter 17 minutes ago | parent | prev [-]

Are you intentionally keeping the benchmarks private?

XCSme 7 minutes ago | parent [-]

Yes. I am trying to figure out the best way to give the most information about how the AI models fail without revealing anything that could help them overfit on those specific tests. I am planning to add some extra LLM calls that summarize the failure reason without revealing the test.
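
One possible guard for that last step, as a minimal sketch: before publishing an LLM-written failure summary, check that it does not repeat any run of words from the private test. The helper names, the 4-word window, and the example prompt below are all hypothetical; the summary is assumed to come from a separate LLM call that is not shown here.

```python
import re

def word_ngrams(text, n):
    # Lowercased word n-grams, ignoring punctuation.
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaks_test(test_prompt, summary, n=4):
    # True if the summary shares any n-word run with the private test prompt.
    return bool(word_ngrams(test_prompt, n) & word_ngrams(summary, n))

def safe_summary(test_prompt, draft_summary, n=4):
    # Return the draft only if it does not quote the test; otherwise withhold it.
    if leaks_test(test_prompt, draft_summary, n):
        return None
    return draft_summary

# Hypothetical example, not one of the actual benchmark tests.
prompt = "How many r letters are in the word strawberry"
ok = safe_summary(prompt, "The model miscounted repeated letters.")
blocked = safe_summary(prompt, "It failed on how many r letters are in it.")
```

A verbatim-overlap check like this only catches direct quoting. A paraphrase that still gives the test away would get through, so it is a first filter rather than a guarantee.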