Over o3 it's only incremental (which backs up the community's general feeling that GPT-5 is an incremental improvement over o3), but it's very consistently better. Also worth mentioning: the score of 77% vs. 90% on the sequences round was shockingly good, and shows an improvement in the LLM's ability not just to "classify things" (little to no improvement there) but to really understand the underlying pattern and get the next one right.

catigula 3 days ago | parent | next [-]

How are you determining that it's better?

Care to make a case for it that isn't based on (gameable) benchmarks?

scrollaway 3 days ago | parent [-]

By that metric, everything is gameable. Any case we'd make for it would be purely based on vibes (and our take on that would not be any more useful than the general community opinion there).

yunwal 3 days ago | parent | next [-]

> By that metric, everything is gameable

Usually in cases like this you would use a testing set created after the model was trained.

catigula 3 days ago | parent | prev [-]

So the answer would be no.

scrollaway 3 days ago | parent [-]

A benchmark is exactly how you measure things reliably instead of "based on vibes". I really don't understand what you're asking or expecting.

Terretta a day ago | parent | prev [-]

Why no test of o3-pro?