dwaltrip 2 days ago

If they game the pelican benchmark, it'd be pretty obvious. Just try other random, non-realistic prompts like "a giraffe walking a tightrope" or "a car sitting at a cafe eating a pizza". If the results are dramatically different, then they gamed it. If they are similar in quality, then they probably didn't.
|
ctoth 2 days ago

> as they do for popular benchmarks or for penguins riding a bike.

Citation?

criley2 2 days ago

While it is true that model makers are increasingly trying to game benchmarks, it's also true that benchmark-chasing is lowering model quality. GPT 5, 5.1, and 5.2 have been nearly universally panned by almost every class of user, despite being benchmark monsters. In fact, the more OpenAI tries to benchmark-max, the worse their models seem to get.

astrange 2 days ago

Hm? 5.1 Thinking is much better than 4o or o3. Just don't use the instant model.

malnourish a day ago

5.2 is a solid model and I'm actually impressed with M365 Copilot when using it.
|