m101 2 hours ago
I've been building an auction site and using an AI swarm to test it: sellers, intermediaries, buyers, market practices/norms, etc. I was mostly using GPT 5.5 xhigh to code up the scenario, then looping over it with Opus 4.8 as a checker. Out of curiosity I asked Fable to review it all, and I was shocked to find that a lot of blindingly obvious common-sense mistakes had gotten through, for example:

- all intermediaries were given the prices of all buyers up front
- private price information in certain auction types was actually being broadcast to everyone
- multiple contradictions in the instructions

If it had been any one of these things I might have understood, but the fact that so many got past both Opus and GPT 5.5 makes me think that Fable has something special. This is a common-sense kind of thing that I think you only notice when your task doesn't involve a measurable metric, but rather some sort of fuzzy real-world task. There's clearly a problem with all these measures of performance when the difference between these models was night and day on my specific task.
throwwwll 2 hours ago
Maybe you are something special for letting those slip through in the first place?
pleasstopnw 2 hours ago
[dead]