xnx 4 days ago
Gemini 3 is top of the leaderboard: https://andonlabs.com/evals/vending-bench-2
seizethecheese 4 days ago
> Models are tasked with running a simulated vending machine business over a year and scored on their bank account balance at the end.

The article being discussed here is about how AI couldn't run a real-world vending machine. Nothing went wrong in the components that a standard simulation would cover.
UncleMeat 3 days ago
"It works in the simulation" is the new "it works on my machine."