Marazan 4 days ago
Here's my old benchmark question and my new variant: "When was the last time England beat Scotland at rugby union". New variant: "Without using search when was the last time England beat Scotland at rugby union".

It is amazing how bad ChatGPT is at this question, and has been for years now across multiple models. It's not that it gets it wrong - no shade, I've told it not to search the web so this is _hard_ for it - but how badly it reports the answer.

Starting with the small stuff: it almost always reports the wrong year, wrong location and wrong score. That's the boring facts stuff that I would expect it to stumble on. It often invents details of matches that never happened; cool, standard hallucinations.

But even within the text it generates itself, it cannot stay consistent with how reality works. It often reports draws as wins for England. It frequently states that the team it just said scored the most points lost the match, etc.

It is my ur-example for when people challenge my assertion that LLMs are stochastic parrots or fancy Markov chains on steroids.