saturn8601 an hour ago

Dates matter. Questions about my Mazda that got total hallucinations a year ago were answered very well this year. To me it feels like the early days of computing. What was not possible one year became possible when a new generation of CPU or GPU came out, and you have to consistently re-evaluate your expectations or else you'll miss the things that others are discovering with fresh eyes.

I made this personal 'benchmark' of odd and strange questions a few years back when this took off, and I would keep re-running these questions whenever some big news came out about a new model, also going back and forth between the different companies to see where they all stood. (Obviously with clean cache/new accounts.)

10 questions: in 2023 it could only get past question 3-4; last year it reached the last question but was still hallucinating; this year it provides sources pulled from really obscure books.

For example, one of the harder questions was about the transition in a particular 30-second portion of a background song used in a 30+ year old Bond film, a portion that was only played once in the entire film. It went from totally making up nonsense to accurately describing the music theory definition of the transition (called a 'stinger'), explaining why it was done in that particular scene of the film, and providing sources from a snippet of an unrelated interview with the composer explaining his mindset at the time.

Maybe this isn't considered a real benchmark as it's not reproducible, but for a 'personal benchmark' I came away impressed. I would encourage everyone to define their own benchmarks and 'tests' and to consistently challenge the models to see if there are any meaningful improvements. Now I treat the AI as something to stay skeptical of but also to always consider what it proposes as an answer (i.e. don't ever dismiss it outright). I sometimes wonder if this is slowly messing up my biases, and maybe that's what Altman, Amodei and others want.