ben_w 17 hours ago

I've only noticed that combination (failure of short everyday tasks from SOTA models) on image comprehension, not text.

So some model will misclassify my American black nightshade* weeds as a tomato, but I get consistently OK text output from good models unless it's a trick question.

* I reckon, at least; looked like this to me: https://en.wikipedia.org/wiki/Solanum_americanum#/media/File...

iLoveOncall 13 hours ago

The research from METR, and my comment, are exclusively about software development tasks.

ben_w 11 hours ago

Re-reading my comment, I realise I missed the most important part, the question.

What examples can you give of "real world situations" where they fail?

Obviously I don't want to use them for whatever that is.