I've only noticed that combination (failure of short everyday tasks from SOTA models) on image comprehension, not text.

So some model will misclassify my American black nightshade* weeds as a tomato, but I get consistently OK text output from good models unless it's a trick question.

* I reckon, at least; looked like this to me: https://en.wikipedia.org/wiki/Solanum_americanum#/media/File...

iLoveOncall 13 hours ago | parent [-]

The research from METR, and my comment, are exclusively about software development tasks.

	ben_w 11 hours ago | parent [-]
		Re-reading my comment, I realise I missed the most important part: the question. What examples can you give of "real world situations" where they fail? Obviously I don't want to use them for whatever that is.