Good news! Today's SOTA models can also make things go badly.
Yep. I don’t see how that metric indicates how… strong(?) a language model is.