Vegenoid 6 days ago
While I largely agree with you, more abstract judgements must be made as the capabilities (and therefore the tasks being completed) become increasingly general. Attempts to boil human intellectual capability down to "X performance on Y task according to Z eval" can be useful, but they are famously incomplete and insufficient on their own for making good decisions about which humans (a.k.a. which general intelligences) are useful and how to utilize and improve them. Boiling down highly complex behavior into a small number of metrics loses a lot of detail. There is also the desire to discover why a model that outperforms others does so, so that the successful technique can be refined and applied elsewhere. This, too, usually requires more than metric comparison alone.