Why? And what would a good benchmark look like?
30 people each trying out every model on the list on their own use case for a week, then checking which ones they're still using a month later.