eis 6 hours ago
The Elo rating system measures performance relative to the other models. As the other models improve, or rather as newer, better models enter the list, the Elo score of an existing model will tend to decrease even if nothing about the model or its system prompt has changed. You can't use Elo scores to measure decay of a model's performance in absolute terms. For that you need a fixed harness running over a fixed set of tests.
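To make the drift concrete, here is a rough sketch, not how any particular leaderboard computes its scores. It uses the standard Elo expected-score and update formulas. The latent `skill` values, the model names, the K-factor, and the choice to score each game with the true win probability (a deterministic stand-in for averaging many noisy games) are all assumptions for illustration.

```python
def expected(r_a, r_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def settle(ratings, skill, rounds=2000, k=16.0):
    """Run round-robin Elo updates until the ratings settle.

    Assumption: each game's score is A's true win probability, derived from
    the hypothetical latent `skill` values, instead of a sampled win or loss.
    """
    names = list(ratings)
    for _ in range(rounds):
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                score_a = expected(skill[a], skill[b])
                delta = k * (score_a - expected(ratings[a], ratings[b]))
                ratings[a] += delta  # zero-sum: A gains exactly what B loses
                ratings[b] -= delta
    return dict(ratings)  # snapshot, so later updates don't alter it


# Three hypothetical models with fixed latent skill, all starting at 1500.
skill = {"a": 1400.0, "b": 1500.0, "c": 1600.0}
ratings = {name: 1500.0 for name in skill}
before = settle(ratings, skill)

# A stronger model joins at the default rating; nobody else changes.
skill["new"] = 1900.0
ratings["new"] = 1500.0
after = settle(ratings, skill)

for name in before:
    print(name, round(before[name]), "->", round(after[name]))
```

Because the updates are zero-sum, the points the newcomer gains have to come from the existing pool. In this toy setup every old model should settle roughly 100 points lower even though its `skill` never changed, which is the effect a fixed harness over a fixed test set avoids.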
bob1029 4 hours ago
The relative and auto-scaling nature of Elo ranking feels like an advantage here. Relative ranking systems extract more information per tournament. With enough tournaments, you will get something approximating the actual latent skill level.
TurdF3rguson 3 hours ago
Is that strictly true? Elo ratings also inflate over time (looking at you, chess GMs).