chongli 2 hours ago
I don't want to go too far down the conspiracy rabbit hole, but the vendors know everyone's prompts so it would be trivial for them to track the trackers and spoof the results. We already know that they substitute different models as a cost-saving measure, so substituting models to fool the repeated evaluations would be trivial. We also already know that they actively seek out viral examples of poor performance on certain prompts (e.g. counting Rs in strawberry) and then monkey-patch them out with targeted training. How can we be sure they're not trying to spoof researchers who are tracking model performance? Heck, they might as well just call it "regression testing." If their whole gig is an "emperor's new clothes" bubble situation, then we can expect them to try to uphold the masquerade as long as possible.