MattSayar 6 hours ago
I recognize the sarcasm, but the data I can find says it's performing at baseline?
ACCount37 6 hours ago
Yeah, that's my point. Humans are not reliable LLM evaluators. "Secret model nerfs" happen in "vibes" far more often than they do in any reality.