| ▲ | zhyder 5 days ago | |
Glad to see big improvement in the SimpleQA Verified benchmark (28->69%), which is meant to measure factuality (built-in, i.e. without adding grounding resources). That's one benchmark where all models seemed to have low scores until recently. Can't wait to see a model go over 90%... then will be years till the competition is over number of 9s in such a factuality benchmark, but that'd be glorious. | ||
| ▲ | jug 5 days ago | parent [-] | |
Yes, that's very good because it's my main use case for Flash; queries depending on world knowledge. Not science or engineering problems, but think you'd ask someone that has a really broad knowledge about things and can give quick and straightforward answers. | ||