observationist, 5 hours ago:
I think what they're getting at is that for a given unit of compute, this method achieves 125% performance. If model A reaches performance level 100 using 100 units of compute with the old methods, and you train model B with AttnRes aiming at the same performance level, it costs you 80 units of compute. It probably doesn't map precisely, and that's where people are diverging from the claim: it doesn't explicitly say anything about reduced inference or training time, but that's the implicit value of these sorts of results. Less compute for equivalent performance can be a huge win for platforms at scale as well as for local models.
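The arithmetic behind that equivalence can be sanity-checked in a few lines. This is only a sketch under the simplifying assumption that performance scales linearly with compute (real scaling laws are power-law shaped, so the real mapping would differ):

```python
# Rough sanity check of the "125% efficiency -> 80% compute" claim.
# Assumes performance per unit of compute scales linearly, which is
# a simplification, not anything the paper claims.
efficiency_gain = 1.25   # 125% performance per unit of compute
baseline_compute = 100   # units model A spends to reach performance 100

# Compute model B would need to reach the same performance level:
equivalent_compute = baseline_compute / efficiency_gain
print(equivalent_compute)  # 80.0
```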
dvt, 5 hours ago (parent):
> I think what they're getting at is that for a given unit of compute, this method achieves 125% performance.

This is not what they're getting at; I explained exactly what they're getting at. I mean, your equating "loss" (what the authors actually measured) with "performance" is just bizarre. We use benchmarks to measure performance, and the numbers there were more like 1-5% better (apart from the GPQA-Diamond outlier). Do people even read these papers?