jryio 3 hours ago
That's correct. However, as other commenters have noted, doing this by hand is extremely challenging for human engineers working on tensor kernels. The expense calculation might be: expense of improvement = (time taken per optimization step * cost of unit time) / (speedup - 1). The expensive heuristic function saves wall time while also being cheaper in cost per unit time. And as the paper shows, the speedup provided for each unit of time, multiplied by the unit cost of time, is large.
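To make the ratio above concrete, here is a minimal Python sketch of that expense calculation. The function name, the parameter names, and every number in the usage lines are made up for illustration; none of them come from the paper or from measured data.

```python
def expense_of_improvement(time_per_step, cost_per_unit_time, speedup):
    """Cost paid per unit of speedup gained beyond the 1x baseline.

    Implements: (time per optimization step * cost of unit time) / (speedup - 1).
    All three inputs are hypothetical quantities supplied by the caller.
    """
    if speedup <= 1:
        # At speedup == 1 nothing improved; the ratio is undefined or negative.
        raise ValueError("speedup must be greater than 1")
    return (time_per_step * cost_per_unit_time) / (speedup - 1)


# Made-up inputs: an automated search taking 2 hours per step at 50 per hour
# and reaching 17x, versus hand tuning taking 40 hours per step at 100 per
# hour and reaching 1.5x.
automated = expense_of_improvement(2.0, 50.0, 17.0)
by_hand = expense_of_improvement(40.0, 100.0, 1.5)
print(automated, by_hand)
```

A lower value means each unit of speedup was cheaper to obtain, so under these invented inputs the automated search comes out far ahead. With different inputs the comparison could easily flip.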
greeravoctado 2 hours ago
Usually the rate of overall improvement from this type of optimization is lower than the Moore's law rate of improvement, and thus not worth the company's investment. 17x micro-benchmarks don't count. Real improvements come from architectural changes, for example MoE, speculative multi-token prediction, etc.
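One way to sanity-check the Moore's law comparison is to annualize a speedup and set it against a doubling period. The sketch below assumes the commonly quoted figure of one doubling every two years; that period, the function names, and the sample numbers are assumptions for illustration, not figures from this thread.

```python
def annualized_rate(speedup, years):
    """Per-year improvement factor implied by a total speedup over `years`."""
    if speedup <= 0 or years <= 0:
        raise ValueError("speedup and years must be positive")
    return speedup ** (1.0 / years)


def beats_moore(speedup, years, doubling_period_years=2.0):
    """True if the annualized speedup exceeds the assumed Moore's law pace.

    doubling_period_years=2.0 is an assumption (the commonly cited doubling
    period), not a measured value.
    """
    moore_per_year = 2.0 ** (1.0 / doubling_period_years)
    return annualized_rate(speedup, years) > moore_per_year


# Hypothetical end-to-end gains, not micro-benchmark results.
print(beats_moore(1.2, 1.0))  # a 1.2x gain delivered in one year
print(beats_moore(4.0, 2.0))  # a 4x gain delivered over two years
```

The check only makes sense on whole-workload speedups. Feeding it a kernel-level micro-benchmark number would overstate the gain, which is the objection being raised here.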