xdavidliu 7 hours ago
I don't understand what the second paragraph is saying.
nine_k 6 hours ago
In very crude terms, AFAICT: if you have a bunch of matrix multiplications, but one of the matrices (the one with the model weights) doesn't change, you can seriously speed up the computation. For one thing, you don't need to re-fetch the elements of the constant matrix; you can keep it near the ALUs. You may also be able to detect sparse / empty blocks once, mark them, and skip them from then on. IDK how the custom hardware exploits this; would love to hear any ideas!
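A rough software sketch of the "mark empty blocks once, then skip them" idea, purely for illustration. It says nothing about how the custom hardware actually works; the tile size `BLOCK`, the example matrix `W`, and the helper names `nonzero_blocks`, `matvec_skipping`, and `matvec_dense` are all made up here.

```python
from typing import List, Tuple

Matrix = List[List[float]]

BLOCK = 2  # hypothetical tile size, chosen only for this example


def nonzero_blocks(w: Matrix, block: int = BLOCK) -> List[Tuple[int, int]]:
    """Scan the constant weight matrix once and return the (row, col)
    origins of every tile that contains at least one nonzero entry."""
    rows, cols = len(w), len(w[0])
    keep = []
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            if any(
                w[r][c] != 0.0
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            ):
                keep.append((r0, c0))
    return keep


def matvec_skipping(
    w: Matrix, x: List[float], keep: List[Tuple[int, int]], block: int = BLOCK
) -> List[float]:
    """Compute y = W @ x, visiting only the tiles marked as nonzero."""
    rows, cols = len(w), len(w[0])
    y = [0.0] * rows
    for r0, c0 in keep:
        for r in range(r0, min(r0 + block, rows)):
            acc = 0.0
            for c in range(c0, min(c0 + block, cols)):
                acc += w[r][c] * x[c]
            y[r] += acc
    return y


def matvec_dense(w: Matrix, x: List[float]) -> List[float]:
    """Reference implementation that touches every element of W."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


# Constant "weights": two of the four 2x2 tiles are entirely zero.
W = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 6.0],
]

# The scan runs once, because W never changes between inputs.
KEEP = nonzero_blocks(W)

# Every new input vector reuses the same tile list.
for x in ([1.0, 1.0, 1.0, 1.0], [1.0, 2.0, 3.0, 4.0]):
    print(matvec_skipping(W, x, KEEP))
```

The scan cost is paid once and amortized over every later input, which only works because the weights are fixed. Keeping the weights physically close to the ALUs is a hardware property that this sketch cannot show.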
otterdude 6 hours ago
Basically, it's getting around the branch predictor problem of general-purpose compute architectures: https://en.wikipedia.org/wiki/Branch_predictor