upupupandaway 4 hours ago
Any data on performance?
carlovalenti 3 hours ago | parent
Good question; I hope the answer doesn't disappoint you too much:

1. I made no benchmark comparisons against other existing projects.
2. TRiP is CPU-only.
3. The matmul kernel is not hand-optimized. I have some experience in this and made several attempts, but could not achieve a significant improvement over plain gcc 13 -Ofast, so I decided to leave it readable and just moved forward. The only optimization hint left is probably the directive that aligns allocations to the cache-line size.

I considered adding flash attention, but the CPU memory hierarchy does not benefit from it at the same level as GPUs do. I also briefly considered using optimized libraries, but I actually got bad results, and in any case that was not my main focus (learning the transformer architecture in detail).

This does not mean that TRiP is horribly slow! Keeping the kernel straightforward, plus the alignment, should help the optimizer make fair use of unrolling, strides, and vectorization. If you have any suggestions to improve it (and there's surely room for improvement), I'd be glad to hear them, and if it doesn't complicate things to the point of undermining the educational purpose, I could put it in! Thank you for your interest!