eldenring 3 hours ago
This is a common way of thinking. In practice, this type of thing is more like optimizing FLOP allocation. Surely, with an infinite compute and parameter budget, you could build a better model out of more intensive operations. Another thing to consider is that transformers are very general computers: you can encode many more complex architectures in simpler, multi-layer transformers.