alecco 4 hours ago
Yeah, not a great apples-to-apples comparison. I think the point stands: MoE, a myriad of complex attention approaches, shared layers, you name it. And making it all work together well is a huge trial-and-error pain even for small models, never mind getting to efficient hardware utilization.