Remix.run Logo
axoltl 7 hours ago

It'd run on a 5090 with 32GB of VRAM at fp8 quantization which is generally a very acceptable size/quality trade-off. (I run GLM-4.5-Air at 3b quantization!) The transformer architecture also lends itself quite well to having different layers of the model running in different places, so you can 'shard' the model across different compute nodes.