cthalupa an hour ago

Not true. This is aimed squarely at the Strix Halo and Mac markets. It's basically strictly better than the Strix, and it's not clear-cut vs. the Macs in any blanket sense.

My M5 Max 128GB MBP decodes faster than one of my Sparks, but the Spark's prefill is so much faster that it can often answer the same query before the Mac's prefill is finished. If you have large prompts, low cacheability, etc., a Spark might be a very good option.
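To make that tradeoff concrete, here's a rough sketch of the arithmetic. The throughput numbers are made-up placeholders, not measurements of either machine, and `time_to_answer` is just a hypothetical helper that assumes total latency is prefill time plus decode time.

```python
def time_to_answer(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Seconds to finish one request: prefill the prompt, then decode the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Hypothetical tokens-per-second figures, chosen only to illustrate the shape
# of the tradeoff. They are NOT benchmarks of any real hardware.
fast_decode = dict(prefill_tps=500.0, decode_tps=40.0)
fast_prefill = dict(prefill_tps=2000.0, decode_tps=20.0)

for prompt_tokens in (1_000, 50_000):
    a = time_to_answer(prompt_tokens, 500, **fast_decode)
    b = time_to_answer(prompt_tokens, 500, **fast_prefill)
    print(f"{prompt_tokens:>6} prompt tokens: fast-decode {a:.1f}s, fast-prefill {b:.1f}s")
```

With these placeholder rates the fast-decode box wins on the short prompt, while on the long prompt the fast-prefill box is done before the other one has even finished prefill. Where the crossover sits depends entirely on your real numbers and prompt sizes.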

Not to mention you can get two Sparks, and the MBP will still be 85%+ of that cost with half the RAM.

I'm kind of tempted to pick one up. Leave the big models to my dual DGX setup, and run all the misc. random stuff on an RTX.

zozbot234 2 minutes ago | parent [-]

Prefill will be a huge deal if batched, unattended inference of SOTA models (on consumer platforms) becomes viable, because at that point it's the main remaining bottleneck. If running 30 inferences together boosts your aggregate decode throughput to 3x (that's consistent with some very rough experiments, though these haven't even tried to mask SSD offload latency yet), the batch costs 10x a single request in total decode time but 30x in total prefill time, because prefill workloads are already fully parallelized on consumer platforms and barely benefit from batching.
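Spelling out that arithmetic as a quick sketch: `batch_slowdown` is a hypothetical helper, and it assumes each phase's total time scales as batch size divided by that phase's throughput gain from batching, with prefill getting no gain at all.

```python
def batch_slowdown(batch_size, decode_speedup, prefill_speedup=1.0):
    """Total time per phase for the whole batch, as a multiple of one request's time.

    decode_speedup / prefill_speedup are the aggregate throughput gains from
    batching. prefill_speedup defaults to 1.0 (assumed: no gain from batching).
    """
    decode_x = batch_size / decode_speedup
    prefill_x = batch_size / prefill_speedup
    return decode_x, prefill_x

decode_x, prefill_x = batch_slowdown(batch_size=30, decode_speedup=3.0)
print(f"decode: {decode_x:.0f}x, prefill: {prefill_x:.0f}x")  # decode: 10x, prefill: 30x
```

So the more batching helps decode, the larger the share of wall-clock time that prefill takes up.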