Remix.run Logo
davesque 44 minutes ago

> The more fundamental bottleneck is not even the frontier models, it's the datacenters.

Is it even though? Quantization and speculative decoding are improving the local AI story by leaps and bounds every month.

zozbot234 41 minutes ago | parent [-]

Speculative decoding is not that useful at scale, it's mostly about making local single-user inference faster. When you're batching multiple inferences together, that's already as fast as the verification you have to perform w/ speculative decoding.