fulafel 8 hours ago

The marketing subterfuge might be about exactly this: technically, prompt processing means the prefill phase of inference. So prompt goes in 4x as fast but generates tokens slower.

This even seems likely, as memory bandwidth hasn't increased enough for those kinds of speedups, and I guess prefill is more likely to be compute-bound (vs. memory-bandwidth-bound).
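
A rough back-of-envelope sketch of that reasoning, in Python. It assumes batch size 1, about 2 FLOPs per parameter per token, and that decoding has to stream every weight from memory once per generated token; it ignores KV-cache traffic and attention cost. The model size, bandwidth, and compute figures are hypothetical placeholders, not specs of any real chip.

```python
# Back-of-envelope roofline estimate. All hardware numbers are hypothetical.

def decode_tokens_per_s(weight_bytes: float, mem_bw_bytes_per_s: float) -> float:
    """Upper bound for decode: each new token streams all weights from memory once."""
    return mem_bw_bytes_per_s / weight_bytes


def prefill_tokens_per_s(n_params: float, flops_per_s: float) -> float:
    """Upper bound for prefill: ~2 FLOPs per parameter per prompt token,
    with the weights reused across the whole prompt."""
    return flops_per_s / (2 * n_params)


if __name__ == "__main__":
    n_params = 8e9                  # hypothetical 8B-parameter model
    weight_bytes = n_params * 0.5   # assuming 4-bit quantized weights
    mem_bw = 400e9                  # hypothetical memory bandwidth, bytes/s

    # Quadruple the compute while leaving memory bandwidth unchanged.
    for label, flops in [("baseline", 20e12), ("4x compute", 80e12)]:
        prefill = prefill_tokens_per_s(n_params, flops)
        decode = decode_tokens_per_s(weight_bytes, mem_bw)
        print(f"{label}: prefill <= {prefill:.0f} tok/s, decode <= {decode:.0f} tok/s")
```

Under those assumptions, quadrupling compute quadruples the prefill bound while the decode bound doesn't move, since it only depends on bandwidth.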

petercooper 4 hours ago | parent [-]

> So prompt goes in 4x as fast but generates tokens slower.

I'd take that tradeoff. On my M3 Ultra, inference is surprisingly fast, but the prompt processing speed makes it painful except as a fallback or for experimentation, especially with agentic coding tools.
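
To put rough numbers on why that tradeoff can pay off for agentic use, here is a minimal sketch. The token counts and speeds are made up for illustration, not measurements from an M3 Ultra or any other machine, and `request_seconds` is just a hypothetical helper name.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Total latency of one request: time to ingest the prompt plus time to generate the reply."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


if __name__ == "__main__":
    # Hypothetical agentic-coding request: large context in, short answer out.
    prompt_tokens, output_tokens = 30_000, 300

    # Made-up speeds: 4x faster prefill paired with somewhat slower decode.
    before = request_seconds(prompt_tokens, output_tokens, prefill_tps=500, decode_tps=30)
    after = request_seconds(prompt_tokens, output_tokens, prefill_tps=2000, decode_tps=25)
    print(f"before: {before:.0f} s, after: {after:.0f} s")
```

With long prompts and short replies, the prefill term dominates, so speeding it up helps far more than a modest decode slowdown hurts.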