mich5632 8 days ago

I think this is the difference between compute-bound prefill (a CPU has a high bandwidth-to-compute ratio) and memory-bound decode. The time to first token is below 0.5 s, even for a 10k-token context.
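A rough roofline-style sketch of the prefill/decode split, assuming the usual approximations: prefill costs about 2 FLOPs per parameter per prompt token (compute-bound), and each decoded token reads every weight once (bandwidth-bound). The model size, quantization, FLOP/s and bandwidth figures are hypothetical placeholders, not measurements of any particular machine, and attention cost over the context is ignored.

```python
# Back-of-envelope estimate of prefill vs decode time for LLM inference.
# All model and hardware numbers used below are hypothetical placeholders.


def prefill_seconds(params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Compute-bound prefill: ~2 FLOPs per parameter per prompt token.

    Ignores the attention term that grows with context length.
    """
    return 2.0 * params * prompt_tokens / flops_per_s


def decode_seconds_per_token(params: float, bytes_per_param: float, mem_bw: float) -> float:
    """Bandwidth-bound decode: all weights are streamed from memory once per token."""
    return params * bytes_per_param / mem_bw


if __name__ == "__main__":
    params = 8e9            # hypothetical 8B-parameter model
    bytes_per_param = 0.5   # hypothetical 4-bit quantized weights
    flops = 400e12          # hypothetical 400 TFLOP/s effective compute
    bw = 500e9              # hypothetical 500 GB/s memory bandwidth

    ttft = prefill_seconds(params, 10_000, flops)
    tpot = decode_seconds_per_token(params, bytes_per_param, bw)
    print(f"prefill for 10k tokens: {ttft:.2f} s")
    print(f"decode: {tpot * 1000:.1f} ms/token ({1 / tpot:.0f} tokens/s)")
```

With these placeholder numbers, prefill scales with compute (FLOP/s) while decode scales with memory bandwidth, which is why a machine can have a fast time to first token yet comparatively slow generation.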