burakemir 2 hours ago

Take this with a grain of salt, as I am new to this, but IMHO for establishing the memory hierarchy once and for all, it would be more helpful to present some abstract theory that covers:

* Prefill (time to first token, TTFT) vs. decode (time between tokens, TBT, aka 1/tps)

* The various ways to schedule the computation, and the roles of runtime vs. driver

* The scenarios and choices, taking into account traffic patterns and whether you are an inference service, doing batch, or claw or whatnot.
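
To make the first bullet concrete, here is a rough back-of-the-envelope sketch. It assumes the usual simplification that prefill is compute-bound (about 2 FLOPs per parameter per prompt token) and that decode at batch size 1 is memory-bandwidth-bound (every weight is read once per generated token). The function names and the hardware numbers are hypothetical, chosen only for illustration; real systems add KV-cache traffic, batching, and kernel overheads that this ignores.

```python
def ttft_seconds(prompt_tokens: int, n_params: float, peak_flops: float) -> float:
    """Estimate time to first token, assuming prefill is compute-bound.

    Assumption: ~2 FLOPs per parameter per prompt token, at peak throughput.
    """
    return 2.0 * n_params * prompt_tokens / peak_flops


def tbt_seconds(n_params: float, bytes_per_param: float, mem_bandwidth: float) -> float:
    """Estimate time between tokens, assuming decode is bandwidth-bound.

    Assumption: batch size 1, all weights read once per generated token.
    """
    return n_params * bytes_per_param / mem_bandwidth


def tokens_per_second(tbt: float) -> float:
    """tps is just the reciprocal of TBT."""
    return 1.0 / tbt


# Hypothetical hardware and model, purely illustrative:
N_PARAMS = 7e9          # 7B-parameter model
BYTES_PER_PARAM = 2.0   # 16-bit weights
MEM_BANDWIDTH = 1e12    # 1 TB/s
PEAK_FLOPS = 1e14       # 100 TFLOP/s

ttft = ttft_seconds(prompt_tokens=1000, n_params=N_PARAMS, peak_flops=PEAK_FLOPS)
tbt = tbt_seconds(N_PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH)
tps = tokens_per_second(tbt)

print(f"TTFT ~ {ttft:.3f} s, TBT ~ {tbt * 1000:.1f} ms, ~ {tps:.0f} tokens/s")
```

The point of the toy model is only that the two phases hit different limits: TTFT scales with prompt length and compute, while TBT scales with model size and memory bandwidth, which is why the memory hierarchy matters so much for decode.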