burakemir 2 hours ago
Take this with a grain of salt as I am new to this, but IMHO for establishing memory hierarchy once and for all, it would be more helpful to present some abstract theory that covers:

* prefill (time to first token, TTFT) vs. decode (time between tokens, TBT, aka 1/tps)
* the various ways to schedule the computation, and the roles of runtime vs. driver
* the scenarios and choices, taking into account traffic patterns and whether you are an inference service, doing batch, or claw whatnot.
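To make the TTFT vs. TBT split concrete, here is a rough sketch of how the two numbers combine into per-request latency. It assumes a constant TBT over the whole decode phase, which real systems will not give you. The function names and the example figures are made up for illustration and are not measurements.

```python
def request_latency(ttft_s: float, tbt_s: float, n_tokens: int) -> float:
    """End-to-end latency for one request, in seconds.

    Assumption: prefill produces the first token after ttft_s, and each
    of the remaining n_tokens - 1 tokens takes a constant tbt_s to decode.
    """
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    return ttft_s + (n_tokens - 1) * tbt_s


def decode_tps(tbt_s: float) -> float:
    """Decode throughput in tokens per second, i.e. 1 / TBT."""
    if tbt_s <= 0:
        raise ValueError("tbt_s must be positive")
    return 1.0 / tbt_s


if __name__ == "__main__":
    # Hypothetical figures: 0.5 s TTFT, 20 ms TBT, 101 output tokens.
    print(request_latency(0.5, 0.02, 101))
    print(decode_tps(0.02))
```

For short outputs the TTFT term dominates, and for long outputs the TBT term does, which is one reason the two phases are worth reasoning about separately.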