As someone who works in the field, the blog is nice but it has a lot of CODEX fingerprints on it, and it's also very specific to the size of the model in question in a way that is not explicit from the blog until the very last section.

In general, for some reason CODEX loves CUDA-streams, it's the first optimization it goes for every time when writing GPU kernels. However in many cases this is not a bottleneck, it happens to be so here because the model in the blog is small (2.4ms FW-pass is tiny, and 9B params sit on a single GPU). Large models are closer to 30-40ms. The CPU-GPU sync is 1-2ms, when working on larger MoE models the scheduling of tokens in this way is much less important than for example scheduling of computation/communication or kernel optimization.

I wish the blog would state this at the start with the premise of what has been done, or show that this is indeed the bottleneck with some benchmarking. Otherwise is kind of overselling things imo.

▲

radq 4 hours ago | parent [-]

Appreciate you saying the blog was nice. Not sure what you mean by "CODEX fingerprints", but I'll engage with the other points. We work on small models, and our customers want real-time inference on modern GPUs. The sub-title says "near-realtime VLM inference". 20-30ms forward passes are a non-starter for these workloads.

If you scroll down to the section titled "A cost model for the bubble", you will find both benchmark results and us saying, "you get back anywhere from a few percent to a third; more the faster your accelerator/model is".

	▲	augment_me an hour ago \| parent [-]
		My comment is aimed to highlight that the "GPU Bubble" is frames as a general solution when it's not, its a specific bottleneck based on your model size. Your dont mention your model size anywhere, the reader has to infer it from the runtimes, and if they dont know the average forward pass of a model, well too bad, they will leave without understanding the actual trade-off. The benchmarks you point to in the section titled "A cost model for the bubble" dont include any CPU overheads or the T_block-T_pipe you mention, they just give the improvement %. In general, you answers here in the thread read as defensive and unhumble. They leave a sour taste of your company, you should consider how you engage with your audience.