radq 4 hours ago

Appreciate you saying the blog was nice. Not sure what you mean by "CODEX fingerprints", but I'll engage with the other points. We work on small models, and our customers want real-time inference on modern GPUs. The sub-title says "near-realtime VLM inference". 20-30ms forward passes are a non-starter for these workloads.

If you scroll down to the section titled "A cost model for the bubble", you will find both benchmark results and us saying, "you get back anywhere from a few percent to a third; more the faster your accelerator/model is".

augment_me an hour ago

My comment was meant to highlight that the "GPU Bubble" is framed as a general solution when it's not; it's a specific bottleneck that depends on your model size. You don't mention your model size anywhere, so the reader has to infer it from the runtimes, and if they don't know the typical forward-pass time of a model, well, too bad: they will leave without understanding the actual trade-off.

The benchmarks you point to in the section titled "A cost model for the bubble" don't include any CPU overheads or the T_block-T_pipe you mention; they just give the percentage improvement.

In general, your answers here in the thread read as defensive and unhumble. They leave a sour taste about your company; you should consider how you engage with your audience.