Remix clone Hacker News

gaeld 4 hours ago

Follow-up reading for the more technical and research-minded people here:

Monokernel deep dive (GPU Engineering): http://blog.kog.ai/building-a-single-kernel-latency-optimize...

Delayed Tensor Parallelism (research): http://blog.kog.ai/delayed-tensor-parallelism-for-faster-tra...

To try the speed on the playground: http://playground.kog.ai

zozbot234 35 minutes ago | parent [-]

It looks like DTP is a distinct architectural choice that would require training new models accordingly? This wouldn't be able to just run inference for existing models.

gaeld 16 minutes ago | parent [-]

Totally, though DTP is not required for these kinds of speeds; standard TP works too. DTP is something we built for our roadmap, to reach extremely high speeds (10k+ tokens/s). When the budget is under 10 µs per layer, every bit of overhead matters. For 1k to 5k tokens/s, regular TP still works because we can get the inter-GPU all-reduce collectives under 3 µs, which lets us keep streaming model weights into shared memory, registers and caches while the GPUs exchange data.
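
For anyone wanting to sanity-check the per-layer numbers, here is a rough back-of-the-envelope sketch in Python. It assumes single-stream decoding where every layer runs once per token, so the per-layer budget is just the per-token time divided by the layer count. The layer count (16) and the function names are hypothetical, picked only for illustration; the 3 µs figure is the all-reduce bound quoted above, and treating it as one collective per layer is also an assumption.

```python
def per_layer_budget_us(tokens_per_s: float, num_layers: int) -> float:
    """Time available per layer per token, in microseconds.

    Assumes one sequential pass through all layers for each generated token.
    """
    per_token_us = 1e6 / tokens_per_s
    return per_token_us / num_layers


def allreduce_fraction(tokens_per_s: float, num_layers: int,
                       allreduce_us: float = 3.0) -> float:
    """Share of the per-layer budget one all-reduce would take if not overlapped.

    allreduce_us defaults to the 3 µs bound mentioned in the comment;
    one collective per layer is an illustrative assumption.
    """
    return allreduce_us / per_layer_budget_us(tokens_per_s, num_layers)


if __name__ == "__main__":
    NUM_LAYERS = 16  # hypothetical layer count, not from the thread
    for rate in (1_000, 5_000, 10_000):
        budget = per_layer_budget_us(rate, NUM_LAYERS)
        share = allreduce_fraction(rate, NUM_LAYERS)
        print(f"{rate:>6} tok/s: {budget:6.2f} µs/layer, "
              f"un-overlapped all-reduce share {share:.0%}")
```

Under these assumptions the per-layer budget shrinks linearly with the target token rate, which is why a fixed few-microsecond collective goes from negligible to dominant as speeds climb, unless it is overlapped with weight streaming as described.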