| ▲ | Show HN: WebGPU LLM inference comprehensive benchmark (arxiv.org) |
| 2 points by yu3zhou4 9 hours ago | 2 comments |
| |
|
| ▲ | emanuele-em 7 hours ago | parent [-] |
| The finding that naive single-op benchmarks overestimate dispatch cost by ~20x is wild. Curious how much the torch-webgpu backend could close the gap with CUDA if you went aggressive on kernel fusion; the 53% improvement on Vulkan is already significant. Any plans to try WGSL-level custom kernels? |
| |
| ▲ | yu3zhou4 7 hours ago | parent [-] | | Honestly there is a lot of room for improvement in torch-webgpu's performance. It needs community involvement, but the opportunities are definitely there |
|