turingsroot | 5 hours ago
This is a really impressive piece of systems engineering. The 3-tier adaptive caching (VRAM resident > pinned RAM > NVMe/mmap) is essentially reimplementing what the Linux kernel's page cache does, but with GPU-awareness baked in.

The 0.3 tok/s for 70B Q4_K_M on a single 3090 is slow for interactive use, but the architecture itself is what matters here. PCIe Gen3 x8 at ~6.5 GB/s is the clear bottleneck - I'd be very curious to see numbers on a Gen5 NVMe setup where sequential reads can hit 12+ GB/s. That alone could potentially double throughput.

The layer skip via cosine similarity calibration (20 of 80 layers skipped) is a clever trick. Reminds me of early work on adaptive computation in transformers. The quality tradeoff at threshold 0.98 would be interesting to benchmark more rigorously - for many inference tasks like summarization or classification, you could probably push it much further.

Also worth noting: zero external dependencies beyond the CUDA Toolkit is a bold design choice. No cuBLAS means they wrote their own GEMM kernels, which is a massive undertaking but gives full control over the memory access patterns needed for this streaming architecture.
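To make the streaming side concrete, here's roughly the overlap pattern a pinned-RAM tier buys you. This is a minimal sketch, not the project's actual code - all names and sizes are made up - but it shows the core trick: while layer N computes on one stream, layer N+1's weights are already in flight over PCIe into a second VRAM buffer on another.

    // Sketch: double-buffered layer streaming, pinned RAM -> VRAM.
    // Compute on one stream, H2D copies on another, events for hand-off.
    #include <cuda_runtime.h>
    #include <cstdio>

    const int    N_LAYERS    = 80;
    const size_t LAYER_BYTES = 64ull << 20;  // toy size; real layers are larger

    __global__ void layer_kernel(const char* w) {  // stand-in for real compute
        volatile char c = w[0]; (void)c;
    }

    int main() {
        char* h_layers;  // the pinned-RAM tier, one slab holding all layers
        cudaMallocHost((void**)&h_layers, N_LAYERS * LAYER_BYTES);

        char* d_buf[2];  // double buffer in VRAM
        cudaMalloc((void**)&d_buf[0], LAYER_BYTES);
        cudaMalloc((void**)&d_buf[1], LAYER_BYTES);

        cudaStream_t copy_s, comp_s;
        cudaStreamCreate(&copy_s);
        cudaStreamCreate(&comp_s);
        cudaEvent_t ready[2], done[2];
        for (int b = 0; b < 2; ++b) {
            cudaEventCreate(&ready[b]);
            cudaEventCreate(&done[b]);
        }

        for (int i = 0; i < N_LAYERS; ++i) {
            int cur = i & 1;
            // Copy layer i into its buffer only after the previous user of
            // that buffer (layer i-2) has finished computing.
            cudaStreamWaitEvent(copy_s, done[cur], 0);
            cudaMemcpyAsync(d_buf[cur], h_layers + (size_t)i * LAYER_BYTES,
                            LAYER_BYTES, cudaMemcpyHostToDevice, copy_s);
            cudaEventRecord(ready[cur], copy_s);

            // Compute layer i only once its weights have landed.
            cudaStreamWaitEvent(comp_s, ready[cur], 0);
            layer_kernel<<<256, 256, 0, comp_s>>>(d_buf[cur]);
            cudaEventRecord(done[cur], comp_s);
        }
        cudaDeviceSynchronize();
        printf("streamed %d layers\n", N_LAYERS);
        return 0;
    }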
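And the calibration pass for the layer skip, as I understand the description: record the hidden state entering and leaving each layer, and mark the layer skippable when the two are nearly parallel. Hypothetical names throughout, and the forward pass is a toy stand-in rigged so ~20/80 layers qualify.

    // Sketch: mark layers whose output barely rotates the hidden state.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Cosine similarity between the hidden state before and after a layer.
    float cosine(const float* a, const float* b, int n) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < n; ++i) {
            dot += (double)a[i] * b[i];
            na  += (double)a[i] * a[i];
            nb  += (double)b[i] * b[i];
        }
        return (float)(dot / (std::sqrt(na * nb) + 1e-12));
    }

    int main() {
        const int   d_model = 8192, n_layers = 80;
        const float threshold = 0.98f;  // the threshold from the post
        std::vector<float> h_in(d_model, 1.0f), h_out(d_model);
        std::vector<int> skip(n_layers, 0);

        for (int l = 0; l < n_layers; ++l) {
            // run_layer(l, h_in, h_out) would go here; this toy makes
            // every 4th layer nearly a no-op and the rest significant.
            float eps = (l % 4 == 0) ? 0.001f : 0.2f;
            for (int i = 0; i < d_model; ++i)
                h_out[i] = h_in[i] + eps * (i % 7);
            if (cosine(h_in.data(), h_out.data(), d_model) >= threshold)
                skip[l] = 1;  // at inference: pass the input straight through
        }
        int n_skip = 0;
        for (int s : skip) n_skip += s;
        printf("%d / %d layers marked skippable\n", n_skip, n_layers);
        return 0;
    }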
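On the GEMM point: what they presumably ship is fused dequant-plus-matmul over Q4 blocks, which is much hairier, but even the textbook shared-memory tile shows why owning the kernel matters - you decide exactly how weights move through the cache hierarchy. For reference (standard tiled SGEMM, nothing project-specific):

    #include <cuda_runtime.h>

    #define TILE 16

    // Textbook shared-memory tiled GEMM: C = A * B, row-major,
    // A is MxK, B is KxN. Each block computes one TILExTILE tile of C.
    __global__ void sgemm_tiled(const float* A, const float* B, float* C,
                                int M, int N, int K) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
            int a_col = t * TILE + threadIdx.x;
            int b_row = t * TILE + threadIdx.y;
            // Stage one tile of A and B through shared memory (coalesced loads).
            As[threadIdx.y][threadIdx.x] =
                (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < M && col < N) C[row * N + col] = acc;
    }

    // Launch: dim3 block(TILE, TILE);
    //         dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    //         sgemm_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);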
Aurornis | 4 hours ago
> No cuBLAS means they wrote their own GEMM kernels, which is a massive undertaking

Not to diminish the impressiveness of the overall project, but it says right up front that these were vibe coded, and the Opus 4.6 co-author lines are right there in the commit messages. Those pieces were adapted from existing work via LLM, which is exactly the right use of one in a proof-of-concept project like this.
snovv_crash | 2 hours ago
Please don't use LLMs to post on HN...