Show HN: I built a 2nd-order PyTorch optimizer for LLMs that runs on 16GB GPUs
2 points by dnosoz 5 hours ago | 3 comments
Hi HN, I'm Danilo. I've been struggling with the limitations of AdamW when fine-tuning LLMs locally. Second-order optimizers like Shampoo or SOAP converge in significantly fewer steps by exploiting Kronecker-factored curvature. The problem? They require O(d^2) memory and O(d^3) compute per layer, which immediately OOMs consumer hardware like a 16GB T4 or an RTX 3090.

I wanted Shampoo-quality preconditioning on my home setup, so I built SCAO (Sparse Curvature-Aware Optimizer). It's a PyTorch optimizer that acts as a drop-in replacement for AdamW, but with a few strict architectural changes to survive on consumer cards:

1. Adaptive rank selection: instead of full-rank Kronecker factors, it truncates the eigenspace to retain >=95% of the spectral mass.

2. Int8 EMA quantization: the curvature accumulators are stored in symmetric int8, a 4x memory reduction over fp32 with no measurable degradation in perplexity.

3. Quantization stability: standard Shampoo usually crashes at step 1 during 4-bit QLoRA fine-tuning because the SVD is ill-conditioned in quantized spaces. SCAO uses sparse approximations to sidestep this.

4. Fused CUDA kernels: I wrote custom kernels to remove an O(k * m^2 * n) complexity bottleneck in the naive projection implementation.

The benchmark: I ran a head-to-head on a single T4 (16GB VRAM) fine-tuning Qwen2.5-3B (4-bit QLoRA, rank 16):

- Shampoo: failed at step 1 (SVD collapse).

- SCAO: 100% stability, peak VRAM of 7.14 GB, with a smooth loss descent.

It's pip-installable (pip install scao). I've written a technical report detailing the regret bounds, ablation studies, and scaling laws (published on Zenodo), but I'd really like this community's eyes on the CUDA kernels and the PyTorch implementation.

GitHub: https://github.com/whispering3/scao

Technical report (DOI): https://doi.org/10.5281/zenodo.19870556

I'd love any feedback, code roasts, or questions about the math behind it!
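Point 1 above, truncating the eigenspace to keep >=95% of spectral mass, can be sketched in a few lines. This is an illustrative NumPy version of the general idea, not SCAO's actual implementation (the function name and threshold handling are mine):

```python
import numpy as np

def truncate_spectral_mass(C, mass=0.95):
    """Keep the top-k eigenpairs of a symmetric PSD matrix C that
    together capture at least `mass` of its spectral mass
    (fraction of the eigenvalue sum, i.e. of the trace)."""
    w, V = np.linalg.eigh(C)           # eigh returns ascending eigenvalues
    w, V = w[::-1], V[:, ::-1]         # sort descending
    cum = np.cumsum(w) / w.sum()
    k = int(np.searchsorted(cum, mass)) + 1  # first index where cum >= mass
    return w[:k], V[:, :k]

# A curvature matrix dominated by a rank-4 signal plus a small ridge:
rng = np.random.default_rng(0)
G = rng.standard_normal((64, 4))
C = G @ G.T + 1e-3 * np.eye(64)
w_k, V_k = truncate_spectral_mass(C)
# w_k has far fewer than 64 entries, yet preserves >= 95% of the trace.
```

The payoff is that downstream inverse-root computations run on a k x k core instead of the full d x d factor, which is where the memory and compute savings come from.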
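Point 2, keeping the EMA accumulators in symmetric int8, looks roughly like the following. This is a minimal per-tensor NumPy sketch of the dequantize-update-requantize cycle; SCAO's real scheme (per-channel scales, error feedback, etc.) may differ, and all names here are mine:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8: one fp32 scale, codes in [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

def ema_update_int8(q, scale, grad_sq, beta=0.99):
    """One EMA step on a curvature accumulator that lives in int8
    between optimizer steps."""
    v = dequantize_int8(q, scale)
    v = beta * v + (1.0 - beta) * grad_sq
    return quantize_int8(v)

rng = np.random.default_rng(1)
v = np.abs(rng.standard_normal(1024)).astype(np.float32)
q, s = quantize_int8(v)
roundtrip_err = np.abs(dequantize_int8(q, s) - v).max()  # <= scale / 2
q, s = ema_update_int8(q, s, v * v)
```

Since each int8 code replaces an fp32 value, this is where the 4x accumulator-memory reduction claimed above comes from; the quantization error per element is bounded by half the scale.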
dnosoz 5 hours ago
Author here. Happy to answer any deep-dive questions about the CUDA implementation or the Kronecker factorization math. | ||
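For anyone unfamiliar with the Kronecker-factorization math being referenced: full-matrix Shampoo (Gupta et al., 2018) keeps two factors per 2-D layer, L accumulating G G^T and R accumulating G^T G, and preconditions the gradient with their inverse fourth roots. A minimal NumPy sketch of that baseline (not SCAO itself):

```python
import numpy as np

def inv_pth_root(M, p):
    """Inverse p-th root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    w = np.maximum(w, 1e-12)  # guard against round-off negatives
    return (V * w ** (-1.0 / p)) @ V.T

def shampoo_precondition(G, L, R):
    """Accumulate the Kronecker factors of the curvature, then precondition
    the gradient with their inverse fourth roots (full-matrix Shampoo)."""
    L = L + G @ G.T
    R = R + G.T @ G
    return inv_pth_root(L, 4) @ G @ inv_pth_root(R, 4), L, R

rng = np.random.default_rng(0)
G = rng.standard_normal((16, 16))  # gradient of a toy 16x16 layer
P, L, R = shampoo_precondition(G, np.zeros((16, 16)), np.zeros((16, 16)))
# From a zero state, P reduces to U V^T (the rotation factors of G's SVD):
# the preconditioner whitens the gradient's singular values to 1.
```

The O(d^2) memory (L and R) and O(d^3) compute (the eigendecompositions) in this baseline are exactly the costs the post's rank truncation and quantization are attacking.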
satvikpendem 5 hours ago
Your account is shadowbanned, by the way; I guess you've just been self-promoting too much.
lostmsu 4 hours ago
Does it actually improve time to target loss? | ||