datadrivenangel 8 hours ago

In my experience, once you get to ~30 GB of RAM for a model like Gemma4, the rest of the 128 GB of memory is simply nice to have. The speed and cost are what make it tough, though: it's slower and more expensive than the same model served on a big accelerator card, and it's going to be worse than a frontier model.
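Roughly where the ~30 GB figure comes from, assuming a 27B-parameter model and the usual quantization levels (back-of-envelope only, not exact for any particular runtime):

    # Back-of-envelope memory estimate for serving a model locally.
    # Assumes a ~27B-parameter model; quantization levels are illustrative.

    def model_memory_gb(params_billion, bits_per_weight, overhead=1.2):
        """Weights only, plus ~20% headroom for KV cache and runtime overhead."""
        weight_bytes = params_billion * 1e9 * bits_per_weight / 8
        return weight_bytes * overhead / 1e9

    for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
        print(f"{label}: ~{model_memory_gb(27, bits):.0f} GB")

    # FP16: ~65 GB  -- already most of a 128 GB machine
    # Q8:   ~32 GB  -- roughly the ~30 GB figure above
    # Q4:   ~16 GB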

digitaltrees 7 hours ago | parent [-]

I wonder if it really needs to be worse. I am playing with the idea of fine-tuning a model on my exact stack and coding patterns. I suspect I could get better performance by training "taste" into a model rather than breadth.
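The cheap way to try this is a LoRA adapter over your own repos rather than a full fine-tune; a minimal sketch with Hugging Face peft (the base model name and hyperparameters here are just placeholders, not recommendations):

    # Minimal LoRA fine-tuning setup with transformers + peft.
    # Base model, rank, and target modules are illustrative placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "google/gemma-2-27b-it"  # placeholder; use whatever fits your hardware
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

    lora = LoraConfig(
        r=16,                # low-rank adapter: only a small fraction of weights are trained
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the base weights

    # Then train on samples built from your own codebase (file- or diff-level)
    # with transformers' Trainer or trl's SFTTrainer.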

andy_ppp 3 hours ago | parent | next [-]

Fine-tuning these models (at least with PPO or an equivalent RL method) requires even more VRAM than inference does, potentially 2-3 times more.
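Rough rule-of-thumb arithmetic for why, with full fine-tuning in mixed precision under Adam as the extreme case (adapter methods like LoRA cut the gradient and optimizer-state cost back down, which is why they're the usual choice locally):

    # Rule-of-thumb per-parameter memory for a full fine-tune in mixed precision with Adam.
    # Activations and KV cache excluded; numbers are rough, not trainer-exact.

    def inference_gb(params_billion, bytes_per_param=2):   # fp16/bf16 weights only
        return params_billion * bytes_per_param

    def full_finetune_gb(params_billion):
        # fp16 weights + fp16 grads + Adam fp32 states (master copy + two moments)
        per_param = 2 + 2 + 12
        return params_billion * per_param

    print(f"inference (fp16):      ~{inference_gb(27):.0f} GB")      # ~54 GB
    print(f"full fine-tune (Adam): ~{full_finetune_gb(27):.0f} GB")  # ~432 GB
    # PPO-style training is heavier still: the policy trains as above while a
    # frozen reference model (and often a reward/critic model) also sits in memory.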

epicureanideal 3 hours ago | parent | prev [-]

I also wonder about JS-only, Python-only, etc. models.

Maybe the future is a selection of local models, each trained on a specific stack?

andy_ppp 3 hours ago | parent [-]

These models' ability to generalise at coding will likely get worse if you remove high-quality training data, like all of the Python code.