drivebyhooting · 3 hours ago:
I'm interested in learning more about your theory that these models can be trained more cheaply. Is anyone doing it from scratch, rather than adversarial distillation?
2ndorderthought · 3 hours ago:
It is a lot cheaper to train a 27B model such as qwen3.6, which you can even vibe code or do agentic coding with, than it is to train a 1T+ parameter model. It runs on a single commodity GPU, for goodness' sake. It's not a theory; these smaller models coming out are huge advances for the field.

I can't comment on companies' training practices; that would be proprietary. But I think the claim that the advances are due to distillation alone is completely unfair. The advances are not just data.
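For readers unfamiliar with the term: distillation here generally means training a small "student" model to imitate a larger "teacher" model's output distribution, rather than learning from raw data alone. Below is a minimal sketch of the classic soft-target loss (Hinton et al., 2015), assuming PyTorch; it illustrates the general technique, not any particular lab's pipeline:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-target knowledge distillation loss.

    The student is trained to match the teacher's temperature-softened
    output distribution instead of (or in addition to) hard labels.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between student and teacher distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```

In practice this term is usually blended with an ordinary cross-entropy loss on ground-truth tokens, so the student learns from both the teacher and the data.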
freeone3000 · 2 hours ago:
It almost doesn't matter whether it's trained using adversarial distillation: if it's nearly as good at one-hundredth the cost, the choice is obvious.