drivebyhooting · 3 hours ago:
I'm interested in learning more about your theory that these models can be trained more cheaply. Is anyone doing it from scratch, rather than adversarial distillation?
2ndorderthought · 3 hours ago:
It is a lot cheaper to train a 27B model such as qwen3.6, which you can even vibe code or do agentic coding with, than it is to train a 1T+ parameter model. It runs on a single commodity GPU, for goodness' sake. It's not a theory; these smaller models coming out are huge advances for the field.

I can't comment on companies' training practices; that would be proprietary. But I think the claim that the advances are due to distillation alone is completely unfair. The advances are not just data.
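For readers unfamiliar with the term: distillation here generally means training a small "student" model to imitate a larger "teacher" model's output distribution, rather than learning from raw data alone. Below is a minimal sketch of the classic soft-target loss (Hinton et al., 2015), assuming PyTorch; it illustrates the general technique, not any particular lab's pipeline:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-target knowledge distillation loss.

    The student is trained to match the teacher's temperature-softened
    output distribution instead of (or in addition to) hard labels.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between student and teacher distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```

In practice this term is usually blended with an ordinary cross-entropy loss on ground-truth tokens, so the student learns from both the teacher and the data.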
freeone3000 · 2 hours ago:
It almost doesn't matter whether it's trained using adversarial distillation: if it's nearly as good at one-hundredth the cost, the choice is obvious.