modeless | 3 days ago
They are trained on exactly the same data, in the same order, with the same optimizer, because they are literally the same base model with a little fine-tuning added on top. I see now that they did one experiment with models trained from scratch: they trained five ResNet-50s on five disjoint datasets of natural images, most quite small, and IIUC they were able to combine them, without further training, into one "universal" model that can be adapted to have only somewhat worse performance on any one of the five datasets (one of them is actually pretty bad) using only ~35 adaptation parameters. Which is kind of cool, I guess, but I also don't find it that surprising. I don't expect you'd get the same finding at large scale in LLMs trained from scratch on disjoint and dissimilar data with different optimizers etc.; that I would find surprising. But it would be very expensive to do that experiment, so I understand why they weren't able to.
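For concreteness, here is a minimal sketch of one way a merge-then-adapt setup like that could be wired up, where the only trainable parameters are per-model, per-stage mixing coefficients: 5 models × 7 ResNet stages = 35 numbers, which is one way a "~35 adaptation parameters" figure could arise. This is not necessarily the paper's actual method; the `MergedResNet` class, the softmax mixing, and the stage grouping below are illustrative assumptions.

```python
import torch
from torch import nn
from torch.func import functional_call  # PyTorch >= 2.0
from torchvision.models import resnet50

# Hypothetical sketch: N frozen ResNet-50s are merged into one network whose
# weights are a per-stage convex combination of the source weights. The only
# trainable parameters are the mixing coefficients.

STAGES = ("conv1", "bn1", "layer1", "layer2", "layer3", "layer4", "fc")

class MergedResNet(nn.Module):
    def __init__(self, sources):
        super().__init__()
        self.template = resnet50(weights=None)  # provides the architecture only
        self.param_keys = {k for k, _ in self.template.named_parameters()}
        # Frozen copies of every source model's weights and buffers.
        self.source_states = [
            {k: v.detach().clone() for k, v in m.state_dict().items()}
            for m in sources
        ]
        # 5 models x 7 stages = 35 adaptation parameters.
        self.logits = nn.Parameter(torch.zeros(len(sources), len(STAGES)))

    def mixed_state(self):
        w = self.logits.softmax(dim=0)  # convex mixing weights per stage
        mixed = {}
        for key, ref in self.source_states[0].items():
            if not ref.is_floating_point():
                mixed[key] = ref  # e.g. BatchNorm's num_batches_tracked counter
                continue
            stage = next(i for i, s in enumerate(STAGES) if key.startswith(s))
            blended = sum(w[m, stage] * sd[key]
                          for m, sd in enumerate(self.source_states))
            # Detach buffers (BatchNorm running stats) so only the weights
            # carry gradient back to the mixing coefficients.
            mixed[key] = blended if key in self.param_keys else blended.detach()
        return mixed

    def forward(self, x):
        # Run the template architecture with the mixed weights; gradients
        # flow only into self.logits.
        return functional_call(self.template, self.mixed_state(), (x,))

# Adapting to a new dataset: train only the 35 coefficients.
sources = [resnet50(weights=None) for _ in range(5)]  # stand-ins for the five trained models
model = MergedResNet(sources)
model.eval()  # keep BatchNorm from updating its (mixed) running stats in place
opt = torch.optim.Adam([model.logits], lr=1e-2)
x = torch.randn(2, 3, 224, 224)
loss = model(x).square().mean()  # placeholder loss for illustration
loss.backward()
opt.step()
```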
mlpro | 2 days ago | parent
They are not trained on the same data. Even a skim of the paper shows the fine-tuning data is very disjoint: I checked, and some of the LLMs are fine-tuned on Chinese and others on math. The pretrained model just provides a good initialization. I'm convinced.