CamperBob2 7 hours ago

I'm a little more optimistic than that. I suspect that the open-weight models we already have are going to be enough to support incremental development of new ones, using reasonably-accessible levels of compute.

The idea that every new foundation model needs to be pretrained from scratch, using warehouses of GPUs to crunch the same 50 terabytes of data from the same original dumps of Common Crawl and various Russian pirate sites, is hard to justify on an intuitive basis. I think the hard work has already been done. We just don't know how to leverage it properly yet.

thesz 6 hours ago | parent | next [-]

Change layer size and you have to retrain. Change number of layers and you have to retrain. Change tokenization and you have to retrain.

dTal 5 hours ago | parent | next [-]

None of that is true, at least in theory. You can trivially change layer size simply by adding extra columns initialized to 0, effectively embedding your smaller network in a larger one. You can add layers in a similar way, and in fact LLMs are surprisingly robust to having layers added and removed - you can sometimes actually improve performance simply by duplicating some middle layers[0]. Tokenization is probably the hardest, but it only directly touches the embedding and unembedding layers at the ends of the network; it's probably not impossible to retrain those while preserving the middle parts.

[0] https://news.ycombinator.com/item?id=47431671 https://news.ycombinator.com/item?id=47322887
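The "embed the smaller network in the larger one" claim is easy to check on a single linear layer. This is a minimal NumPy sketch with made-up dimensions, not taken from any real model: pad the weight matrix with zero rows and columns, and the enlarged layer computes exactly the same outputs on the original inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 4, 3    # original layer size (illustrative)
D_in, D_out = 6, 5    # enlarged layer size (illustrative)

W = rng.normal(size=(d_out, d_in))   # original weights
x = rng.normal(size=d_in)            # an input the small layer sees

# Zero-pad: the old network occupies the top-left block of the new one.
W_big = np.zeros((D_out, D_in))
W_big[:d_out, :d_in] = W

x_big = np.zeros(D_in)
x_big[:d_in] = x                     # original input, padded with zeros

y_small = W @ x
y_big = W_big @ x_big

assert np.allclose(y_big[:d_out], y_small)   # old outputs preserved exactly
assert np.allclose(y_big[d_out:], 0.0)       # new dimensions start inert
```

The new rows and columns start inert and can be trained from there, which is the sense in which widening doesn't force a retrain from scratch.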

thesz 3 hours ago | parent | next [-]

You took the simple path, embedding the smaller into the larger. What if you need to reduce the number of layers and/or the width of the hidden layers? How will you embed the larger into the smaller? As for the duplication of layers: wouldn't the process of selecting which layers to duplicate itself count as training?

What if you still have to obtain the best result possible for a given coefficient/tokenization budget?

I think my comment expresses the general case, while yours provides some exceptions.

andriy_koval 3 hours ago | parent | prev [-]

There is evidence it is useful in some cases, but obviously no evidence it is enough if you are chasing SOTA.

altruios 5 hours ago | parent | prev | next [-]

Hopefully we will find a way to make minor changes without requiring a full retrain. Training how to train, as a concept, comes to mind.

CamperBob2 5 hours ago | parent | prev [-]

And yet after changing all that stuff, the KL divergence between different models' next-token distributions remains remarkably small, regardless of the specific hyperparameters and block diagrams employed at pretraining time. Some choices are better, some worse, but they all succeed at the game of next-token prediction to a similar extent.
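To make the comparison concrete, here is a toy sketch of the kind of measurement being described: KL divergence between two models' next-token distributions over the same context. The "models" are just hand-made logits over a 5-token vocabulary; a real comparison would average this over many contexts.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a logit vector.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def kl(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Two hypothetical models' logits for the same context (illustrative values).
logits_a = np.array([2.0, 1.0, 0.5, 0.1, -1.0])
logits_b = np.array([1.8, 1.1, 0.4, 0.2, -0.9])

p, q = softmax(logits_a), softmax(logits_b)
print(kl(p, q))   # small value -> the two models predict similarly
```

Small average KL across contexts is what "they all succeed at next-token prediction to a similar extent" cashes out to numerically.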

To me, that suggests that transformer pretraining creates some underlying structure or geometry that hasn't yet been fully appreciated, and that may be more reusable than people think.

Ultimately, I also doubt that the model weights are going to turn out to be all that important. Not compared to the toolchains as a whole.

thesz 3 hours ago | parent [-]

That "underappreciated underlying structure or geometry" may be just an artifact of using the same tokenization with different models.

Tokenization breaks up collocations and creates new ones that were never present in the original text. Most probably, the first byte pair found by a simple byte-pair-encoding algorithm on enwik9 will be two adjacent spaces. Is that a true collocation? BPE thinks so. Humans may disagree.

What concerns me here is that tokenization artifacts are very hard to ablate.
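The two-spaces claim is easy to demonstrate with the first step of BPE: count adjacent byte pairs and take the most frequent one. A minimal sketch on a made-up snippet (not enwik9) with ordinary indentation:

```python
from collections import Counter

def most_frequent_pair(data: bytes):
    # First BPE merge step: count every adjacent byte pair and
    # return the winner along with its count.
    counts = Counter(zip(data, data[1:]))
    pair, n = counts.most_common(1)[0]
    return bytes(pair), n

# Illustrative sample; any text with 4-space indentation behaves similarly.
sample = b"def f(x):\n    return x\n\ndef g(y):\n    return y\n"
pair, n = most_frequent_pair(sample)
print(pair, n)   # -> b'  ' 6: two adjacent spaces win
```

On indented text the winning "token" is whitespace layout, not anything a linguist would call a collocation.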

pduggishetti 6 hours ago | parent | prev [-]

I do not think it's Common Crawl anymore; it's Common Crawl++, using paid human experts to generate and verify new content, whether it's code or research.

I believe the US is building this off the labor-cost difference with other countries, using companies like Scale, Outlier, etc., while China has the internal population to do this domestically.