nospice 11 hours ago

I'm surprised you can do this with a relatively modest corpus of text (compared to the petabytes you can vacuum up from modern books, Wikipedia, and random websites). But if it works, that's actually fantastic, because it lets you answer some interesting questions about LLMs being able to make new discoveries or transcend the training set in other ways. Forget relativity: can an LLM trained on this data notice any inconsistencies in its scientific knowledge, devise experiments that challenge them, and then interpret the results? Can it intuit about the halting problem? Theorize about the structure of the atom?...

Of course, if it fails, the counterpoint will be "you just need more training data", but still - I would love to play with this.

andy99 5 hours ago

The Chinchilla paper says the "optimal" training set size is about 20x the number of parameters (in tokens); see Table 3: https://arxiv.org/pdf/2203.15556

Here they do 80B tokens for a 4B model.
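As a back-of-the-envelope check, a minimal sketch in Python, assuming the paper's 20-tokens-per-parameter rule of thumb (it's a heuristic, not an exact constant):

```python
# Sanity check of the Chinchilla 20-tokens-per-parameter rule of thumb
# (Hoffmann et al. 2022, Table 3).
params = 4e9            # 4B-parameter model
tokens_per_param = 20   # approximate compute-optimal ratio
optimal_tokens = params * tokens_per_param
print(f"{optimal_tokens / 1e9:.0f}B tokens")  # -> 80B tokens
```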

EvgeniyZh 38 minutes ago

It's worth noting that this is compute-optimal, i.e., given a fixed compute budget, the optimal choice is about 20 tokens per parameter.

Under the Chinchilla model, a larger model always performs better than a smaller one when trained on the same amount of data. I'm not sure that holds empirically, but 1-10B parameters is probably a good guess for how large a model trained on 80B tokens should be.

Similarly, small models continue to improve beyond the 20:1 ratio, and current models are trained on much more data. You could train a better-performing model using the same compute, but it would be larger, which is not always desirable.
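To make that concrete, here is a minimal sketch of the Chinchilla parametric loss L(N, D) = E + A/N^alpha + B/D^beta, using the approximate fitted constants reported in the paper (the exact values depend on the fit and setup, so treat the numbers as illustrative):

```python
# Approximate fitted constants from Hoffmann et al. 2022 (illustrative only).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

D = 80e9  # fixed data budget of 80B tokens
for N in (1e9, 4e9, 10e9, 70e9):
    print(f"N={N/1e9:>4.0f}B  predicted loss={chinchilla_loss(N, D):.3f}")

# Under this fit, predicted loss keeps falling as N grows at fixed D,
# and likewise as D grows at fixed N -- hence the points above about
# larger models and about training past the 20:1 ratio.
```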

Aerolfos 4 hours ago

> https://github.com/DGoettlich/history-llms/blob/main/ranke-4...

Given the training notes, it seems you can't actually get the performance shown in their examples?

I'm not sure about the exact details, but there's some kind of targeted distillation from GPT-5 involved to get more conversational text and better performance. That seems a bit iffy to me.