| ▲ | wongarsu 9 hours ago | |
I'm trying to work towards that goal by training a model on mostly German science texts up to 1904 (before the world wars, German was the lingua franca of most sciences). Training data for a base model isn't that hard to come by, even though you have to OCR most of it yourself because the publicly available OCRed versions are commonly unusably bad. But training a model large enough to be useful is a major issue. Training a 700M-parameter model at home is very doable (and is what this TimeCapsuleLLM is), but to get that kind of reasoning you need something closer to a 70B model. Also, a lot of the "smarts" of a model get injected during fine-tuning and RL, but any of the available fine-tuning datasets would obviously contaminate the model with 2026 knowledge. | ||
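To make the size gap concrete, here is a rough back-of-the-envelope sketch (hypothetical layer counts and widths, not the actual TimeCapsuleLLM config): it uses the standard ~12·n_layers·d_model² approximation for a decoder-only transformer and a ~16-bytes-per-parameter estimate for mixed-precision Adam state, ignoring activations and architecture details like GQA or SwiGLU.

    # Why a ~700M model is home-trainable but a ~70B model is not.
    # Hypothetical configs; the simple rule below ignores biases,
    # layer norms, and modern architecture tweaks.

    def param_count(n_layers: int, d_model: int, vocab_size: int = 32_000) -> int:
        """Approximate parameters of a GPT-style decoder-only transformer."""
        embeddings = vocab_size * d_model      # token embeddings (often tied with the output head)
        per_layer = 12 * d_model ** 2          # ~4*d^2 attention + ~8*d^2 MLP (d_ff = 4*d)
        return embeddings + n_layers * per_layer

    def training_memory_gb(params: int, bytes_per_param: int = 16) -> float:
        """Optimizer-state memory for mixed-precision Adam, excluding activations:
        ~2 B fp16 weights + 4 B fp32 master copy + 8 B Adam moments + 2 B grads."""
        return params * bytes_per_param / 1e9

    for name, layers, d in [("~700M-class", 24, 1536), ("~70B-class", 80, 8192)]:
        p = param_count(layers, d)
        print(f"{name}: {p/1e9:.2f}B params, ~{training_memory_gb(p):.0f} GB optimizer state")

This prints roughly 0.73B params / ~12 GB for the small config versus ~65B params / ~1000 GB for the large one, which is why the former fits a single consumer GPU while the latter needs a multi-node cluster before you even count activations.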
| ▲ | benbreen 7 hours ago | parent | next [-] | |
I am a historian and am putting together a grant application for a somewhat similar project (different era and language though). Would you be open to discussing a collaboration? My email is bebreen [at] ucsc [dot] edu. | ||
| ▲ | theallan 9 hours ago | parent | prev | next [-] | |
Can we follow along with your work / results somewhere? | ||