| ▲ | wongarsu 9 hours ago | |
I'm trying to work towards that goal by training a model on mostly German science texts up to 1904 (before the world wars, German was the lingua franca of most sciences). Training data for a base model isn't that hard to come by, even though you have to OCR most of it yourself because the publicly available OCRed versions are commonly unusably bad. But training a model large enough to be useful is a major issue. Training a 700M-parameter model at home is very doable (and is what this TimeCapsuleLLM is), but to get that kind of reasoning you need something closer to a 70B model. Also, a lot of the "smarts" of a model get injected during fine-tuning and RL, but any of the available fine-tuning datasets would obviously contaminate the model with 2026 knowledge. | ||
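To make the size gap concrete, here is a rough back-of-the-envelope sketch (hypothetical layer counts and widths, not the actual TimeCapsuleLLM config): it uses the standard ~12·n_layers·d_model² approximation for a decoder-only transformer and a ~16-bytes-per-parameter estimate for mixed-precision Adam state, ignoring activations and architecture details like GQA or SwiGLU.

    # Why a ~700M model is home-trainable but a ~70B model is not.
    # Hypothetical configs; the simple rule below ignores biases,
    # layer norms, and modern architecture tweaks.

    def param_count(n_layers: int, d_model: int, vocab_size: int = 32_000) -> int:
        """Approximate parameters of a GPT-style decoder-only transformer."""
        embeddings = vocab_size * d_model      # token embeddings (often tied with the output head)
        per_layer = 12 * d_model ** 2          # ~4*d^2 attention + ~8*d^2 MLP (d_ff = 4*d)
        return embeddings + n_layers * per_layer

    def training_memory_gb(params: int, bytes_per_param: int = 16) -> float:
        """Optimizer-state memory for mixed-precision Adam, excluding activations:
        ~2 B fp16 weights + 4 B fp32 master copy + 8 B Adam moments + 2 B grads."""
        return params * bytes_per_param / 1e9

    for name, layers, d in [("~700M-class", 24, 1536), ("~70B-class", 80, 8192)]:
        p = param_count(layers, d)
        print(f"{name}: {p/1e9:.2f}B params, ~{training_memory_gb(p):.0f} GB optimizer state")

This prints roughly 0.73B params / ~12 GB for the small config versus ~65B params / ~1000 GB for the large one, which is why the former fits a single consumer GPU while the latter needs a multi-node cluster before you even count activations.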
| ▲ | benbreen 7 hours ago | parent | next [-] | |
I am a historian and am putting together a grant application for a somewhat similar project (different era and language though). Would you be open to discussing a collaboration? My email is bebreen [at] ucsc [dot] edu. | ||
| ▲ | theallan 9 hours ago | parent | prev | next [-] | |
Can we follow along with your work / results somewhere? | ||