jaydepun 2 hours ago

We've thought of doing this sort of exercise at work but mostly hit the wall of data becoming a lot more scarce the further back in time we go. That's particularly true of high-quality science data: even going pre-1970 (and that's already a stretch) you lose a lot of information. There's a triple whammy of the data still existing, being accessible in any format, and that format being suitable for training an LLM. Then there are the complications of wanting additional model capabilities without leaking data causally.