| lm28469 2 hours ago | |
> the issue is there is very little text before the internet,

Hm, there is a lot of text from before the internet, but most of it is not on the internet. That creates a weird gap in some circles: people are rediscovering work by pre-1980s researchers that exists only in books that have never been reprinted and that virtually no one knows about. | ||
| throwup238 an hour ago | |
There are no doubt trillions of tokens of general communication in all kinds of languages tucked away in national archives and private collections. The National Archives of Spain alone have 350 million pages of documents going back to the 15th century, ranging from correspondence to testimony to charts and maps, but only 10% of them are digitized and a much smaller fraction is transcribed. Hopefully, given how good LLMs are getting, they can accelerate the transcription process and open up all of our historical documents as a huge historical dataset for LLMs. | ||