Remix.run Logo
throwup238 3 hours ago

There is no doubt trillions of tokens of general communication in all kinds of languages tucked away in national archives and private collections.

The National Archives of Spain alone have 350 million pages of documents going back to the 15th century, ranging from correspondence to testimony to charts and maps, but only 10% of it is digitized and a much smaller fraction is transcribed. Hopefully with how good LLMs are getting they can accelerate the transcription process and open up all of our historical documents as a huge historical LLM dataset.