mmooss 15 hours ago

On what data is it trained?

On one hand, it says it's trained on,

> 80B tokens of historical data up to knowledge-cutoffs ∈ 1913, 1929, 1933, 1939, 1946, using a curated dataset of 600B tokens of time-stamped text.

Literally, that includes Homer, the oldest Chinese texts, Sanskrit, Egyptian, etc., up to 1913. Even if limited to European texts (all the examples are about Europe), it would include the ancient Greeks, Romans, the Scholastics, Charlemagne, ... all the way up to 1913.

On the other hand, they seem to say it represents the perspective of 1913; for example,

> Imagine you could interview thousands of educated individuals from 1913—readers of newspapers, novels, and political treatises—about their views on peace, progress, gender roles, or empire.

> When you ask Ranke-4B-1913 about "the gravest dangers to peace," it responds from the perspective of 1913—identifying Balkan tensions or Austro-German ambitions—because that's what the newspapers and books from the period up to 1913 discussed.

People in 1913, of course, would be heavily biased toward recent information. Otherwise, the greatest threat to peace might be Hannibal or Napoleon or Viking coastal raids or Holy Wars. How do they accomplish a 1913 perspective?

zozbot234 15 hours ago | parent

They apparently pre-train with all data up to 1900 and then fine-tune with 1900-1913 data. Anyway, the amount of available content tends to increase quickly over time, as mass literature, periodicals, newspapers, etc. only really became a thing over the course of the 19th and early 20th centuries.
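
A rough sketch of what that time-stamped split might look like, purely as an illustration (the document schema, field names, and cutoff handling here are assumptions, not the project's actual pipeline):

    # Hypothetical sketch: split a time-stamped corpus into a pre-1900
    # pretraining set and a 1900-to-cutoff continued-pretraining set.
    # The {"year": ..., "text": ...} schema is assumed for illustration.
    CUTOFF = 1913

    def split_corpus(docs, cutoff=CUTOFF):
        """docs: iterable of dicts like {"year": 1887, "text": "..."}."""
        pretrain, continued = [], []
        for doc in docs:
            if doc["year"] < 1900:
                pretrain.append(doc["text"])      # shared base-model data
            elif doc["year"] <= cutoff:
                continued.append(doc["text"])     # cutoff-specific data
            # anything dated after the cutoff is excluded entirely
        return pretrain, continued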

mmooss 15 hours ago | parent

> They pre-train with all data up to 1900 and then fine-tune with 1900-1913 data.

Where does it say that? I tried to find more detail. Thanks.

tootyskooty 15 hours ago | parent

See the pretraining section of prerelease_notes.md:

https://github.com/DGoettlich/history-llms/blob/main/ranke-4...

pests 14 hours ago | parent

I was curious; they train a 1900 base model, then fine-tune it to the exact cutoff year:

"To keep training expenses down, we train one checkpoint on data up to 1900, then continuously pretrain further checkpoints on 20B tokens of data 1900-${cutoff}$. "