The sample responses are fascinating. It seems harder than usual to even tell that they were generated by an LLM, since most of us (terminally online people) have trained our brains' AI-text detection on output from models with a recent cutoff date. Some of the sample responses seem quite unlike anything an LLM would say: obviously because of the model's apparent beliefs about certain concepts, but perhaps less obviously because its word choice and sentence structure make the responses feel slightly 'old-fashioned'.

libraryofbabel 16 hours ago

I used to teach 19th-century history, and the responses definitely sound like a Victorian-era writer. And they of course sound like writing (books and periodicals etc) rather than "chat": as other responders allude to, the fine-tuning or RL process for making them good at conversation was presumably quite different from what is used for most chatbots, and they're leaning very heavily into the pre-training texts. We don't have any living Victorians to RLHF on: we just have what they wrote.

To go a little deeper on the idea of 19th-century "chat": I did a PhD on this period and yet I would be hard-pushed to tell you what actual 19th-century conversations were like. There are plenty of literary depictions of conversation from the 19th century of presumably varying levels of accuracy, but we don't really have great direct historical sources of everyday human conversations until sound recording technology got good in the 20th century. Even good 19th-century transcripts of actual human speech tend to be from formal things like court testimony or parliamentary speeches, not everyday interactions. The vast majority of human communication in the premodern past was the spoken word, and it's almost all invisible in the historical sources.

Anyway, this is a really interesting project, and I'm looking forward to trying the models out myself!

nemomarx 15 hours ago

I wonder if the historical format you might want to look at for "chat" is letters? They're definitely wordier, but they at least have the back-and-forth feel, and we often have complete correspondence over long stretches from certain figures.

This would probably get easier towards the start of the 20th century ofc

libraryofbabel 15 hours ago

Good point, informal letters might actually be a better source - AI chat is (usually) a written rather than spoken interaction after all! And we do have a lot of transcribed collections of letters to train on, although they’re mostly from people who were famous or became famous, which certainly introduces some bias.

dleeftink 15 hours ago

While not specifically Victorian, couldn't we learn much about what daily conversations were like by looking at surviving oral cultures, or other relatively secluded communal pockets? I'd also say time and progress are not always equally distributed, and even within geographical regions (such as the U.K.) there are likely large differences in the rate of language shifts since then, with some older patterns possibly surviving well into the 20th century.

NooneAtAll3 8 hours ago

Don't we have parliament transcripts? I remember something about Germany (or maybe even Prussia) developing a fast script to preserve what was said word for word.

bryancoxwell 14 hours ago

Fascinating, thanks for sharing

_--__--__ 16 hours ago

The time cutoff probably matters, but maybe not as much as the lack of human fine-tuning from places like Nigeria with somewhat foreign styles of English. I'm not really sure there is as much of an 'obvious LLM text style' in other languages; it hasn't seemed that way in my limited attempts to speak to LLMs in languages I'm studying.

d3m0t3p 16 hours ago

The model is fine-tuned for chat behavior, so the style might be due to:
- Fine-tuning
- More stylised text in the corpus; English has evolved a lot in the last century.

paul_h 8 hours ago

Diverged as well as standardized. I did some research into "out of pocket" and how it differs in meaning between UK English (paying from one's own funds) and American English (uncontactable), and I recall 1908 being the current thinking as to when the divergence happened: a 1908 short story by O. Henry titled "Buried Treasure."

anonymous908213 16 hours ago

There is. I have observed it in both Chinese and Japanese.

kccqzy 11 hours ago

Oh definitely. One thing that immediately caught my eye is that the question asks the model about “homosexual men” but the model starts the response with “the homosexual man” instead, changing the plural to the singular and adding an article. Feels very old-fashioned to me.

tonymet 15 hours ago

The samples push the boundaries of a commercial AI, but still seem tame / milquetoast compared to common opinions of that era. And the prose doesn't compare. Something is off.