e12e 4 hours ago

Odd, I'd imagine Wikisource (in many/all languages) would be part of training data for all LLMs with SOTA ambition?

https://no.wikisource.org/wiki/De_knyttede_n%C3%A6ver

vidarh 4 hours ago | parent [-]

You'd think so. It seems like there are a lot of odd gaps like that.

I also have a favourite English-language PhD thesis that I ask every new model about. They still struggle to find it, even though there's a Wikipedia article about it that links to a blog post I wrote about it.

Anyone who thinks they've exhausted even publicly crawlable resources should ask them about some obscure stuff.

thatcat 6 minutes ago | parent | next [-]

the models don't retain their full training data set

mistrial9 24 minutes ago | parent | prev [-]

you might be surprised if you take this approach: give key words and phrases in small amounts, each sentence of a prompt building on the previous sentence. Take an example that is not very hard, like the original text of Lewis Carroll's Alice in Wonderland. A quick question might get things sort of wrong, or miss details. But if you guide the LLM to a certain part of the story, then a certain set of characters in that part of the story, then a certain statement or dramatic moment with those characters, you might get very specific detail that is close to line-by-line accurate. On the other hand, if you ask a quick, ordinary question about the same part of the story without supplying context and character names, you get a vague answer. YMMV
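
For what it's worth, here is a rough Python sketch of that narrowing approach. The function name `build_guided_prompts` and the example scene are made up for illustration, and the code only builds the sequence of prompts; sending them to a model is left out, since that depends on whichever client you use.

```python
def build_guided_prompts(work, part, characters, moment):
    """Return prompts that narrow step by step, each building on the last.

    Hypothetical helper for illustration only; it does not call any model.
    """
    names = " and ".join(characters)
    return [
        f"Let's talk about {work}.",
        f"Focus on this part of the story: {part}.",
        f"In that part, consider these characters: {names}.",
        f"Now, as closely as you can, give the text of this moment: {moment}.",
    ]


# Example values are assumptions chosen to match the Alice example above.
prompts = build_guided_prompts(
    work="Alice's Adventures in Wonderland by Lewis Carroll",
    part="the mad tea party",
    characters=["the Hatter", "the March Hare"],
    moment="the riddle about the raven and the writing desk",
)

# Print the prompts in the order they would be sent, one turn at a time.
for step, prompt in enumerate(prompts, start=1):
    print(step, prompt)
```

The idea is to send these one per turn in the same conversation, rather than as a single quick question. Whether the final answer is actually close to the original text will vary by model, so check it against the source.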