Remix.run Logo
philipportner 2 hours ago

This seems related, it may not be a codebase but they are able to extract "near" verbatim books out of Claude Sonnet.

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

Aurornis an hour ago | parent [-]

Their technique really stretched the definition of extracting text from the LLM.

They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.

You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.