RyanCavanaugh 4 hours ago
The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of some verbatim recitation, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.
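A rough back-of-the-envelope, taking both figures at face value (both are loose estimates, not measurements; "hundreds of billions of terabytes" is read here as ~2e11 TB):

    # Ratio between the two size estimates above.
    # Both inputs are the commenter's rough figures, not measured values.
    internet_bytes = 200e9 * 1e12   # "hundreds of billions of terabytes" ~ 2e23 bytes
    model_bytes = 0.5 * 1e12        # "maybe half a terabyte" on disk

    ratio = internet_bytes / model_bytes
    print(f"internet / model ~ {ratio:.1e}x")   # ~ 4.0e+11x

At a ratio like that, even a perfect lossless compressor couldn't fit the training set in the weights; whatever the model retains is necessarily a lossy distillation.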
philipportner 2 hours ago
This seems related: it may not be a codebase, but they were able to extract near-verbatim books out of Claude Sonnet. https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer's Stone and 1984 (Section 4).
| ||||||||
seba_dos1 2 hours ago
> The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.

The lesson here is that the Internet compresses pretty well.
uywykjdskn 22 minutes ago
Got a source on frontier models being maybe half a terabyte? That doesn't pass the sniff test.
mft_ 2 hours ago
(I'm not needlessly nitpicking; I think it matters for this discussion.) A frontier model (e.g. the latest Gemini or GPT) is likely several times larger than 500GB. Even DeepSeek-V3 was around 700GB. But your overall point still stands regardless.
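For a sanity check on the on-disk sizes, the arithmetic is just parameter count times bytes per parameter. A minimal sketch using DeepSeek-V3's published count of 671B parameters (frontier-model counts are unpublished, so any figure for Gemini or GPT would be pure assumption):

    # On-disk size ~= parameter count * bytes per parameter.
    # 671B total parameters is DeepSeek-V3's published figure; the bytes
    # per weight depend on the stored precision (it shipped in FP8).
    PARAMS_DEEPSEEK_V3 = 671e9

    for name, bytes_per_param in [("FP8", 1), ("BF16", 2), ("FP32", 4)]:
        size_gb = PARAMS_DEEPSEEK_V3 * bytes_per_param / 1e9
        print(f"{name}: ~{size_gb:,.0f} GB")
    # FP8: ~671 GB -- consistent with the ~700GB figure above.

So "half a terabyte" only holds for a model at or below roughly 500B parameters stored at one byte per weight; anything larger, or stored in BF16, blows well past it.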