roflmaostc 6 days ago:
What about old books? Wikipedia? Law texts? Programming language documentation? How many tokens is a 100-page PDF? 10k to 100k?
arvindh-manian 6 days ago:
For reference, I think a common approximation is one token being about 0.75 words. At roughly 375 words per page, a 100-page book comes to about 37,500 words, which translates to around 50,000 tokens. To reach 1M+ tokens, we'd need to be looking at 2,000+ page books. That's pretty rare, even for documentation. It doesn't have to be text-based, though: I could see films and TV shows becoming increasingly important for long-context model training.
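A minimal sketch of that back-of-envelope arithmetic, assuming the ~0.75 words/token ratio and ~375 words/page figures above (real tokenizers and page layouts vary):

```python
# Back-of-envelope token estimate for a book or PDF.
# Assumed constants (rough, not exact): ~375 words per page, ~0.75 words per token.
WORDS_PER_PAGE = 375
WORDS_PER_TOKEN = 0.75

def estimate_tokens(pages: int) -> int:
    """Rough token count for a document of the given page length."""
    words = pages * WORDS_PER_PAGE
    return round(words / WORDS_PER_TOKEN)

def pages_for_tokens(tokens: int) -> int:
    """Roughly how many pages it takes to fill a given context length."""
    words = tokens * WORDS_PER_TOKEN
    return round(words / WORDS_PER_PAGE)

if __name__ == "__main__":
    print(estimate_tokens(100))         # ~50,000 tokens for a 100-page book
    print(pages_for_tokens(1_000_000))  # ~2,000 pages to fill a 1M-token context
```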
| |||||||||||||||||
jjmarr 6 days ago:
Wikipedia does not have many pages that run to 750k words. According to Special:LongPages[1], the longest page right now is a little under 750k bytes: https://en.wikipedia.org/wiki/List_of_chiropterans

Despite listing all presently known bats, the majority of the "List of chiropterans" byte count is code that generates references to the IUCN Red List, not actual text. Most of Wikipedia's longest articles are code.
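One way to check this is the MediaWiki API's querypage module, which exposes Special:LongPages programmatically. A minimal sketch, with the response field names assumed from the public API documentation (Wikimedia also asks clients to send a descriptive User-Agent):

```python
import json
import urllib.parse
import urllib.request

# Query the MediaWiki API for the equivalent of Special:LongPages.
API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "querypage",
    "qppage": "Longpages",  # the special page to query
    "qplimit": "10",        # top 10 longest pages
    "format": "json",
}
url = API + "?" + urllib.parse.urlencode(params)

# Wikimedia may reject the default urllib User-Agent, so set a descriptive one.
req = urllib.request.Request(url, headers={"User-Agent": "longpages-demo/0.1 (example)"})
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

# Each result carries the page title and its length in bytes (the "value" field).
for row in data["query"]["querypage"]["results"]:
    print(f'{row.get("value", "?"):>8}  {row["title"]}')
```

The byte counts this prints are for the raw wikitext, which is exactly why list articles stuffed with citation templates dominate the ranking even though their readable prose is comparatively short.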