kqr 2 hours ago
I have a hypothesis that email scrapers don't parse HTML at all. I suspect they search the raw bytestring for @ characters and take whatever is on either side. That probably gets them as many addresses as they can realistically use at a fraction of the cost, given how expensive HTML parsing can be. (Similarly, I'm sure most links can be found by searching the bytestring for "href" and taking what's to the right of it.) This would explain why HTML entities are so effective: an address with its @ written as &#64; contains no literal @ for such a scan to find. On the other hand, surely the TLS handshake is far more expensive than HTML parsing? Maybe it's to avoid parser failure modes that consume a lot of resources?
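To make the hypothesis concrete, here is a minimal sketch of what such a scan might look like. It is only an illustration of the idea, not how any real scraper is known to work: the function name `scan_for_addresses`, the character classes, and the sample inputs are all assumptions made for the example.

```python
import html
import re

# Characters assumed to be allowed on either side of the @. This is an
# illustrative choice, not a full address grammar.
LOCAL = rb"[A-Za-z0-9._%+-]+"
DOMAIN = rb"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
EMAIL_RE = re.compile(LOCAL + rb"@" + DOMAIN)


def scan_for_addresses(raw: bytes) -> list[bytes]:
    """Return address-like byte strings found around literal @ characters.

    No HTML parsing happens here: the page is treated as an opaque bytestring.
    """
    return EMAIL_RE.findall(raw)


# Hypothetical page fragments.
plain = b'<a href="mailto:alice@example.com">mail me</a>'
encoded = b"<p>alice&#64;example.com</p>"

found_plain = scan_for_addresses(plain)      # the literal @ is found
found_encoded = scan_for_addresses(encoded)  # no literal @, so nothing is found

# A scraper that decoded entities first would recover the address again.
decoded = html.unescape(encoded.decode("ascii")).encode("ascii")
found_decoded = scan_for_addresses(decoded)
```

Under these assumptions the raw scan finds the address in `plain` and misses it in `encoded`, which is consistent with entity encoding working against scrapers that never decode the page.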
BorisMelnik an hour ago
It really varies. You are correct that most modern ones search the byte string for @ characters, but there are probably hundreds of different methods out there in black-hat marketing circles for scraping emails.
j45 2 hours ago
Token-based extraction around the @ is definitely one way that can work, with a few tweaks.