kqr 2 hours ago

I have a hypothesis that email scrapers don't parse HTML at all. I suspect they search the raw bytestring for @ characters and take whatever's on either side of each one. That probably gets them as many addresses as they can realistically use at a fraction of the cost, given how expensive HTML parsing can be.
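
To make the idea concrete, here is a minimal sketch of that kind of scan, not code taken from any real scraper. The function name `scan_for_addresses` and the `ALLOWED` byte set are assumptions made for illustration, and the set is far narrower than what the email RFCs actually permit.

```python
# Bytes treated as part of an address. This is a simplifying assumption,
# much narrower than what the email RFCs allow.
ALLOWED = frozenset(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789._%+-"
)


def scan_for_addresses(raw: bytes) -> list[str]:
    """Find address-like strings by expanding outward from each '@' byte."""
    found = []
    pos = raw.find(b"@")
    while pos != -1:
        # Walk left while the bytes still look like part of a local part.
        start = pos
        while start > 0 and raw[start - 1] in ALLOWED:
            start -= 1
        # Walk right while the bytes still look like part of a domain.
        end = pos + 1
        while end < len(raw) and raw[end] in ALLOWED:
            end += 1
        local = raw[start:pos]
        # Drop trailing dots so a sentence-ending period is not kept.
        domain = raw[pos + 1:end].rstrip(b".")
        # Keep only candidates with a non-empty local part and a dotted domain.
        if local and b"." in domain:
            found.append((local + b"@" + domain).decode("ascii"))
        pos = raw.find(b"@", pos + 1)
    return found


page = b'<p>Contact <a href="mailto:alice@example.com">Alice</a>.</p>'
print(scan_for_addresses(page))
```

No tag or attribute is ever interpreted here: the colon in `mailto:` and the closing quote simply fall outside the allowed set, so the expansion stops there.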

(Similarly, I'm sure most links can be found by searching the bytestring for "href" and taking what's to the right of it.)
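
The same trick for links might look roughly like the sketch below. The function name `scan_for_links` is hypothetical, and the code assumes a lowercase `href=` immediately followed by a quoted value, so unquoted or oddly spaced attributes are missed.

```python
def scan_for_links(raw: bytes) -> list[bytes]:
    """Collect quoted values that follow 'href=' without parsing any HTML."""
    marker = b"href="
    links = []
    pos = raw.find(marker)
    while pos != -1:
        value_start = pos + len(marker)
        # Slicing (not indexing) yields b"" instead of raising at end of input.
        quote = raw[value_start:value_start + 1]
        if quote in (b'"', b"'"):
            value_end = raw.find(quote, value_start + 1)
            if value_end != -1:
                links.append(raw[value_start + 1:value_end])
        pos = raw.find(marker, pos + 1)
    return links


page = b'<a href="https://example.com/a">A</a> <a href=\'/b\'>B</a> <a href=c>C</a>'
print(scan_for_links(page))
```

The unquoted `href=c` is skipped, which fits the "most links" framing: the scan trades completeness for not having to build a document tree.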

This would explain why encoding addresses as HTML entities is so effective against scrapers.
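
A small illustration of why, assuming the scraper really does look for a literal @ byte: once the sign is written as the numeric entity `&#64;`, that byte is no longer in the page, and it only reappears after an entity-decoding step that a raw scan never runs. The sample strings are made up.

```python
import html

plain = b"Contact: alice@example.com"
encoded = b"Contact: alice&#64;example.com"  # '@' written as a numeric entity

# A raw byte search finds the '@' only in the unencoded page.
print(b"@" in plain)
print(b"@" in encoded)

# Decoding entities, as a browser or a real HTML parser would, restores it.
print(html.unescape(encoded.decode("ascii")))
```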

On the other hand, surely the TLS handshake is far more expensive than HTML parsing? Maybe it's to avoid parser failure modes that consume a lot of resources?

BorisMelnik an hour ago

It really varies. You are correct that most modern ones search the byte string for @ characters, but there are probably hundreds of different methods out there in black hat marketing circles for scraping emails.

mcmcmc an hour ago

Haven’t heard “black hat marketing” before, but that’s very fitting for a lot of the “growth hackers” out there.

j45 2 hours ago

Token-based extraction around the @ is definitely one approach that can work, with a few tweaks.