fastball 5 days ago

> According to Google, it’s possible to verify Googlebot by matching the crawler’s IP against a list of published Googlebot IPs. This is rather technical and highly intensive

Wat. Blocklisting IPs is not very technical (for someone running a website that knows + cares about crawling) and is definitely not intensive. Fetch IP list, add to blocklist. Repeat daily with cronjob.

Would take an LLM (heh) 10 seconds to write you the necessary script.
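For concreteness, a minimal sketch of that script in Python, assuming the layout of Google's published googlebot.json (a "prefixes" array with "ipv4Prefix" / "ipv6Prefix" entries) and emitting nginx deny rules as just one possible output format; swap deny for allow if you want an allowlist instead:

    #!/usr/bin/env python3
    # Fetch Google's published Googlebot IP ranges and emit one nginx
    # access rule per CIDR. Meant to be run periodically (e.g. from cron).
    import json
    import urllib.request

    # URL Google documents for its verified Googlebot ranges.
    GOOGLEBOT_RANGES = "https://developers.google.com/search/apis/ipranges/googlebot.json"

    def fetch_prefixes(url):
        """Return the CIDR strings listed in a googlebot.json-style file."""
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = json.load(resp)
        cidrs = []
        for entry in data.get("prefixes", []):
            # Each entry carries either an "ipv4Prefix" or an "ipv6Prefix" key.
            cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
            if cidr:
                cidrs.append(cidr)
        return cidrs

    if __name__ == "__main__":
        for cidr in fetch_prefixes(GOOGLEBOT_RANGES):
            # Swap "deny" for "allow" to build an allowlist instead.
            print(f"deny {cidr};")

Point cron at it once a day, write the output to a file your web server includes, reload, and that's the whole "repeat daily" loop.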

kinix 5 days ago | parent | next [-]

As I read it, the author is suggesting that this list of IPs go on an allowlist, since they see Google as "less nefarious". In that sense, sure, allowing Google's IPs is as easy as allowing all IPs, but discerning who the "nefarious actors" are is probably harder.
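The allowlist check itself is still cheap, for what it's worth; a minimal sketch using only the standard library, with a single illustrative CIDR standing in for the full published list:

    # Check whether a requesting IP falls inside any published Googlebot range.
    # Assumes you already have the CIDR list (e.g. from a daily fetch script).
    import ipaddress

    def is_googlebot(ip_str, cidrs):
        """True if ip_str is inside one of the given CIDR ranges."""
        ip = ipaddress.ip_address(ip_str)
        return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)

    cidrs = ["66.249.64.0/27"]                  # illustrative prefix only
    print(is_googlebot("66.249.64.5", cidrs))   # True: inside the range
    print(is_googlebot("203.0.113.7", cidrs))   # False: unknown caller

The check only tells you which requests are verifiably Googlebot; it says nothing about how nefarious the rest are, which is the hard part.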

A more tongue-in-cheek point: all scripts take an LLM ~10 seconds to write; that doesn't mean they're right, though.

simonw 5 days ago | parent | prev [-]

Looks like OpenAI publish the IPs of their training data crawler too: https://openai.com/gptbot.json
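The file appears to use the same shape as Google's (a "prefixes" array of "ipv4Prefix" / "ipv6Prefix" entries), so if that holds the same parsing sketch works with just the URL swapped; worth verifying against the live JSON before relying on it:

    # Same idea, pointed at OpenAI's published GPTBot ranges. The field
    # names are assumed to mirror Google's googlebot.json; verify first.
    import json
    import urllib.request

    GPTBOT_RANGES = "https://openai.com/gptbot.json"

    with urllib.request.urlopen(GPTBOT_RANGES, timeout=30) as resp:
        data = json.load(resp)

    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            print(f"deny {cidr};")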