ninjin 2 hours ago

I can report that Facebook does not respect robots.txt. Heck, I even mailed domain@fb.com with the specific IP ranges and log samples three times over a month and they did not even respond. They keep wasting my CPU cycles to this day by crawling massive development forks (I hope they choke on the data...):

    $ (cat /var/www/logs/access.log; zcat /var/www/logs/access.log*.gz) | grep 2a03:2880: | wc -l
    626396
About three hits per second for months now.
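
For anyone who wants the same count with a real prefix match instead of a substring grep, here is a rough Python sketch. It assumes the prefix behind the grep pattern is 2a03:2880::/32 and that the client address is the first whitespace-separated field of each log line; both are assumptions, and `is_from_prefix`, `count_hits`, and the `field` parameter are made-up names for illustration.

```python
import gzip
import ipaddress

# Assumption: the /32 that corresponds to the "2a03:2880:" grep pattern.
PREFIX = ipaddress.ip_network("2a03:2880::/32")


def is_from_prefix(line, network=PREFIX, field=0):
    """Return True if the chosen field of a log line is an address in network.

    field is the index of the whitespace-separated column holding the
    client address (0 is an assumption; some servers log the vhost first).
    """
    fields = line.split()
    if len(fields) <= field:
        return False
    try:
        addr = ipaddress.ip_address(fields[field])
    except ValueError:
        # Not an IP address at all (malformed or unexpected log line).
        return False
    return addr in network


def count_hits(paths, network=PREFIX, field=0):
    """Count matching lines across plain and gzip-compressed log files."""
    total = 0
    for path in paths:
        opener = gzip.open if str(path).endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            total += sum(1 for line in handle if is_from_prefix(line, network, field))
    return total
```

Something like `count_hits(sorted(pathlib.Path("/var/www/logs").glob("access.log*")))` would cover the same files as the shell pipeline. Unlike the grep, it will not match the string "2a03:2880:" when it shows up elsewhere in a line, such as in a referrer.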
dylan604 an hour ago | parent | next [-]

Can you serve them a specific file that would make it expensive on their end?

ninjin an hour ago | parent [-]

If I had the time and energy, I would make some sort of simple code language model and generate infinite junk and feed that to them in the hope that it ruins their future training runs. But, I lack the former and some of the latter. Alternatively, maybe I would actually read one of those "backdoor papers" and try to inject something like that.

dylan604 an hour ago | parent [-]

I was wondering if this could be done without being malicious to that level. If they are costing you money, then I have no moral qualms about responding in kind. Taking that next step, though, would give up the moral high ground and potentially put you on legally questionable ground.

I get the lack of time/energy for this type of thing. It is one of those projects that could be satisfying, but it is very hard to justify if you have a family. A younger person might get a lot of pleasure from it.

drcongo an hour ago | parent | prev [-]

I block their entire ASN when they do that.
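
A minimal sketch of what "blocking an ASN" boils down to, assuming you already have the list of prefixes the ASN announces (from a routing registry or a BGP data source; fetching it is not shown). The function names are hypothetical, and the actual enforcement would live in your firewall or web server rather than in Python.

```python
import ipaddress


def build_blocklist(prefixes):
    """Parse CIDR strings and merge adjacent or overlapping networks.

    IPv4 and IPv6 are collapsed separately because
    ipaddress.collapse_addresses rejects mixed versions.
    """
    nets = [ipaddress.ip_network(p) for p in prefixes]
    v4 = [n for n in nets if n.version == 4]
    v6 = [n for n in nets if n.version == 6]
    return list(ipaddress.collapse_addresses(v4)) + list(
        ipaddress.collapse_addresses(v6)
    )


def is_blocked(address, blocklist):
    """Return True if address falls inside any network in blocklist."""
    addr = ipaddress.ip_address(address)
    return any(addr in net for net in blocklist)


# Placeholder input: only the first prefix comes from this thread (and the
# /32 length is an assumption); the rest are documentation ranges standing
# in for a real per-ASN prefix list.
EXAMPLE_PREFIXES = [
    "2a03:2880::/32",
    "192.0.2.0/25",
    "192.0.2.128/25",
]
```

With that input, `build_blocklist(EXAMPLE_PREFIXES)` merges the two IPv4 halves into a single /24, and the collapsed list can then be exported to whatever table or deny list your firewall uses.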