immibis a day ago

My issue with Gitea (which Forgejo is a fork of) was that crawlers would hit the "download repository as zip" link over and over. Each access creates a new zip file on disk which is never cleaned up. I disabled that (by setting the temporary zip directory to read-only, so the feature won't work) and haven't had a problem since then.
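The read-only trick can be sketched in a couple of shell commands. The path below is an assumption, not the commenter's actual setup: Gitea keeps generated archives under its data directory, and the exact location depends on APP_DATA_PATH in app.ini.

```shell
# Sketch of the workaround: strip write permission from the
# directory where Gitea writes generated repo archives, so the
# "download repository as zip" feature fails instead of filling
# the disk. ARCHIVE_DIR is a placeholder -- check APP_DATA_PATH
# in app.ini for your install.
ARCHIVE_DIR="${ARCHIVE_DIR:-/var/lib/gitea/data/repo-archive}"
chmod a-w "$ARCHIVE_DIR"
```

Archive creation now fails with a permission error rather than consuming disk; normal repository browsing and git operations are unaffected.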

It's easy to assume "I received a lot of requests, therefore the problem is too many requests" but you can successfully handle many requests.

This is a clever way of doing a minimally invasive botwall though - I like it.

userbinator 11 hours ago | parent | next [-]

> Each access creates a new zip file on disk which is never cleaned up.

That sounds like a bug.

isodev 7 hours ago | parent [-]

I think that was fixed in Forgejo a long time ago.

bob1029 12 hours ago | parent | prev [-]

> you can successfully handle many requests.

There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant. Especially at the scale of a self-hosted forge with a constrained audience. I find this to be a much easier path.

I wish we could find a way to not conflate the intellectual property concerns with the technological performance concerns. It seems like this is essential to keeping the AI scraping drama going in many ways. We can definitely make the self hosted git forge so fast that anything short of ~a federal crime would have no meaningful effect.

idontsee 11 hours ago | parent | next [-]

> There is a point where your web server becomes fast enough that the scraping problem becomes irrelevant.

It isn't just the volume of requests, but also bandwidth. There have been cases where scraping represents >80% of a forge's bandwidth usage. I wouldn't want that to happen to the one I host at home.

spockz 11 hours ago | parent | prev [-]

Maybe it is fast enough, but my objection is mostly to the gross inefficiency of crawlers. Requesting downloads of whole repositories over and over wastes CPU cycles creating the archives, storage space retaining them, and bandwidth sending them over the wire. Add to this the gross power consumption of AI and its hogging of physical compute hardware, and it is easy to see “AI” as wasteful.