Gigachad 11 hours ago

Sounds like some people are downloading it in their CI pipelines. Probably unknowingly. This is why most services stopped allowing automated downloads for unauthenticated users.

Make people sign up if they want a URL they can `curl`, and then either block or charge users who download too much.

userbinator 11 hours ago | parent | next [-]

I'd consider CI one of the biggest wastes of computing resources ever invented, although I don't see how map data would be subject to the same sort of abusive downloading as libraries or other code.

Gigachad 11 hours ago | parent | next [-]

This stuff tends to happen by accident. Some org has an app that automatically downloads the dataset if it's missing, which is helpful for local development. Then it gets loaded into CI, and no one notices that it's downloading that dataset on every single CI run.
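
The guard itself is tiny; the part everyone forgets is that in CI it only helps if the data directory sits on a persisted cache volume, otherwise every fresh runner starts empty. A rough sketch in Python (the Geofabrik URL matches the file named in the article; the paths are made up):

    import os
    import requests

    # Assumed URL for the Italy extract; "cache/" must be a persisted
    # CI cache volume, or this guard never fires on a fresh runner.
    URL = "https://download.geofabrik.de/europe/italy-latest.osm.pbf"
    PATH = "cache/italy-latest.osm.pbf"

    if not os.path.exists(PATH):  # only fetch when the cached copy is missing
        os.makedirs("cache", exist_ok=True)
        with requests.get(URL, stream=True, timeout=120) as r:
            r.raise_for_status()
            with open(PATH, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)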

stevage 9 hours ago | parent | prev | next [-]

Let's say you're working on an app that incorporates some Italian place names or roads or something. It's easy to imagine how, when you build the app, you want to download the Italian region data from Geofabrik and then process it to extract what you want into your app. You script it, you put the script in your CI... and here we are:

> Just the other day, one user has managed to download almost 10,000 copies of the italy-latest.osm.pbf file in 24 hours!
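
If the server sends ETags, a conditional GET would make those 10,000 repeat runs nearly free. A sketch of what such a script could do instead (URL and filenames assumed, same as above):

    import os
    import requests

    URL = "https://download.geofabrik.de/europe/italy-latest.osm.pbf"
    PBF, ETAG = "italy-latest.osm.pbf", "italy-latest.osm.pbf.etag"

    headers = {}
    if os.path.exists(PBF) and os.path.exists(ETAG):
        headers["If-None-Match"] = open(ETAG).read().strip()

    with requests.get(URL, headers=headers, stream=True, timeout=120) as r:
        if r.status_code == 304:  # server says our copy is current: no transfer
            print("cache hit, skipping download")
        else:
            r.raise_for_status()
            with open(PBF, "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
            if "ETag" in r.headers:
                with open(ETAG, "w") as f:
                    f.write(r.headers["ETag"])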

ahlCVA 8 hours ago | parent | prev | next [-]

Whenever people complain about the energy usage of LLM training runs, I wonder how it compares to the energy we waste by pointlessly redownloading and recompiling things (even large things) all the time in CI runs.

comprev 5 hours ago | parent | prev | next [-]

Optimising CI pipelines has been a major focus of my career so far.

Anybody can build a pipeline to get a task done (there are thousands of quick and shallow how-to blog posts), but doing it efficiently, so that it becomes a flywheel rather than a blocker for teams, is the hard part.

It's not just caching, but optimising job execution order and downstream dependencies too.

The faster it fails, the faster the developer feedback, and the faster a fix can be introduced.
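
As a sketch of the "cheapest first" idea (the tool names and timings are placeholders):

    import subprocess
    import sys

    # Cheapest checks first: a lint failure shouldn't cost a full
    # build-and-test cycle before the developer hears about it.
    STAGES = [
        ["ruff", "check", "."],        # seconds
        ["python", "-m", "pytest"],    # minutes
        ["make", "integration-test"],  # tens of minutes
    ]

    for cmd in STAGES:
        if subprocess.run(cmd).returncode != 0:
            sys.exit(1)  # fail fast: feedback now, not after the slow stages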

I quite enjoy the work, and I'm always learning new techniques to squeeze out extra performance or save time.

raverbashing 9 hours ago | parent | prev | next [-]

Also, for some reason, most CI runners seem to cache nothing except for that one minor thing you really don't want cached.

bombcar 5 hours ago | parent [-]

This is exactly it: it's easy to cache all the wrong things, cache the very code you wanted rebuilt, or cache nothing but one small critical file nobody knows about.

No wonder many just turn caching entirely off at some point and never turn it back on.

mschuster91 9 hours ago | parent | prev [-]

CI itself doesn't have to be a waste. The problem is that most people DGAF about caching.

marklit 10 hours ago | parent | prev | next [-]

I suspect web apps that "query" the GPKG files. Parquet can be queried surgically; I'm not sure there's a way to do the same with GPKG.
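
For instance, DuckDB's httpfs extension reads the Parquet footer and then fetches only the row groups and columns a query touches, using HTTP range requests, rather than pulling the whole file. A sketch, with a made-up URL and schema:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")

    # Only the footer metadata and the matching row groups/columns are
    # downloaded; the URL and column names here are hypothetical.
    rows = con.execute("""
        SELECT name
        FROM 'https://example.com/places.parquet'
        WHERE country_code = 'IT'
    """).fetchall()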

aitchnyu 11 hours ago | parent | prev | next [-]

Can we identify requests from CI servers reliably?

IshKebab 10 hours ago | parent | next [-]

You can reliably identify requests from GitHub's free CI, which probably covers 99% of the requests.

For example, GMP blocked GitHub:

https://www.theregister.com/2023/06/28/microsofts_github_gmp...

This "emergency measure" is still in place, but there are mirrors available so it doesn't actually matter too much.

ncruces 10 hours ago | parent [-]

I try to stick to GitHub for GitHub CI downloads.

E.g. my SQLite project downloads code from the GitHub mirror rather than Fossil.

Gigachad 10 hours ago | parent | prev [-]

Sure, have a JS script involved in generating a temporary download URL.

That way someone manually downloading the file is not impacted, but if you try to put the URL in a script, it won't work.
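
The usual shape of that is an HMAC-signed URL with the expiry baked into the signature. A sketch in Python (the secret and parameter names are arbitrary; the same few lines work server-side in JS):

    import hashlib
    import hmac
    import time

    SECRET = b"server-side-secret"  # placeholder; never shipped to the client

    def sign(path: str, ttl: int = 300) -> str:
        expires = int(time.time()) + ttl
        msg = f"{path}:{expires}".encode()
        sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
        return f"{path}?expires={expires}&sig={sig}"

    def verify(path: str, expires: int, sig: str) -> bool:
        msg = f"{path}:{expires}".encode()
        expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
        return time.time() < expires and hmac.compare_digest(expected, sig)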

eleveriven 6 hours ago | parent | prev [-]

Having some kind of lightweight auth (an API key, even just email-based) is a good compromise.
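
And the quota check on top of an API key is tiny. A sketch with made-up numbers (a real service would keep the counter in a database):

    from collections import defaultdict

    DAILY_QUOTA = 20 * 2**30      # e.g. 20 GiB/day per key; number is made up
    used = defaultdict(int)       # in production: a counter in Redis or the DB

    def allow_download(api_key: str, size_bytes: int) -> bool:
        if used[api_key] + size_bytes > DAILY_QUOTA:
            return False          # over quota: block, throttle, or send a bill
        used[api_key] += size_bytes
        return True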