Just a reminder that GitHub is not git.

The article mentions that most of these projects used GitHub as a central repo out of convenience, so there’s that, but they could also have used self-hosted repos.

machinationu 12 hours ago | parent | next [-]

Explain to me how you self-host a git repo which is accessed millions of times a day from CI jobs pulling packages.

freedomben 11 hours ago | parent | next [-]

I'm not sure whether this question was asked in good faith, but it is actually a damn good one.

I've looked into self-hosting a git repo that has horizontal scalability, and it is indeed very difficult. I don't have the time to detail it in a comment here, but for anyone who is curious, it's very informative to look at how GitLab handled this with Gitaly. I've also seen some clever attempts to use object storage, though I haven't seen any of those solutions put heavily to the test.

I'd love to hear from others about ideas and approaches they've heard about or tried.

https://gitlab.com/gitlab-org/gitaly

fweimer 11 hours ago | parent | prev | next [-]

These days, people solve similar problems by wrapping their data in an OCI container image and distributing it through one of the container registries that do not have a practically meaningful pull rate limit. Not really a joke, unfortunately.

mystifyingpoi 6 hours ago | parent [-]

Even Amazon encourages this, probably not intentionally, more as a band-aid for bad EKS config that people can end up with by mistake. But still, you can pull 5 terabytes from ECR for free under their free tier each month.

XorNot 4 hours ago | parent [-]

I'd say it's just that Kubernetes in general should've shipped with a storage engine and an installation mechanism. It's a very hacky-feeling add-on that RKE2 has a distributed internal registry if you enable it and use it in a very specific way. For the rate at which people love just shipping a Helm chart, it's actually absurdly hard to ship a self-contained installation without trying to hit internet resources.

ozim 12 hours ago | parent | prev | next [-]

FTFY:

Explain to me how you self-host a git repo, without spending any money and with no budget, which is accessed millions of times a day from CI jobs pulling packages.

adrianN 11 hours ago | parent | prev | next [-]

You git init --bare on a host with sufficient resources. But I would recommend thinking about your CI flow too.
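For anyone who hasn't set one up, here is a minimal sketch of that first step, driven from Python. It assumes the `git` CLI is installed and on the PATH; the helper name `create_bare_repo` and the `/srv/git/project.git` path in the note below are hypothetical, and nothing here addresses the scaling question.

```python
import subprocess
from pathlib import Path


def create_bare_repo(path: Path) -> Path:
    """Create a bare repository (no working tree) that clients can push to and pull from.

    Hypothetical helper: it only wraps `git init --bare <directory>`.
    """
    subprocess.run(
        ["git", "init", "--bare", str(path)],
        check=True,           # raise CalledProcessError if git exits non-zero
        capture_output=True,  # keep git's informational output off the console
    )
    return path
```

Calling `create_bare_repo(Path("/srv/git/project.git"))` on the host gives you a directory that clients can reach over SSH or a local path. Whether one host has "sufficient resources" is the part the rest of the thread argues about.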

machinationu 10 hours ago | parent [-]

No, hundreds of thousands of individual projects' CI jobs. OP was talking about package managers for the whole world, not for one company.

adrianN 8 hours ago | parent [-]

If people depend on remote downloads from different companies for their CI pipelines, they’re doing it wrong. Every sensible company sets up a mirror or at least a cache on infra that they control. Rate limiting downloads is the natural course of action for the provider of a package registry. Once you have so many unique users that even civilized use of your infrastructure becomes too much, you can probably hire a few people to build something more scalable.
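A rough sketch of the "cache on infra that you control" idea, as a pull-through cache. Everything here is illustrative: `PullThroughCache`, `fake_upstream`, and the package file name are made-up names, and a real mirror would persist to disk and handle expiry and integrity checks.

```python
from typing import Callable, Dict


class PullThroughCache:
    """Serve packages from local storage and contact upstream only on a miss."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]) -> None:
        self._fetch_upstream = fetch_upstream
        self._store: Dict[str, bytes] = {}  # in-memory stand-in for local disk
        self.upstream_hits = 0              # how often we actually left our own infra

    def get(self, name: str) -> bytes:
        if name not in self._store:
            self.upstream_hits += 1
            self._store[name] = self._fetch_upstream(name)
        return self._store[name]


def fake_upstream(name: str) -> bytes:
    # Hypothetical stand-in for a download from the public registry.
    return f"contents of {name}".encode()


cache = PullThroughCache(fake_upstream)
for _ in range(1000):  # 1000 CI jobs asking for the same file
    cache.get("example-1.0.0.tar.gz")
print(cache.upstream_hits)  # 1
```

A thousand CI jobs cost the registry one download instead of a thousand, which is the "civilized use" being argued for.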

machinationu 6 hours ago | parent [-]

numpy had 16M downloads yesterday; at 10 MB each that's 160 TB of traffic. It's one package. And there are no rate limits on PyPI. https://clickpy.clickhouse.com/dashboard/numpy
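For anyone who wants to check that arithmetic, a quick sketch. It assumes the 10 MB per download estimate from the comment (not a measured size) and decimal units, and the function names are made up for illustration.

```python
def daily_traffic_tb(downloads_per_day: int, package_size_mb: float) -> float:
    """Total daily traffic in terabytes, using decimal units (1 TB = 1,000,000 MB)."""
    return downloads_per_day * package_size_mb / 1_000_000


def average_gbit_per_s(traffic_tb_per_day: float) -> float:
    """Sustained bandwidth needed if the traffic were spread evenly over 24 hours."""
    bytes_per_day = traffic_tb_per_day * 1e12
    return bytes_per_day * 8 / 86_400 / 1e9  # bits per second, scaled to Gbit/s


traffic = daily_traffic_tb(16_000_000, 10)
print(f"{traffic:.0f} TB/day, about {average_gbit_per_s(traffic):.1f} Gbit/s sustained")
# prints "160 TB/day, about 14.8 Gbit/s sustained"
```

Real traffic is bursty rather than evenly spread, so peak demand would sit above that average.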

justincormack 12 hours ago | parent | prev [-]

They probably would have experienced issues way sooner, as the self-hosted tools don't scale nearly as well.