bob1029 5 days ago

> Large object promisors are special Git remotes that only house large files.

I like this approach. If I could configure my repos to use something like S3, I would switch away from using LFS. S3 seems like a really good fit for large blobs in a VCS. The intelligent tiering feature can move data into colder tiers of storage as history naturally accumulates and old things are forgotten. I wouldn't mind a historical checkout taking half a day (i.e., restored from a robotic tape library) if I am pulling in stuff from a decade ago.
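
Something like the following Intelligent-Tiering config is what I have in mind; the bucket name and day thresholds are just examples, and objects would have to be uploaded in the INTELLIGENT_TIERING storage class for it to apply:

```sh
# Example only: archive rarely-read blobs after 90/180 days without access.
aws s3api put-bucket-intelligent-tiering-configuration \
  --bucket vcs-large-blobs \
  --id archive-old-history \
  --intelligent-tiering-configuration '{
    "Id": "archive-old-history",
    "Status": "Enabled",
    "Tierings": [
      {"Days": 90,  "AccessTier": "ARCHIVE_ACCESS"},
      {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
    ]
  }'

# Blobs would be uploaded along the lines of:
#   aws s3 cp big.blob s3://vcs-large-blobs/ --storage-class INTELLIGENT_TIERING
```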

riedel 5 days ago | parent | next [-]

The article mentions alternatives to Git LFS like git-annex that already support S3 (which IMHO is still a bit of a pain in the ass on Windows due to the symlink workflow). DVC also plays nicely with Git and S3, and GitLab btw simply offloads Git LFS to S3. All have their quirks. I typically opt for LFS as a no-brainer, but use the others when they fit the workflow and the infrastructure requirements.

Edit: The hash algorithm and the change detection in particular (and when that detection happens) make a difference when you have 2 GB files and not just the 25 MB file from the OP.
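
For reference, pointing either tool at S3 is roughly this (remote and bucket names are placeholders, and both expect AWS credentials in the environment):

```sh
# git-annex: create an S3 special remote
git annex initremote blobstore type=S3 encryption=none bucket=my-annex-blobs

# dvc: add an S3 remote and make it the default
dvc remote add -d blobstore s3://my-dvc-blobs/files
```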

a_t48 5 days ago | parent | prev | next [-]

At my current job I've started caching all of our LFS objects in a bucket, for cost reasons. Every time a PR build runs, I get the list of objects via `git lfs ls-files`, sync them down from GCS, run `git lfs checkout` to actually populate the repo from the object store, and then `git lfs pull` to pick up anything not cached. If there were uncached objects, I push them back up via `gcloud storage rsync`. Simple, doesn't require any configuration for developers (who only ever have to pull new objects), and keeps the GitHub UI unconfused about the state of the repo.
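
Roughly, the CI step looks like this (bucket name is made up, the clone is done with `GIT_LFS_SKIP_SMUDGE=1` so the working tree still has pointer files, and the real version only syncs the objects listed by `git lfs ls-files` rather than the whole bucket):

```sh
#!/usr/bin/env bash
set -euo pipefail

BUCKET=gs://example-lfs-cache      # illustrative bucket name
LFS_DIR=.git/lfs/objects

# Which LFS objects does this checkout need?
# (the real version filters the rsync down to just these)
git lfs ls-files --long > /tmp/lfs-objects.txt

# Pull whatever the bucket already has into the local LFS object store.
mkdir -p "$LFS_DIR"
gcloud storage rsync --recursive "$BUCKET" "$LFS_DIR"

# Materialize working-tree files from the local store, then fetch
# anything the cache was missing from the real LFS endpoint.
git lfs checkout
git lfs pull

# Push newly fetched objects back up so the next build hits the cache.
gcloud storage rsync --recursive "$LFS_DIR" "$BUCKET"
```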

I'd initially looked at spinning up an LFS backend, but this solves the main pain point for now. GitHub was charging us an arm and a leg for pulling LFS files in CI: each checkout is fresh and the caching model is non-ideal (max 10 GB cache, impossible to share between branches), so we end up pulling a bunch of data that is unfortunately in LFS on every commit, possibly multiple times. They happily charge us for all that bandwidth and don't provide tools to make it easy to reduce it (let me pay for more cache size, or warm workers with an entire cache disk, or better cache control, or...).

...and if I want to enable this for developers it's relatively easy: just add a git hook that runs the same set of operations locally.

tagraves 5 days ago | parent | next [-]

We use a somewhat similar approach in RWX when pulling LFS files[1]. We run `git lfs ls-files` to get a list of the LFS files, then pass that list into a task which pulls each file from the LFS endpoint using curl. Since RWX caches task outputs as long as their inputs don't change, the LFS files just stay in the RWX cache and are pulled from there on future clones in CI. In addition to saving on GitHub's LFS bandwidth costs, the RWX cache is also _much_ faster to restore from than `git lfs pull`.

[1] https://github.com/rwx-cloud/packages/blob/main/git/clone/bi...
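
For the curious, "pull each file with curl" boils down to the LFS batch API; a sketch with placeholder repo URL, OID, and size (private repos also need an Authorization header):

```sh
# Ask the LFS endpoint where to download one object from.
curl -s -X POST \
  -H "Accept: application/vnd.git-lfs+json" \
  -H "Content-Type: application/vnd.git-lfs+json" \
  -d '{"operation": "download",
       "transfers": ["basic"],
       "objects": [{"oid": "<sha256 from git lfs ls-files --long>", "size": 12345}]}' \
  https://github.com/org/repo.git/info/lfs/objects/batch |
  jq -r '.objects[0].actions.download.href'
# The href that comes back is a plain (usually pre-signed) URL you can curl directly.
```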

a_t48 5 days ago | parent [-]

Nice! I was considering using some sort of pull-through cache like this, but went with the solution that didn't require setting up more infra than a bucket.

cyberax 5 days ago | parent | prev | next [-]

> let me pay for more cache size

Apparently, this is coming in Q3 according to their public roadmap: https://github.com/github/roadmap/issues/1029

a_t48 5 days ago | parent [-]

Awesome. Technically you can go over the limit right now (ours was saying 93/10GB last I checked), but I don’t know the eviction policy. I’d rather pay a bit more and know for sure when data will stick around.

gmm1990 5 days ago | parent | prev [-]

Why not run some open-source CI locally, or on the Google equivalent of EC2, if you're already going to this much trouble customizing GitHub CI?

a_t48 5 days ago | parent [-]

It was half a day of work to make a drop-in action.yml that does this. Saved a bunch of money (both in bandwidth and builder minutes), well worth the investment. It really wasn't a lot of customization.

All our builds are on GHA definitions, there’s no way it’s worth it to swap us over to another build system, administer it, etc. Our team is small (two at the time, but hopefully doubling soon!), and there’s barely a dozen people in the whole engineering org. The next hit list item is to move from GH hosted builders to GCE workers to get a warmer docker cache (a bunch of our build time is spent pulling images that haven’t changed) - it will also save a chunk of change (GCE workers are 4x cheaper per minute and the caching will make for faster builds), but the opportunity cost for me tackling that is quite high.

gmm1990 4 days ago | parent | next [-]

Ah interesting, I was just curious. I've wasted some time setting up CI runners on bare-metal servers just because I've heard runners from GitLab/GitHub can be expensive.

fmbb 5 days ago | parent | prev [-]

Doesn’t the official Docker build-push-action support caching with the GitHub Actions cache?

a_t48 4 days ago | parent [-]

Yes, but one image push for us is >10 GB, due to ML dependencies. And even if it is intelligent and supports per-layer caching, you can’t share between release branches - https://github.com/docker/build-push-action/issues/862.

And even if that did work, I’ve found it much more reliable to use the actual Docker buildx disk state than to try and get caching for complex multi-stage builds working reliably. I have a case right now where there’s no combination of --cache-to/--cache-from flags that will give me a 100% cached rebuild starting from a fresh builder, using only remote cache. I should probably report it to the Docker team, but I don’t have a minimal repro right now and there’s a 10% chance it’s actually my fault.
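
For context, the kind of invocation I mean (image and cache refs are illustrative):

```sh
# Remote registry cache: export max-mode cache alongside the image.
docker buildx build \
  --cache-from type=registry,ref=us-docker.pkg.dev/proj/cache/app:buildcache \
  --cache-to   type=registry,ref=us-docker.pkg.dev/proj/cache/app:buildcache,mode=max \
  -t app:dev .

# Or the GitHub Actions cache backend discussed above.
docker buildx build \
  --cache-from type=gha \
  --cache-to   type=gha,mode=max \
  -t app:dev .
```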

kylegalbraith 4 days ago | parent [-]

You should try this with Depot [0]. I’m a founder of it and this is definitely one of the use cases we built it for. We persist your layer cache on a real NVMe device and reattach it automatically across builds. No more saving your layer cache over the network or squeezing it into the resource-constrained GitHub Actions cache.

[0] https://depot.dev

nullwarp 5 days ago | parent | prev | next [-]

Same, and I never understood why it wasn't the default from the get-go, but maybe S3 wasn't so synonymous with object storage when LFS first came out.

I run a small git LFS server because of this and will be happy to switch away the second I can get git to natively support S3.

_bent 5 days ago | parent | prev | next [-]

I'm currently running https://github.com/datopian/giftless to store the LFS files belonging to repos I have on GitHub on my homelab MinIO instance.

There are a couple other projects that bridge S3 and LFS, though I had the most success with this setup.
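
Pointing a repo at a self-hosted server like this is just an `.lfsconfig` entry committed to the repo (URL is illustrative):

```sh
git config -f .lfsconfig lfs.url "https://lfs.example.com/myorg/myrepo"
git add .lfsconfig
git commit -m "Use self-hosted LFS server"
```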

account42 3 days ago | parent | prev | next [-]

Why does the Git client need specific support for this, though? What's stopping a Git host from redirecting requests for certain objects to a different host and refusing to pack them into bundles today?

johnisgood 5 days ago | parent | prev [-]

Is S3 related to Amazon?

bayindirh 5 days ago | parent | next [-]

You can install your own S3-compatible storage system on premises. It can be anything from a simple daemon (Scality, JuiceFS) to a small appliance (TrueNAS) to a full-blown storage cluster (Ceph). OpenStack has its own object storage service (Swift).

If you fancy it for your datacenter, big players (Fujitsu, Lenovo, Huawei, HPE) will happily sell you "object storage" systems which also support S3 at very high speeds.
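
Most S3 tooling only needs the endpoint swapped to talk to these systems, e.g. (hostname and bucket are made up):

```sh
# Point the stock AWS CLI at a self-hosted S3-compatible service.
aws --endpoint-url https://s3.internal.example.com s3 ls s3://backups/
```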

yugoslavia4ever 5 days ago | parent [-]

And for CI and local development testing you can use LocalStack, which runs in a Docker container and implements most AWS services.
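
Something like this (bucket name is made up; LocalStack accepts any dummy credentials):

```sh
# Start LocalStack's edge endpoint on its default port.
docker run -d -p 4566:4566 localstack/localstack

# Point AWS tooling at it for local/CI tests.
export AWS_ACCESS_KEY_ID=test AWS_SECRET_ACCESS_KEY=test AWS_DEFAULT_REGION=us-east-1
aws --endpoint-url http://localhost:4566 s3 mb s3://test-bucket
aws --endpoint-url http://localhost:4566 s3 cp ./bigfile.bin s3://test-bucket/
```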

bayindirh 5 days ago | parent [-]

Oh, that sounds interesting. We don't use AWS, but it's a nice alternative for people using AWS for their underpinnings.

Scality's open-source S3 server can also run in a container.

StopDisinfo910 5 days ago | parent | prev | next [-]

Yes, S3 is the name of Amazon's object storage service. Various players in the industry have started offering solutions with a compatible API, which some people loosely call S3 too.

flohofwoe 5 days ago | parent | prev [-]

Yeah it's AWS's 'cloud storage service'.

dotancohen 5 days ago | parent [-]

It's actually 'Simple Storage Service', hence the acronym S3.