glitchc 5 days ago

No. This is not a solution.

While git LFS is just a kludge for now, writing a filter argument during the clone operation is not the long-term solution either.

Git clone is the very first command most people will run when learning how to use git. Emphasized for effect: the very first command.

Will they remember to write the filter? Maybe, if the tutorial for the cool codebase they're trying to access mentions it. Maybe not. What happens if they don't? The clone may take a very long time with no obvious indication of why. And if they do? The cloned repo might not be compilable/usable, since the blobs are missing.

Say they do get it right. Will they understand it? Most likely not. We are exposing the inner workings of git on the very first command they learn. What's a blob? Why do I need to filter on it? Where are blobs stored? It's classic abstraction leakage.

This is a solved problem: Rsync does it. Just port the bloody implementation over. It does mean supporting alternative representations or moving away from blobs altogether, which git maintainers seem unwilling to do.
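
For reference, the clone-time filter being discussed looks roughly like this; the URL and size threshold are placeholders, and --filter is git's partial-clone option:

    # Skip blobs larger than ~1 MB at clone time; they are fetched on demand later.
    git clone --filter=blob:limit=1m https://example.com/big-repo.git

    # Or skip all blob history up front and fetch only what the checkout needs.
    git clone --filter=blob:none https://example.com/big-repo.git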

IshKebab 5 days ago | parent | next [-]

I totally agree. This follows a long tradition of Git "fixing" things by adding a flag that 99% of users won't ever discover. They never fix the defaults.

And yes, you can fix defaults without breaking backwards compatibility.

Jenk 5 days ago | parent [-]

> They never fix the defaults

Not strictly true. They did change the default push behaviour from "matching" to "simple" in Git 2.0.
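
For context, that default lives in the push.default setting, and either behaviour can still be chosen explicitly (values as documented in git-config):

    # Pre-2.0 default: push all branches whose names match branches on the remote.
    git config --global push.default matching

    # Default since Git 2.0: push only the current branch to its upstream counterpart.
    git config --global push.default simple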

hinkley 5 days ago | parent [-]

So what was the second time the stopped watch was right?

I agree with GP. The git community is very fond of shipping checkbox fixes for team problems, fixes that aren't or can't be made the default and so require constant user intervention to work. See also some of the sparse checkout systems and adding notes to commits after the fact. They only work if you turn every pull and push into a flurry of activity, which means they will never work from your IDE. Those are non-fixes that pollute the space for actual fixes.
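
As a rough illustration of the kind of opt-in workflows being referred to (the paths, note message, and remote name are placeholders):

    # Sparse checkout: per-clone, opt-in setup that every user has to repeat.
    git sparse-checkout init --cone
    git sparse-checkout set src/ docs/

    # Notes attached to commits after the fact; they aren't pushed or fetched by default.
    git notes add -m "built by CI run 1234" HEAD
    git push origin refs/notes/commits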

smohare 5 days ago | parent [-]

I’ve used git since its inception. Never once in an “IDE”. Should users that refuse to learn the tool really be the target?

I’m not trying to argue that interface doesn’t matter. I use jq enough to be in that unfortunate category where I despise its interface. But it is difficult for me to imagine being similarly incapable in git.

hinkley 4 days ago | parent | next [-]

Developers who insist that tools and techniques are personal rather than a group decision generally get talked about unkindly. We are all in this together and you have to support things you don’t even use. That’s the facts on the ground, and more importantly, that’s the job.

IshKebab 4 days ago | parent | prev [-]

> Should users that refuse to learn the tool really be the target?

Maybe not, but that's not the only group of people that are affected. It also affects beginners and people that don't want to exhaustively read the manual.

Should they be the target? Obviously yes.

TGower 5 days ago | parent | prev | next [-]

> The cloned repo might not be compilable/usable since the blobs are missing.

Only the histories of the blobs are filtered out.
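
A minimal sketch of what that means in practice (the URL and revision are illustrative):

    # Blobless clone: full commit and tree history, but old blob versions are omitted.
    git clone --filter=blob:none https://example.com/big-repo.git
    cd big-repo
    git log --oneline       # the whole history is still browsable
    git checkout HEAD~20    # blobs missing for this revision are fetched on demand here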

ks2048 5 days ago | parent | prev | next [-]

> This is a solved problem: Rsync does it.

Can you explain what the solution is? I don't mean the details of the rsync algorithm, but rather what it would look like from the user's perspective. What files are on your local filesystem when you do a "git clone"?

hinkley 5 days ago | parent [-]

With a shallow clone, none of the historical versions would be present. With a full clone, however, you get a full copy of each version of each blob, and what is being suggested is to treat each revision as an rsync operation upon the last. And the more times you muck with a file, which can happen a lot both with assets and if you check in your deps to get exact snapshotting of code, the more big-file churn you get.

tatersolid 5 days ago | parent [-]

The overwhelming majority of large assets (images, audio, video) will receive near-zero benefit from using the rsync algorithm. The formats generally have massive byte-level differences even after small “tweaks” to a file.

TZubiri 5 days ago | parent | next [-]

Video might be strictly out of scope for git; consider that not even YouTube allows 'updating' a video.

This will sound absolutely insane, but maybe the source code for the video should be a script? Then the process of building produces a video which is a release artifact?

qubidt 5 days ago | parent | next [-]

This is relatively niche, but that's a thing for anime fan-encodes. Some groups publish their vapoursynth scripts, allowing you to produce the same re-encoding (given that you have the same source video), e.g.:

* https://github.com/LightArrowsEXE/Encoding-Projects

* https://github.com/Beatrice-Raws/encode-scripts

TZubiri 3 days ago | parent [-]

Hm, the video itself would probably be referenced by an indexable identifier like "Anime X Season 1 Chapter 5", and provisioning the actual video would be up to the builder (probably from some torrent network, or from a DVD, although no one will do that).

yencabulator 4 days ago | parent | prev | next [-]

> This will sound absolutely insane, but maybe the source code for the video should be a script? Then the process of building produces a video which is a release artifact?

It already kinda is, but that just means you now need access to all the raw footage, and rendering a video file in high quality & good compression takes a long time.

https://en.wikipedia.org/wiki/Edit_decision_list

TZubiri 3 days ago | parent | next [-]

I see. I think in that case the raw footage, along with the EDL, would still be the source code. What I'm suggesting is that the raw footage itself would be an output of the real source code: the script and the filming plans.

Silly idea, but it's worth thinking about this stuff in an era where the line between source code and target code is being blurred with prompts.

TZubiri 3 days ago | parent | prev [-]

Which is only a problem if you think of "building" as something that should be instantaneous, or take a couple of hours tops.

This is similar to replicability in science: there are some experiments that are immensely expensive to replicate, like the LHC, but they still ARE technically replicable.

izacus 5 days ago | parent | prev [-]

That is nowhere near practical for even basic use cases like a website or a mobile app.

TZubiri 3 days ago | parent [-]

Isn't it? In practice it means that the "video" should live outside of the git repo; you could just download it from an external repo, and you'd always have the script to recreate it if it ever goes down.

For example:

PromotionalDemo.mp4.script

"Make a video 10 seconds long showcasing the video, a voice in off should say 'We can click here if we want to do this, or click there if we want to go there'. 1024*768 resolution. Male voice. Perky attitude"

cyberax 5 days ago | parent | prev [-]

A lot of video editing includes splicing/deleting some footage, rather than full video rework. rsync, with its rolling hash approach, can work wonders for this use-case.
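
A hypothetical demonstration, assuming the receiving side already has the older copy of the file (all paths and sizes are made up):

    # Splice roughly 10 MB out of the middle of a large video.
    head -c 10485760 cut.mp4   >  spliced.mp4
    tail -c +20971521 cut.mp4  >> spliced.mp4

    # Over the network rsync uses its rolling-hash delta algorithm by default;
    # the "Matched data" line in --stats shows how much of the old copy was reused.
    rsync -av --stats spliced.mp4 backup-host:/videos/cut.mp4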

bogwog 4 days ago | parent | prev | next [-]

Maybe a manual filter isn't the right solution, but this does seem to add a lot of missing pieces.

The first time you try to commit on a new install, git nags you to set your email address and name. I could see something similar happen the first time you clone a repo that hits the default global filter size, with instructions on how to disable it globally.

> The cloned repo might not be compilable/usable since the blobs are missing.

Maybe I misunderstood the article, but isn't the point of the filter to prevent downloading the full history of big files, and instead only check out the required version (like LFS does)?

So a filter of 1 byte will always give you a working tree, but trying to check out a prior commit will require a full download of all files.

spyrja 5 days ago | parent | prev | next [-]

Would it be incorrect to say that most of the bloat relates to historical revisions? If not, maybe an rsync-like behavior starting with the most current version of the files would be the best starting point. (Which is all most people will need anyhow.)

pizza234 5 days ago | parent [-]

> Would it be incorrect to say that most of the bloat relates to historical revisions?

Based on my experience (YMMV), I think it is incorrect, yes, because any time I've performed a shallow clone of a repository, the saving wasn't as much as one would intuitively imagine (in other words: history is stored very efficiently).
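
One way to check this for a particular repository (the URL is a placeholder):

    # Compare on-disk size of the object store with and without history.
    git clone https://example.com/repo.git full
    git clone --depth=1 https://example.com/repo.git shallow
    du -sh full/.git shallow/.git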

spyrja 5 days ago | parent [-]

Doing a bit of digging seems to confirm that, considering that git actually does remove a lot of redundant data during the garbage collection phase. It does, however, store complete files (unlike a VCS like Mercurial, which stores deltas), so it still might benefit from a download-the-current-snapshot-first approach.

cesarb 5 days ago | parent [-]

> It does however store complete files (unlike a VCS like mercurial which stores deltas)

The logical model of git is that it stores complete files. The physical model of git is that these complete files are stored as deltas within pack files (except for new objects which haven't been packed yet; by default git automatically packs once there are too many of these loose objects, and objects are always packed in its network protocol when sending or receiving).
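
A couple of commands make the physical model visible (run inside any clone; the pack path assumes the default .git layout):

    # Counts of loose objects vs. objects stored in packs.
    git count-objects -v

    # Per-object details for a pack; delta-compressed objects also list a depth and base.
    git verify-pack -v .git/objects/pack/pack-*.idx | head -n 20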

olddustytrail 5 days ago | parent [-]

Yes, the problem really stems from the fact that git "understands" text files but not much else, so it can't make a good diff between, say, a JPEG and its updated version; it simply relies on compression for those other formats.

It would be nice to have a VCS that could manage these more effectively, but most binary formats don't lend themselves to that, even when the change is just an additional layer on an image.

I reckon there's still room for better image and video formats that would work better with VCS.

expenses3 5 days ago | parent | prev | next [-]

Exactly. If large files suck in git then that's because the git backend and cloning mechanism suck for them. Fix that and then let us move on.

krupan 4 days ago | parent [-]

That's exactly what these changes do, but they don't become the default because a lot of people only store text in git, so they don't want the downsides of these changes.

xyzsparetimexyz 2 days ago | parent [-]

What changes? The partial clone stuff doesn't help me, given that I generally want the large files to be checked out. And how does the large object provider stuff work if you're not using a git forge?

matheusmoreira 5 days ago | parent | prev | next [-]

It is a solution. The fact that beginners might not understand it doesn't really matter; solutions need not perish on that alone. Clone is a command people usually run once while setting up a repository. Maybe the case could be made that this behavior should be the default and that full clones should be opt-in, but that's a separate issue.

TZubiri 5 days ago | parent | prev | next [-]

"Will they remember to write the filter? Maybe, "

Nothing wrong with "forgetting" to write the filter, and then, if the clone is taking more than 10 minutes, writing the filter.

Too 5 days ago | parent [-]

What? Why would you want to expose a beginner to waiting 10 minutes unnecessarily? How would they even know what they did wrong, or what a reasonable time to wait is? Ask ChatGPT "why is my git clone taking 10 minutes"?!

Is this really the best we can do in terms of user experience? No. Git needs to step up.

TZubiri 5 days ago | parent [-]

Git is not for beginners in general, and large repos even less so.

A beginner will follow the instructions in a README: "run git clone" or "run git clone --depth=1".
