Sophira 4 days ago

How does this upcoming feature deal with the potential problem of fake commit IDs?

Commit IDs are based on a number of factors about the commit, including the actual contents and the commit ID of the parent commit. Any fully cloned git repository can theoretically be audited to make sure that all its commit IDs are correct. Nobody does this (although perhaps git does automatically?), but it's possible.

But now, picture a git repository that has a one petabyte file in one of its early commits (deleted again in a later commit). Pretty much nobody is going to have the space required to download this, so many people will not even bother to do so. As such, what's to stop the server from just claiming any commit ID it wanted for this particular commit? Who's going to check?

(Bonus: For that matter, is the one petabyte file even real? Or just a faked size in the metadata?)

To be clear, I assume people have already thought about these issues. I'm just curious what the answers are.

kbolino 4 days ago

If you think of a commit as a Merkle tree, then a file's content is a leaf node: the tree above it records only the content's hash, so the commit ID can be verified without the content itself. The content either exists or it doesn't. Non-existence creates usability problems, but not verifiability problems.
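To make the Merkle argument concrete, here is a minimal Python sketch of how git derives object IDs, assuming the default SHA-1 object format. The helper names (`git_object_id`, `tree_body`, `commit_body`, `verify`), the file name, and the author line are made up for illustration. The point is that `verify` never sees the file content, only the tree and commit objects.

```python
import hashlib

# Hypothetical identity line, for illustration only.
AUTHOR = "A U Thor <author@example.com> 0 +0000"


def git_object_id(obj_type: str, body: bytes) -> str:
    # git names an object by hashing "<type> <size>", a NUL byte, then the body.
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    # Each entry is "<mode> <name>", a NUL byte, then the raw 20-byte object ID.
    # Sorting by plain name is enough for this flat, files-only example.
    return b"".join(
        f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(hex_id)
        for mode, name, hex_id in sorted(entries, key=lambda e: e[1])
    )


def commit_body(tree_id, parent_ids, author, message):
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {author}", f"committer {author}", "", message]
    return ("\n".join(lines) + "\n").encode()


def verify(commit_bytes, tree_bytes, claimed_commit_id):
    # Recompute both IDs from the objects themselves; no blob content needed.
    tree_id = git_object_id("tree", tree_bytes)
    names_tree = commit_bytes.startswith(f"tree {tree_id}\n".encode())
    return names_tree and git_object_id("commit", commit_bytes) == claimed_commit_id


# Build a tiny history whose only file stands in for the huge one.
blob_id = git_object_id("blob", b"pretend this is enormous\n")
tree = tree_body([("100644", "big.bin", blob_id)])
commit = commit_body(git_object_id("tree", tree), [], AUTHOR, "Add big file")
commit_id = git_object_id("commit", commit)

# A tree that claims a different blob ID no longer matches the commit.
forged_tree = tree_body([("100644", "big.bin", "0" * 40)])

print(verify(commit, tree, commit_id))         # honest objects
print(verify(commit, forged_tree, commit_id))  # forged blob reference
```

Changing the blob ID recorded in the tree changes the tree ID, which changes the commit ID, so a server cannot swap in an arbitrary ID without the mismatch showing up in objects that are small enough for everyone to download.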

Of course, remote files can be used to sneak things in, but those things still have to get approved the same as any other commit content. You should not approve PRs etc. that reference remote files you haven't verified. And, while remote storage could be vulnerable to collision attacks in a way that git itself mostly isn't, git-lfs for example already uses SHA-256.
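As a rough illustration of what verifying a remote file can look like, here is a Python sketch that checks downloaded bytes against a git-lfs style pointer, which records a SHA-256 `oid` and a `size`. The function names and the sample content are hypothetical; this is not the git-lfs client's own code, only the same check done by hand.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    # A pointer file is a few "key value" lines: version, oid, size.
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return {"oid": digest, "size": int(fields["size"])}


def matches_pointer(pointer: dict, chunks) -> bool:
    # Stream the content so a large file never has to fit in memory.
    digest = hashlib.sha256()
    total = 0
    for chunk in chunks:
        digest.update(chunk)
        total += len(chunk)
    return total == pointer["size"] and digest.hexdigest() == pointer["oid"]


# Build a pointer for some stand-in content, then check two downloads against it.
content = b"hello\n"
pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{hashlib.sha256(content).hexdigest()}\n"
    f"size {len(content)}\n"
)
pointer = parse_lfs_pointer(pointer_text)

print(matches_pointer(pointer, [b"hel", b"lo\n"]))  # same bytes, delivered in chunks
print(matches_pointer(pointer, [b"hello!"]))        # same length, different bytes
```

Because the pointer is ordinary commit content, it is covered by the commit ID like any other file. A faked size or substituted payload is only detectable once someone actually downloads the bytes, which is the usability cost mentioned above.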