| ▲ | noirscape 7 hours ago |
| I can understand in theory why they wouldn't want to back up .git folders as-is. Git has a serious object count bloat problem if you have any repository with a good amount of commit history, which causes a lot of unnecessary overhead in just scanning the folder for files alone. I don't quite understand why it's still like this; it's probably the biggest reason why git tends to play poorly with a lot of filesystem tools (not just backups). If it'd been something like an SQLite database instead (just an example really), you wouldn't get so much unnecessary inode bloat. At the same time Backblaze is a backup solution. The need to back up everything is sort of baked in there. They promise to be the third backup solution in a three layer strategy (backup directly connected, backup in home, backup external), and that third one is probably the single most important one of them all since it's the one you're going to be touching the least in an ideal scenario. They really can't be excluding any files whatsoever. The cloud service exclusion is similarly bad, although much worse. Imagine getting hit by a cryptoworm. Your cloud storage tool is dutifully going to sync everything encrypted, junking up your entire storage across devices and because restoring old versions is both ass and near impossible at scale, you need an actual backup solution for that situation. Backblaze excluding files in those folders feels like a complete misunderstanding of what their purpose should be. |
|
| ▲ | adithyassekhar 5 hours ago | parent | next [-] |
| I don't think this is the right way to see it. Why should a file backup solution adapt to git, or to any application? It should not try to understand what a git object is. I'm paying them to copy files from a folder to their servers; just do that, no matter what the file is. Stay at the filesystem level, not the application level. |
| |
▲ | noirscape 5 hours ago | parent | next [-] | | I'm not saying Backblaze should adapt to git; the issue isn't application related (besides git being badly configured by default; there's a solution in git gc, it's just that git gc basically never runs). It's that to back up a folder on a filesystem, you need to traverse that folder and check every file in it to see if it's changed. Most filesystem tools assume a fairly low file count for these operations. Git, rather unusually, produces a lot of files in regular use; before packing, every commit/object is simply stored as its own file on the filesystem (branches only as pointer files). Packing fixes that by compressing commits and objects together, but it doesn't happen by default (only after an initial clone or when the garbage collector runs). Iterating over a .git folder can therefore take a long time in a code path that's typically not well optimized (most "normal" people don't have thousands of tiny files of sprawled-out application state in their folders). The correct solution is either for git to change, or for Backblaze to implement better iteration logic (which would probably require special handling for git..., so it'd be more "correct" to fix up git, since Backblaze's tools aren't the only ones with this problem). | | |
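To make the loose-object growth concrete, here's a throwaway-repo sketch (temp paths and identities are made up for illustration) showing what a scanner sees before and after packing:

```shell
# Throwaway repo: each commit leaves loose object files behind until gc packs them.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email you@example.com && git config user.name you
for i in 1 2 3; do
  echo "$i" > "file$i"
  git add "file$i" && git commit -qm "commit $i"
done
git count-objects -v   # "count:" = loose objects, each one its own inode
git gc -q              # pack loose objects into a single pack file
git count-objects -v   # "count: 0", everything now under "in-pack:"
```

A backup agent walking .git before the `git gc` line sees one file per object; afterwards it sees a handful of pack files instead.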
▲ | masfuerte 4 hours ago | parent | next [-] | | 7za (the compression app) does blazingly fast iteration over any kind of folder, and it doesn't require special code for git. Backblaze's backup app could do the same, but rather than fix their code they excluded .git folders. When I back up my computer, the .git folders are among the most important things on there; most of my personal projects aren't pushed to GitHub or anywhere else. Fortunately I don't use Backblaze. I guess the moral is: don't use a backup solution where the vendor has an incentive to exclude things. | |
▲ | NetMageSCW 2 hours ago | parent | prev [-] | | Actually, once the initial backup is done there is no reason to scan for changes at all. On Windows they can subscribe to file-change notifications (e.g. the NTFS USN change journal or ReadDirectoryChangesW) and just add each modified or created file to their backup list. |
| |
| ▲ | Saris 3 hours ago | parent | prev [-] | | Backblaze offers 'unlimited' backup space, so they have to do this kind of thing as a result of that poor marketing choice. | | |
▲ | conductr an hour ago | parent [-] | | No they don't. They just have to price the product to reflect changing usage patterns. When Backblaze started, the pitch was simply "we back up all the files on your drive"; they didn't even have a restore feature, that was your job when you needed it. Over time user behavior changed: cloud drives became a huge data source they hadn't priced in, git gave them problems they hadn't factored in, etc. The issue is that their solution is to exclude those things, which leaves them a half-baked product for many of their users. They should have just changed the pricing and supported the backups people actually need today. |
|
|
|
| ▲ | Ajedi32 an hour ago | parent | prev | next [-] |
| FWIW some other people in this thread are saying the article is wrong about .git folders not being backed up: https://news.ycombinator.com/item?id=47765788 That's a really important fact that's getting buried, so I'd like to highlight it here. |
|
| ▲ | rmccue 6 hours ago | parent | prev | next [-] |
| I think it's understandable for both Backblaze and most users, but surely the solution is to add `.git` to their default exclusion list which the user can manage. |
|
| ▲ | maalhamdan 6 hours ago | parent | prev | next [-] |
| I think they shouldn't back up git objects individually, because git already handles the versioning information. Just compress the .git folder itself and back it up as a single unit. |
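git actually ships a tool for roughly this: `git bundle` serializes all refs plus history into a single file that a backup agent can treat as one opaque blob (a sketch with throwaway paths; note it captures committed history only, not untracked or modified working-tree files):

```shell
# Sketch: snapshot a whole repo's history as a single file.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email you@example.com && git config user.name you
echo hello > readme.txt && git add readme.txt && git commit -qm "init"
git bundle create "$tmp/repo.bundle" --all   # one file: all refs + objects
git bundle verify "$tmp/repo.bundle"         # check integrity before trusting it
# restore later with: git clone "$tmp/repo.bundle" restored
```

From the backup tool's point of view this is one big file that changes occasionally, which is exactly the access pattern it's optimized for.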
| |
▲ | willis936 6 hours ago | parent | next [-] | | Better yet, include deduplication, incremental versioning, verification, and encryption. Wait, that's borg / restic. This is a joke, but honestly nobody here should be directly backing up their raw filesystems; use the right tool for the job instead. You'll make the world a more efficient place, have more robust and quicker-to-recover backups, and save some money along the way. | |
▲ | pkaeding 6 hours ago | parent | prev [-] | | This is a good point, but you might still expect them to back up untracked and modified files in the repo, along with everything else on your filesystem. | | |
▲ | pixl97 4 hours ago | parent [-] | | Eh, you really shouldn't do that for any kind of file that acts like an (impromptu) database. That's how you get corruption, especially when change information can be split across more than one file. | | |
▲ | pkaeding 2 hours ago | parent [-] | | Sorry, what are you saying shouldn't be done? Backing up untracked/modified files in a git repo? Or compressing the .git folder and backing it up as a unit? |
|
|
|
|
| ▲ | rcxdude 6 hours ago | parent | prev | next [-] |
| It's probably primarily because Linus is a kernel and filesystem nerd, not a database nerd, so he preferred to just use the filesystem, whose performance characteristics he understood well (at least on Linux). |
|
| ▲ | ciupicri 6 hours ago | parent | prev | next [-] |
| > If it'd been something like an SQLite database instead (just an example really) See Fossil (https://fossil-scm.org/). P.S. There's also SourceGear Vault (https://www.sourcegear.com/vault/): > SourceGear Vault Pro is a version control and bug tracking solution for professional development teams. Vault Standard is for those who only want version control. Vault is based on a client/server architecture using technologies such as Microsoft SQL Server and IIS Web Services for increased performance, scalability, and security. |
|
| ▲ | grumbelbart2 6 hours ago | parent | prev | next [-] |
| Git packs objects into pack-files on a regular basis. If it doesn't, check your configuration, or do it manually with 'git repack'. |
| |
▲ | noirscape 5 hours ago | parent [-] | | I decided to look into this (git gc should also be doing this), and I think I figured out why it's such a consistent issue with git in particular. Running git gc does properly pack objects together and reduces the inode count to something much more manageable. It's the same reason the postgres autovacuum daemon tends to be borderline useless unless you retune it[0]: the defaults are barmy. git gc only auto-runs once there are 6700 loose (unpacked) objects[1]. Most typical filesystem tools start balking at traversing ~1000 files in a structure (it depends a bit on the filesystem/OS as well; Windows tends to get slow a good bit earlier than Linux). To fix it, running > git config --global gc.auto 1000 retunes it, and subsequent commits to your repos will trigger garbage collection once around 1000 loose objects have accumulated. Pack-file management seems properly tuned by default; at more than 50 packs, gc will consolidate them into a larger pack. [0]: For anyone curious, the default autovacuum settings only vacuum a table once roughly 20% of it consists of dead tuples (roughly: deleted rows plus every old revision of an updated row). If you're working with a beefy table, you're never hitting 20%. Either tune it down or create an external cronjob to run vacuum analyze more frequently on the tables you need to keep speedy. I'm pretty sure the defaults are tuned solely to keep Postgres' internal tables fast, since those are small enough that the default threshold is actually reached. [1]: https://git-scm.com/docs/git-gc | | |
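For a single repo, the same tuning works without `--global` (a quick sketch in a throwaway repo, so nothing machine-wide is touched):

```shell
# Scope the gc threshold to one repo instead of the whole machine.
tmp=$(mktemp -d) && cd "$tmp" && git init -q
git config gc.auto 1000   # auto-gc now kicks in near ~1,000 loose objects, not 6,700
git config gc.auto        # prints: 1000
```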
| ▲ | LetTheSmokeOut 4 hours ago | parent | next [-] | | I needed to use > git config --global gc.auto 1000 with the long option name, and no `=`. | |
▲ | Dylan16807 4 hours ago | parent | prev [-] | | A few thousand files shouldn't be a problem for a program designed to scan entire drives of files. Even in a single folder, and even allowing for sloppy programs, I wouldn't worry just yet; and git isn't putting them in a single folder anyway (loose objects are fanned out across 256 subdirectories). |
|
|
|
| ▲ | yangm97 5 hours ago | parent | prev | next [-] |
| You don't see ZFS/BTRFS block-based snapshot replication choking on git or any other sort of dataset. Use the right tool for the job, or something. |
|
| ▲ | 4 hours ago | parent | prev [-] |
| [deleted] |