ai-inquisitor 2 hours ago
It's not doing that. If you look at the repository, it's adding a new commit with tiny Parquet files every 5 minutes. This recent one was only a 20.9 KB Parquet file: https://huggingface.co/datasets/open-index/hacker-news/commi... and the ones before it had a median size of 5 KB: https://huggingface.co/datasets/open-index/hacker-news/tree/... The bigger concern is how large the git history is going to get on the repository.
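For a rough sense of scale, here is a back-of-the-envelope sketch. It assumes the 5-minute cadence and the ~5 KB median both hold indefinitely, and it counts only the raw file payload: git object overhead, compression, and whatever storage backend the host uses are ignored. The function names are made up for illustration.

```python
# Back-of-the-envelope estimate of repository growth.
# The 5-minute cadence and 5 KB median come from the comment above;
# holding them constant for a full year is an assumption.
COMMIT_INTERVAL_MIN = 5
MEDIAN_FILE_KB = 5


def commits_per_day(interval_min: int = COMMIT_INTERVAL_MIN) -> int:
    """Number of commits per day at a fixed commit interval."""
    return (24 * 60) // interval_min


def yearly_growth(
    interval_min: int = COMMIT_INTERVAL_MIN,
    median_kb: float = MEDIAN_FILE_KB,
    days: int = 365,
) -> tuple[int, float]:
    """Return (commit count, raw payload in MB) over the given number of days.

    Uses 1 MB = 1000 KB and ignores all git/storage overhead.
    """
    commits = commits_per_day(interval_min) * days
    payload_mb = commits * median_kb / 1000
    return commits, payload_mb


if __name__ == "__main__":
    commits, payload_mb = yearly_growth()
    print(f"{commits_per_day()} commits/day")
    print(f"{commits} commits/year, ~{payload_mb:.1f} MB of new files")
```

Under those assumptions the payload stays modest (hundreds of MB per year), so the commit count, roughly a hundred thousand per year, is probably the number to watch rather than the bytes.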
btown an hour ago
I recall that this became a big problem for the Homebrew project in terms of load on the repo, to the extent that GitHub asked them not to recommend/default-enable shallow clones for their users: https://github.com/Homebrew/brew/issues/15497#issuecomment-1... This is likely to be lower traffic, and the history should (?) scale only linearly with new data, so likely not the worst thing. But it's something to be cognizant of when using SCM software in unexpected ways!
vovavili 2 hours ago
This makes more sense. I still wonder if the author isn't just effectively recreating Apache Iceberg manually here.