I made my own Git (tonystr.net)
125 points by TonyStr 3 hours ago | 50 comments
nasretdinov 38 minutes ago | parent | next [-]

Nice work! On a complete tangent, Git is the only SCM known to me that supports recursive merge strategy [1] (instead of the regular 3-way merge), which essentially always remembers resolved conflicts without you needing to do anything. This is a very underrated feature of Git and somehow people still manage to choose rebase over it. If you ever get to implementing merges, please make sure you have a mechanism for remembering the conflict resolution history :).

[1] https://stackoverflow.com/questions/55998614/merge-made-by-r...

mg794613 3 minutes ago | parent | prev | next [-]

"Though I suck at it, my go-to language for side-projects is always Rust"

Hmm, don't be so hard on yourself!

proceeds to call ls from rust

Ok never mind, although I don't think Rust is the issue here.

(Tony I'm joking, thanks for the article)

teiferer 2 hours ago | parent | prev | next [-]

If you ever wonder how coding agents know how to plan things etc, this is the kind of article they get this training from.

Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.

TonyStr an hour ago | parent | next [-]

Interestingly, I looked at github insights and found that this repo had 49 clones, and 28 unique cloners, before I published this article. I definitely did not clone it 49 times, and certainly not with 28 unique users. It's unlikely that the handful of friends who follow me on github all cloned the repo. So I can only speculate that there are bots scraping new public github repos and training on everything.

Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by chatgpt along with comparisons to similar algorithms).

Phelinofist 17 minutes ago | parent | next [-]

I selfhost Gitea. The instance is crawled by AI crawlers (checked the IPs). They never cloned, they just browse and take it directly from there.

tonnydourado 9 minutes ago | parent | prev | next [-]

Particularly on GitHub, might not even be LLMs, just regular bots looking for committed secrets (AWS keypairs, passwords, etc.)

0x696C6961 5 minutes ago | parent | prev | next [-]

This has been happening before LLMs too.

nerdponx 33 minutes ago | parent | prev [-]

Time to start including deliberate bugs. The correct version is in a private repository.

wasmainiac 2 hours ago | parent | prev | next [-]

Maybe we can poison LLMs with loops of 2 or more self referencing blogs.

jdiff an hour ago | parent [-]

Only need one, they're not thinking critically about the media they consume during training.

falcor84 an hour ago | parent | next [-]

Here's a sad prediction: over the coming few years, AIs will get significantly better at critical evaluation of sources, while humans will get even worse at it.

whstl 25 minutes ago | parent | next [-]

I wish I could disagree with you, but what I'm seeing on average (especially at work) is exactly that: people asking stuff to ChatGPT and accepting hallucinations as fact, and then fighting me when I say it's not true.

prmoustache 20 minutes ago | parent [-]

There is "death by GPS" for people who die after blindly following their GPS instructions. There will definitely be a "death by AI" expression very soon.

topaz0 44 minutes ago | parent | prev [-]

My sad prediction is that LLMs and humans will both get worse. Humans might get worse faster though.

andy_ppp an hour ago | parent | prev [-]

The secret sauce about having good understanding, taste and style (both for coding and writing) has always been in the fine-tuning and RLHF steps. I'd be skeptical that the signals a few GitHub repos or blogs generate at the initial stages of the learning are that critical. There's probably also a filter for good taste on the initial training set, and these sets are so large that not even a single full epoch is done on the data these days.

anu7df an hour ago | parent | prev | next [-]

I understand model output put back into training would be an issue, but if model output is guided by multiple prompts and edited by the author to his/her liking wouldn't that at least be marginally useful?

prodigycorp an hour ago | parent | prev | next [-]

Random aside about training data:

One of the funniest things I've started to notice from Gemini in particular is that in random situations, it writes English with an agreeable affect that I can only describe as... Indian? I've never noticed such a thing leak through before. There must be a ton of people in India who are generating new datasets for training.

blenderob 20 minutes ago | parent [-]

That's very interesting. Can you share any examples that have that agreeable affect?

mexicocitinluez 2 hours ago | parent | prev [-]

> Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.

Great argument for not using AI-assisted tools to write blog posts (especially if you DO use these tools). I wonder how much we're taking for granted in these early phases before it starts to eat itself.

darkryder 2 hours ago | parent | prev | next [-]

Great writeup! It's always fun to learn the details of the tools we use daily.

For others, I highly recommend Git from the Bottom Up[1]. It is a very well-written piece on internal data structures and does a great job of demystifying the opaque git commands that most beginners blindly follow. Best thing you'll learn in 20ish minutes.

1. https://jwiegley.github.io/git-from-the-bottom-up/

spuz an hour ago | parent [-]

Thanks - I think this is the article I was thinking of that really helped me to understand git when I first started using it back in the day. I tried to find it again and couldn't.

holoduke a minute ago | parent | prev | next [-]

I wonder if in the near future there will be no tools anymore in the sense we know them. You will maybe describe the tool you need and it's created on the fly.

sublinear a minute ago | parent | prev | next [-]

> If I were to do this again, I would probably use a well-defined language like yaml or json to store object information.

I know this is only meant to be an educational project, but please avoid yaml (especially for anything generated). It may be a superset of json, but that should strongly suggest that json is enough.

I am aware I'm making a decade old complaint now, but we already have such an absurd mess with every tool that decided to prefer yaml (docker/k8s, swagger, etc.) and it never got any better. Let's not make that mistake again.

People just learned to cope or avoid yaml where they can, and luckily these are such widely used tools that we have plenty of boilerplate examples to cheat from. A new tool lacking docs or examples that only accepts yaml would be anywhere from mildly frustrating to borderline unusable.

jrockway 6 minutes ago | parent | prev | next [-]

sha256 is a very slow algorithm, even with hardware acceleration. BLAKE3 would probably make a noticeable performance difference.

Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/

It is really hard to describe how slow sha256 is. Go sha256 some big files. Do you think it's disk IO that's making it take so long? It's not, you have a super fast SSD. It's sha256 that's slow.
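If you want to try this yourself, here is a rough sketch in Python (not from the article). It hashes a buffer that is already in memory, so disk IO is out of the picture. BLAKE3 is not in Python's standard library, so BLAKE2b stands in as the comparison here; the function name `throughput_mib_s` is made up for illustration, and the numbers will vary a lot by CPU and OpenSSL build.

```python
import hashlib
import time


def throughput_mib_s(algorithm: str, data: bytes) -> float:
    """Hash `data` once and return the observed speed in MiB/s."""
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    hasher.update(data)
    hasher.hexdigest()
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small inputs.
    return len(data) / (1024 * 1024) / max(elapsed, 1e-9)


if __name__ == "__main__":
    payload = bytes(16 * 1024 * 1024)  # 16 MiB of zeros, already in memory
    for name in ("sha256", "blake2b"):
        print(f"{name}: {throughput_mib_s(name, payload):.0f} MiB/s")
```

Comparing the printed figures against your SSD's sequential read speed gives a feel for whether hashing or IO is the bottleneck on your machine.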

EdSchouten 2 minutes ago | parent | next [-]

It depends on the architecture. On ARM64, SHA-256 tends to be faster than BLAKE3. The reasons being that most modern ARM64 CPUs have native SHA-256 instructions, and lack an equivalent of AVX-512.

grumbelbart2 3 minutes ago | parent | prev [-]

Is that even when using the SHA256 hardware extensions? https://en.wikipedia.org/wiki/SHA_instruction_set

sluongng 2 hours ago | parent | prev | next [-]

Zstd dictionary compression is essentially how Meta's Mercurial fork (Sapling VCS) stores blobs https://sapling-scm.com/docs/dev/internals/zstdelta. The source code is available in GitHub if folks want to study the tradeoffs vs git delta-compressed packfiles.

I think theoretically, Git delta-compression is still a lot more optimized for smaller repos. But for bigger repos where sharding storage is required, path-based delta dictionary compression does much better. Git recently (in the last year) got something called "path-walk" which is fairly similar though.
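To illustrate the general idea of dictionary compression, here is a small Python sketch. It is not Sapling's implementation and it does not use zstd: it uses zlib's preset-dictionary support (`zdict`) as a stand-in, because that ships with the standard library. The previous version of a file is used as the dictionary for the next version; the file contents and helper names are invented for the example.

```python
import hashlib
import zlib


def compress(data: bytes, dictionary: bytes = b"") -> bytes:
    """Deflate `data`, optionally primed with a preset dictionary."""
    if dictionary:
        compressor = zlib.compressobj(zdict=dictionary)
    else:
        compressor = zlib.compressobj()
    return compressor.compress(data) + compressor.flush()


def decompress(blob: bytes, dictionary: bytes = b"") -> bytes:
    """Inverse of `compress`; the same dictionary must be supplied."""
    if dictionary:
        decompressor = zlib.decompressobj(zdict=dictionary)
    else:
        decompressor = zlib.decompressobj()
    return decompressor.decompress(blob) + decompressor.flush()


# Two versions of a hypothetical file: 100 hard-to-compress lines,
# then the same content with one extra line appended.
lines = [hashlib.sha256(str(i).encode()).hexdigest() for i in range(100)]
old_version = "\n".join(lines).encode()
new_version = old_version + b"\none more line\n"

plain = compress(new_version)
primed = compress(new_version, dictionary=old_version)

# Sizes: raw, compressed alone, compressed against the old version.
print(len(new_version), len(plain), len(primed))
```

The catch is the same as with any delta scheme: to read the new version back you need the exact dictionary it was compressed against, so the storage layer has to track which base each blob depends on.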

p4bl0 an hour ago | parent | prev | next [-]

Nice post :). It made me think of ugit: DIY Git in Python [1] which is still by far my favorite of this kind of posts. It really goes deep into Git internals while managing to stay easy to follow along the way.

[1] https://www.leshenko.net/p/ugit/

TonyStr an hour ago | parent [-]

This page is beautiful!

Bookmarked for later

h1fra an hour ago | parent | prev | next [-]

Learning git internals was definitely the moment it became clear to me how efficient and smart git is.

And this way of versioning can be reused in other fields: as soon as you have some kind of graph of data that can be modified independently but read all together, it makes sense.

igorw an hour ago | parent | prev | next [-]

Random but y'all might enjoy. Git client in PHP, supports reading packfiles, reftables, diff via LCS. Written by hand.

https://github.com/igorwwwwwwwwwwwwwwwwwwww/gipht-horse

nasretdinov an hour ago | parent [-]

Nice! This repo is a huge W for PHP I'd say.

P.S. Didn't know that plain '@' can be used instead of HEAD, but I guess it makes sense since you can omit both left and right parts of the expressions separated by '@'

eru an hour ago | parent | prev | next [-]

> These objects are also compressed to save space, so writing to and reading from .git/objects/ will always involve running a compression algorithm. Git uses zlib to compress objects, but looking at competitors, zstd seemed more promising:

That's a weird thing to put so close to the start. Compression is about the least interesting aspect of Git's design.
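For reference, this is roughly what the quoted step looks like for a loose blob in stock Git (SHA-1 object IDs and zlib, not the sha256 and zstd the article chose). It is a minimal Python sketch; the helper names are mine.

```python
import hashlib
import zlib
from pathlib import Path


def encode_blob(content: bytes) -> tuple[str, bytes]:
    """Return (object id, compressed bytes) for a loose Git blob."""
    # Git hashes a small header plus the content: "blob <size>\0<content>".
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    # The same header+content bytes are what gets zlib-compressed on disk.
    return object_id, zlib.compress(store)


def loose_object_path(git_dir: Path, object_id: str) -> Path:
    """First two hex digits form the directory, the rest the file name."""
    return git_dir / "objects" / object_id[:2] / object_id[2:]


object_id, compressed = encode_blob(b"hello world\n")
print(object_id)  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(loose_object_path(Path(".git"), object_id).as_posix())
```

The printed ID should match what `git hash-object` reports for a file containing `hello world` followed by a newline.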

alphabetag675 an hour ago | parent [-]

When you are learning, everything is important. I think it is okay to cut the person some slack regarding this.

sneela 2 hours ago | parent | prev | next [-]

> If you want to look at the code, it's available on github.

Why not tvc-hub :P

Jokes aside, great write up!

TonyStr an hour ago | parent [-]

haha, maybe that's the next project. It did feel weird to make git commits at the same time as I was making tvc commits

kgeist 2 hours ago | parent | prev | next [-]

>The hardest part about this project was actually just parsing.

How about using sqlite for this? Then you wouldn't need to parse anything, just read/update tables. Fast indexing out of the box, too.
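A rough sketch of what I mean, in Python with the standard `sqlite3` module. The table layout, column names and helper functions are all invented for illustration; this is not how the article's tool or Fossil stores things.

```python
import hashlib
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a content-addressed object table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        " hash TEXT PRIMARY KEY,"  # the primary key gives the index for free
        " kind TEXT NOT NULL,"
        " data BLOB NOT NULL)"
    )
    return conn


def put(conn: sqlite3.Connection, kind: str, data: bytes) -> str:
    """Store an object under the SHA-256 of its kind and content."""
    digest = hashlib.sha256(kind.encode() + b"\0" + data).hexdigest()
    # Identical content hashes to the same key, so a repeat insert is a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO objects (hash, kind, data) VALUES (?, ?, ?)",
        (digest, kind, data),
    )
    conn.commit()
    return digest


def get(conn: sqlite3.Connection, digest: str) -> tuple[str, bytes]:
    """Fetch (kind, data) by hash; raises KeyError if it is missing."""
    row = conn.execute(
        "SELECT kind, data FROM objects WHERE hash = ?", (digest,)
    ).fetchone()
    if row is None:
        raise KeyError(digest)
    return row[0], row[1]


store = open_store()
blob_id = put(store, "blob", b"fn main() {}\n")
print(get(store, blob_id))
```

Trees and commits could get their own tables with real columns (parent hash, author, timestamp), which is where the parsing would disappear.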

grenran 2 hours ago | parent | next [-]

that would be what https://fossil-scm.org/ is

TonyStr 2 hours ago | parent [-]

Very interesting. Looks like fossil has made some unique design choices that differ from git[0]. Has anyone here used it? I'd love to hear how it compares.

[0] https://fossil-scm.org/home/doc/trunk/www/fossil-v-git.wiki#...

smartmic 2 hours ago | parent | next [-]

I use Fossil extensively, but only for personal projects. There are specific design conditions, such as no rebasing [0], and overall, it is simpler yet more useful to me. However, I think Fossil is better suited for projects governed under the cathedral model than the bazaar model. It's great for self-hosting, and the web UI is excellent not only for version control, but also for managing a software development project. However, if you want a low barrier to integrating contributions, Fossil is not as good as the various Git forges out there. You have to either receive patches or Fossil bundles via email or forum, or onboard/register contributors as developers with quite wide repo permissions.

[0]: https://fossil-scm.org/home/doc/trunk/www/rebaseharm.md

toyg 30 minutes ago | parent [-]

Sounds like a more modern cvs/Subversion

embedding-shape 2 hours ago | parent | prev | next [-]

Used it on and off mainly to check it out, but always in a personal/experimental capacity. Never managed to convince any teams to give it a try, mostly because git doesn't tend to get in the way, so it's hard to justify learning something completely new.

I really enjoy how local-first it is, as someone who sometimes works without an internet connection. That the data around "work" is part of the SCM as well, not just the code, makes a lot of sense to me at a high level, and many times I wish git worked the same...

usrbinbash 2 hours ago | parent [-]

I mean, git is just as "local-first" (a git repo is just a directory after all), and the standard git-toolchain includes a server, so...

But yeah, fossil is interesting, and it's a crying shame it's not better known, for the exact reasons you point out.

embedding-shape an hour ago | parent [-]

> I mean, git is just as "local-first" (a git repo is just a directory after all), and the standard git-toolchain includes a server, so...

It isn't though, Fossil integrates all the data around the code too in the "repository", so issues, wiki, documentation, notes and so on are all together, not like in git where most commonly you have those things on another platform, or you use something like `git notes` which has maybe 10% of the features of the respective Fossil feature.

It might be useful to scan through the list of features of Fossil and dig into it, because it does a lot more than you seem to think :) https://fossil-scm.org/home/doc/trunk/www/index.wiki

graemep an hour ago | parent | prev [-]

I like it but the problem is everyone else already knows git and everything integrates with git.

It is very easy to self host.

Not having staging is awkward at first but works well once you get used to it.

I prefer it for personal projects. I think it's better for small teams if people are willing to adjust, but I have not had enough opportunities to try it.

TonyStr 3 minutes ago | parent [-]

Is it possible to commit individual files, or specific lines, without a staging area? I guess this might be against Fossil's ethos, and you're supposed to just commit everything every time?

keybored 32 minutes ago | parent | prev [-]

The original reason was that Torvalds thought using the filesystem was better.

heckelson 2 hours ago | parent | prev | next [-]

gentle reminder to set your website's `<title>` to something descriptive :)

TonyStr an hour ago | parent [-]

haha, thank you. Added now :-)

prakhar1144 2 hours ago | parent | prev [-]

I was also playing around with the ".git" directory - ended up writing:

"What's inside .git ?" - https://prakharpratyush.com/blog/7/