azinman2 6 hours ago

Why not just use line numbers?

renewiltord 6 hours ago | parent | next [-]

Forces you to read after every write. E.g. you edit line 15 to become two lines. Now you either need arithmetic to fix up references to every later line, or you need to re-read the full file to reindex by line number.
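The drift problem described above can be sketched in a few lines (hypothetical file contents, just for illustration):

```python
# Sketch of why plain line numbers go stale after an edit.
lines = ["def f():", "    return 1", "print(f())"]

# An edit turns line 2 (1-based) into two lines...
lines[1:2] = ["    x = 1", "    return x"]

# ...so any stored reference to a later line is now off by one:
# "print(f())" was line 3 before the edit, line 4 after it.
assert lines.index("print(f())") + 1 == 4
```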

azinman2 6 hours ago | parent [-]

Good point!

I just wonder how unique these hashes will be if only 2 characters. It seems like the collision rate would be really high.
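A quick birthday-bound estimate backs up the worry, assuming random 2-character tags over a 36-symbol alphabet (an assumption, not necessarily the tool's actual scheme):

```python
import math

# Collision estimate for random 2-character hashes over an assumed
# 36-symbol alphabet (a-z, 0-9).
buckets = 36 ** 2  # 1296 possible 2-char tags
n = 100            # lines in a modest file

# Birthday approximation: P(>=1 collision) ~ 1 - exp(-n(n-1) / (2*buckets))
p = 1 - math.exp(-n * (n - 1) / (2 * buckets))
print(f"{p:.2f}")  # well over 0.9 for only 100 lines
```

So random short hashes collide almost surely; the tags would have to be assigned uniquely rather than drawn at random to stay collision-free.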

aghilmort 4 hours ago | parent | next [-]

we dug into those sorts of questions with hypertokens, a robust hash for lines, code, tables/rows, or any in-context token tagging, to give models photographic memory

one mechanism we establish is that each model has a fidelity window, i.e., r tokens of content per s tag tokens; each tag token adds extra GUID-like marker capacity via its embedding vector; since 1-, 2-, and 3-digit numbers are only one token in top models, a single hash token lacks enough capacity & separation in latent space

we also show the hash should be properly prefix-free, i.e., unique symbols per digit, e.g., if using A-K & L-Z to hash then A,R is a legal hash whereas M,C is not
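Reading that scheme as "position 0 draws from A-K, position 1 from L-Z" (my interpretation of the comment, not the paper's exact construction), a validator is a one-liner:

```python
# Per-position alphabets: each digit of the tag has its own symbol set,
# so a symbol unambiguously identifies which position it occupies.
ALPHABETS = ["ABCDEFGHIJK", "LMNOPQRSTUVWXYZ"]

def is_valid_tag(tag):
    # A tag is valid iff it has one symbol per position, each drawn
    # from that position's alphabet.
    return len(tag) == len(ALPHABETS) and all(
        c in a for c, a in zip(tag, ALPHABETS))

assert is_valid_tag("AR")        # A from A-K, R from L-Z: legal
assert not is_valid_tag("MC")    # M is not in A-K: not permitted
```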

we can do all this & more rather precisely, as we show in our arXiv paper on the same; the next update goes deeper into group theory, info theory, etc. on boosting model recall, reasoning, tool calls, etc. by way of robust hashing

pbowyer an hour ago | parent [-]

For others, here's the paper: https://arxiv.org/abs/2507.00002

MrGreenTea 3 hours ago | parent | prev [-]

The author writes that these hashes are 2 or 3 characters long, presumably depending on the line count. That's good for almost 48k lines; past that you have other problems anyway.
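The "almost 48k" figure checks out if the tags use a 36-symbol alphabet (26 letters + 10 digits) and are assigned uniquely rather than hashed, both assumptions on my part:

```python
# Tag capacity under an assumed 36-symbol alphabet (a-z, 0-9).
two_chars = 36 ** 2    # 1,296 distinct 2-char tags
three_chars = 36 ** 3  # 46,656 distinct 3-char tags -- "almost 48k"
assert three_chars == 46656
```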

azinman2 3 hours ago | parent [-]

But if it’s a hash rather than a line number, collisions become much more likely.

There may be many lines that are duplicates, e.g. “{”
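A pure content hash does collide on duplicate lines by construction. One common workaround (an assumption here, not the tool's documented behavior) is to salt each line's hash with an occurrence counter:

```python
import hashlib

def tag_lines(lines, width=2):
    # Salt each line's hash with how many identical lines precede it,
    # so repeated lines like "}" usually get distinct tags (short
    # prefixes can still collide by bad luck).
    seen = {}
    tags = []
    for line in lines:
        n = seen.get(line, 0)
        seen[line] = n + 1
        digest = hashlib.sha256(f"{n}:{line}".encode()).hexdigest()
        tags.append(digest[:width])
    return tags

lines = ["if (x) {", "}", "if (y) {", "}"]
tags = tag_lines(lines)
print(tags)
```

The tags stay stable across re-runs because they depend only on content and occurrence order, not absolute position, which is what makes them survive edits better than line numbers.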

giancarlostoro 6 hours ago | parent | prev [-]

I was wondering the same thing.