Tell me why this UUID pattern is stupid
4 points by yakkasean 7 hours ago | 3 comments

Consider multiple databases with UUIDv7 primary keys, with Debezium + Kafka pushing updates from each to the others, and simple consumers that just apply inserts, deletes and updates from the Kafka queue. You'll end up with at least one round trip per operation, and infinite loops if your Debezium connector doesn't drop messages where before and after are identical.

Why not then implement a different UUID spec, similar to v7, in which the last 62 or 60 bits are a combination of random data and an identifier of the datacenter (say, the first 30 bits of a hash of the datacenter name)?
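To make the layout concrete, here is a rough Python sketch of such an ID. Everything specific in it is an assumption for illustration, not part of any published spec: the version nibble is left at 7, the datacenter tag sits in the top 30 of the trailing 62 bits, the tag comes from SHA-256 of the datacenter name, and the names `datacenter_tag`, `new_id` and `tag_of` are made up.

```python
import hashlib
import secrets
import time
import uuid
from typing import Optional

DC_BITS = 30              # bits taken from the hash of the datacenter name
RAND_BITS = 62 - DC_BITS  # remaining random bits in the trailing 62 bits
VERSION = 7               # assumption: keep the v7 version nibble


def datacenter_tag(name: str) -> int:
    """First DC_BITS bits of SHA-256(name), as an integer (hash choice is an assumption)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - DC_BITS)


def new_id(datacenter: str, now_ms: Optional[int] = None) -> uuid.UUID:
    """Build a v7-shaped UUID whose trailing 62 bits are tag + random data."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    ts = now_ms & ((1 << 48) - 1)        # 48-bit Unix timestamp in ms
    rand_a = secrets.randbits(12)        # 12 random bits after the version
    tail = (datacenter_tag(datacenter) << RAND_BITS) | secrets.randbits(RAND_BITS)
    value = (ts << 80) | (VERSION << 76) | (rand_a << 64) | (0b10 << 62) | tail
    return uuid.UUID(int=value)


def tag_of(u: uuid.UUID) -> int:
    """Extract the datacenter tag back out of an ID built by new_id."""
    return (u.int >> RAND_BITS) & ((1 << DC_BITS) - 1)


if __name__ == "__main__":
    u = new_id("us-east-1")
    print(u, tag_of(u) == datacenter_tag("us-east-1"))
```

Because the timestamp still occupies the leading 48 bits, these IDs keep the same rough time ordering as v7; only the tail loses some randomness to the tag.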

Then in your Debezium connectors, simply add a plugin that takes the datacenter name as a config option, parses the UUID of every payload's record, and drops those whose datacenter bits do not match its config, thereby eliminating round trips and also the risk of infinite loops.
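The drop rule itself is small. The sketch below is only the predicate, written in Python for brevity; it is not a Debezium plugin and does not use Debezium's API. It assumes the change event has already been decoded into a dict with `before` and `after` row images, that the primary key column is called `id`, and that the tag occupies bits 32 to 61 of the UUID and is derived from SHA-256 of the datacenter name. The function names are hypothetical.

```python
import hashlib
import uuid

DC_BITS = 30    # assumed width of the datacenter tag
RAND_BITS = 32  # assumed random bits below the tag


def datacenter_tag(name: str) -> int:
    """First DC_BITS bits of SHA-256(name) (hash choice is an assumption)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - DC_BITS)


def should_forward(event: dict, local_datacenter: str, pk_field: str = "id") -> bool:
    """Return True only for records whose ID was minted in the local datacenter."""
    # Deletes carry the row in "before"; creates and updates carry it in "after".
    row = event.get("after") or event.get("before")
    if not row or pk_field not in row:
        return False
    record_id = uuid.UUID(str(row[pk_field]))
    tag = (record_id.int >> RAND_BITS) & ((1 << DC_BITS) - 1)
    return tag == datacenter_tag(local_datacenter)
```

One consequence of filtering on the ID alone: an update made locally to a row that was created in another datacenter would also be dropped, so this only fits setups where each row is written solely in the datacenter that created it.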

There's a somewhat greater risk of collision, but as long as you're not taking tens of thousands of inserts per ms, it should be fine. Maybe this is a poor man's version of something that has a more robust spec, but it seems viable to me.
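A quick way to sanity-check the collision risk is the birthday approximation. The sketch below assumes that 30 of the trailing 62 bits go to the datacenter tag, leaving 32 random bits there plus the 12 random bits that v7 keeps after the version nibble, so 44 random bits per millisecond per datacenter. The function name is made up.

```python
import math


def collision_probability(ids_per_ms: int, random_bits: int) -> float:
    """Birthday approximation: chance that at least two IDs minted in the
    same millisecond (same datacenter) share all of their random bits."""
    pairs = ids_per_ms * (ids_per_ms - 1) / 2
    return -math.expm1(-pairs / 2 ** random_bits)


if __name__ == "__main__":
    # Assumed split: 12 rand_a bits + 32 trailing random bits.
    for rate in (100, 1_000, 10_000, 100_000):
        print(rate, collision_probability(rate, 44))
```

Note that this is a per-millisecond figure. At 10,000 inserts per ms it comes out on the order of 10^-6, which adds up quickly under sustained load, so the "tens of thousands per ms" threshold looks about right as the point where it stops being comfortable.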

benoau 7 hours ago | parent [-]

It sounds similar to Twitter's Snowflake?

https://en.wikipedia.org/wiki/Snowflake_ID

yakkasean 7 hours ago | parent [-]

It does look similar. I'm confused, though, about how they coordinate the sequence bits with only 2^10 (1024) workers. Surely they have more web servers than that, so the sequence must be coordinated in a centralized way. Also, this is a 62-bit spec.

benoau 3 hours ago | parent [-]

There is source code available for it (although the repo's retired):

https://github.com/twitter-archive/snowflake/blob/b3f6a3c6ca...

Basically, one big server had a number of threads, each requesting the next value in the sequence on that machine. The sequence only had to be unique to that machine and was just a counter of how many tweets since <last whole millisecond>. Tweets per millisecond per server was probably never a huge number, so a 12-bit counter was enough, and the 10 machine bits were split between a datacenter id and a worker id.
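For anyone who wants to see the shape of it, here is a minimal Python sketch of that scheme. It is not Twitter's code (theirs is Scala); it assumes the commonly documented layout of a 41-bit millisecond timestamp, 5 datacenter bits, 5 worker bits and a 12-bit per-millisecond sequence, with the custom epoch 1288834974657. The class name and the injectable `clock` parameter are made up for illustration.

```python
import threading
import time

TWEPOCH_MS = 1288834974657  # custom epoch, in ms
DATACENTER_BITS = 5
WORKER_BITS = 5
SEQUENCE_BITS = 12
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1


class SnowflakeSketch:
    def __init__(self, datacenter_id: int, worker_id: int, clock=None):
        if not 0 <= datacenter_id < (1 << DATACENTER_BITS):
            raise ValueError("datacenter_id out of range")
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._datacenter_id = datacenter_id
        self._worker_id = worker_id
        self._clock = clock or (lambda: time.time_ns() // 1_000_000)
        self._lock = threading.Lock()  # threads on one machine share the counter
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                # Same millisecond: bump the local counter.
                self._sequence = (self._sequence + 1) & SEQUENCE_MASK
                if self._sequence == 0:
                    # Counter exhausted: spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                # New millisecond: the counter restarts at zero.
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - TWEPOCH_MS) << (DATACENTER_BITS + WORKER_BITS + SEQUENCE_BITS))
                | (self._datacenter_id << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

No cross-machine coordination is needed: the datacenter and worker bits keep machines apart, and the counter only has to be unique within one process for one millisecond.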