sedatk 3 days ago

Whenever ULID comes up, I need to remind people that its spec includes a sequential ID generation mode which is prone to conflicts across threads, processes, or hosts, which defeats the purpose of a "universal" identifier. If you need a sequential ID, just use an integer, preferably one that's autoincremented by the database.

It's best to stick to UUIDv7 because of such quirks of ULID.

cpburns2009 3 days ago | parent | next [-]

> I need to remind people that its spec includes a sequential ID generation mode which is prone to conflicts across threads, processes, or hosts, which defeats the purpose of a "universal" identifier.

Can you expand on how this can actually cause a problem? My understanding is different processes and hosts should never conflict because of the 80 bits of random data. The only way I can conceive of a conflict is multiple threads using the same non-thread-safe generator during the same millisecond.

sedatk 3 days ago | parent [-]

You're right, not hosts or processes in that case. I forgot about the random part, as it's been a while since I looked at it. However, a single instance of a ULID generator must support this mode, which means that on multi-threaded architectures it must lock the sequence, since it still uses a single random value. That, again, defeats the purpose of client-side, lock-free generation of universal identifiers, as you said.
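
To make the locking concrete, here is a minimal sketch of a shared monotonic generator. It is not taken from any particular ULID library: the class name `MonotonicGenerator`, returning a 128-bit integer instead of the Base32 string, and treating a backwards clock like "same millisecond" are all assumptions for illustration.

```python
import secrets
import threading
import time

RANDOM_BITS = 80
RANDOM_MASK = (1 << RANDOM_BITS) - 1


class MonotonicGenerator:
    """Hypothetical ULID-style generator: 48-bit ms timestamp + 80-bit tail."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_ms = -1
        self._last_tail = 0

    def next_id(self) -> int:
        now_ms = time.time_ns() // 1_000_000
        # The shared (last_ms, last_tail) pair is what forces the lock.
        with self._lock:
            if now_ms <= self._last_ms:
                # Same millisecond (or, by assumption, a clock that went
                # backwards): increment the tail instead of redrawing it.
                now_ms = self._last_ms
                tail = self._last_tail + 1
                if tail > RANDOM_MASK:
                    raise OverflowError("80-bit tail overflowed in one millisecond")
            else:
                tail = secrets.randbits(RANDOM_BITS)
            self._last_ms, self._last_tail = now_ms, tail
            return (now_ms << RANDOM_BITS) | tail
```

Without the lock, two threads could read the same `_last_tail` and hand out the same ID; with it, every caller serializes on one mutex.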

0x457 3 days ago | parent | next [-]

You only need to lock sequence if you care about IDs being ordered within a millisecond. That generally only matters when you create a batch of IDs at once. In that case you don't need to lock anything: generate a ULID, then keep incrementing the sequence within that batch, either by doing it on the same thread or by moving it from thread to thread. Kinda like creating an iterator and zip'ing it with an iterator of the things you need IDs for.
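
A rough sketch of that iterator idea, under some assumptions: `id_batch` is a made-up name, IDs are plain 128-bit integers rather than encoded ULID strings, and the top bit of the random tail is left clear so increments can't overflow.

```python
import secrets
import time
from typing import Iterator


def id_batch() -> Iterator[int]:
    """Yield ordered 128-bit IDs sharing one timestamp and one random start."""
    ts_ms = time.time_ns() // 1_000_000
    # Assumption: draw 79 bits so the 80-bit tail has headroom for increments.
    tail = secrets.randbits(79)
    while True:
        yield (ts_ms << 80) | tail
        tail += 1


# Each batch owns its own iterator, so there is no shared state to lock.
events = ["created", "renamed", "archived"]
tagged = list(zip(id_batch(), events))
```

The iterator can be handed from one thread to another, as long as only one thread advances it at a time.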

I've switched to using UUIDv7 tho. It made sense to use ULID before v7, but now ULID only has one thing going for it: a smaller string representation. That doesn't matter if your storage can store UUIDs natively (i.e. as a 128-bit integer).

If your goal is to have global order intact, then neither ULID nor UUIDv7 is going to work for you.

sedatk 3 days ago | parent | next [-]

> You only need to lock sequence if you care about IDs being ordered within a millisecond

Yes, and that's the only time sequences are used. I guess that's to avoid hogging the CPU or emptying the OS entropy pool during high loads.

However, that "optimization" is a failure mode if you're not aware how ULID internals work. It's easy to shoot yourself in the foot by blindly trusting ULID will always generate a unique ID across threads without blocking your thread. That's a sneaky footgun.

> That generally only matters when you create a batch of IDs at once

No, any web service instance can receive requests at arbitrary times, and sometimes in the same millisecond zone. The probability is proportional to the number of concurrent users and requests.

> If your goal is to have global order intact, then neither ULID nor UUIDv7 is going to work for you.

Agreed.

0x457 13 hours ago | parent | next [-]

> No, any web service instance can receive requests at arbitrary times, and sometimes in the same millisecond zone. The probability is proportional to the number of concurrent users and requests.

Yes, but does it matter that you have out-of-order IDs within the same ms for concurrent requests? That's why I said batch. It has only ever been an issue for me when I chose ULID as the ID for an event log (if a command produced more than one event, the random bits would ruin the order).

> However, that "optimization" is a failure mode if you're not aware how ULID internals work.

That's not ULID internals, that's whatever library you're using. The Rust implementation I've used, for example, will generate random bits unless you explicitly increment, and that requires `&mut`.

jasonwatkinspdx 3 days ago | parent | prev [-]

> or emptying the OS entropy pool during high loads.

Just a heads up: that's not really a thing. If the CSPRNG is initialized correctly, you're done. There's nothing being depleted. I know for ages the Linux docs said otherwise; they were just wrong, and a maintainer was keeping a weird little fiefdom over it.

sedatk 3 days ago | parent [-]

Thanks for the heads up, then it’s one less reason for ULID to adopt this weird behavior.

vbezhenar 3 days ago | parent | prev [-]

I hope that's not literally incrementing a sequence. Because it would lead to trivial neighbor ID guessing attacks.

I've implemented this thing, though I didn't call it ULID. I dedicated some bits to the timestamp, some bits to a counter within the millisecond, and the rest to randomness. So they're always ordered and always unpredictable.

Another approach is to keep the latest generated UUID and, if a new UUID is requested within the same timestamp, regenerate the random part until it's greater than the previous one. I think that's a pretty good approach as well.
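
A minimal sketch of the first layout, for the curious. The 48/12/68 bit split and the name `CounterGenerator` are assumptions made up for this example, not the widths I actually used, and the class is not thread-safe as written.

```python
import secrets
import time

COUNTER_BITS = 12  # assumed width: up to 4096 IDs per millisecond
RANDOM_BITS = 68   # 48 (timestamp) + 12 (counter) + 68 (random) = 128


class CounterGenerator:
    """Hypothetical timestamp | counter | random ID generator."""

    def __init__(self):
        self._last_ms = -1
        self._counter = 0

    def next_id(self) -> int:
        now_ms = time.time_ns() // 1_000_000
        if now_ms > self._last_ms:
            # New millisecond: restart the counter.
            self._last_ms, self._counter = now_ms, 0
        else:
            # Same millisecond: the counter, not the random part, keeps order.
            self._counter += 1
            if self._counter >= 1 << COUNTER_BITS:
                raise OverflowError("counter exhausted within one millisecond")
        high = (self._last_ms << COUNTER_BITS) | self._counter
        # Fresh random bits every time, so neighbors stay unpredictable.
        return (high << RANDOM_BITS) | secrets.randbits(RANDOM_BITS)
```

Ordering comes entirely from the timestamp and counter bits, so knowing one ID tells you nothing about the low 68 bits of the next.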

sedatk 3 days ago | parent | next [-]

> I hope that's not literally incrementing a sequence

It's literally incrementing it by one:

https://github.com/ulid/javascript/blob/11c2067821ee19e4dc78...

https://github.com/ulid/javascript/blob/11c2067821ee19e4dc78...

vbezhenar 3 days ago | parent [-]

Well, that makes little sense to me; you can just use a numeric identifier instead. Bulk inserts that generate identifiers in bulk are commonly used.

But that's easy to fix, so just implementation quirk for this particular library, the idea is sound.

sedatk 3 days ago | parent [-]

> But that's easy to fix, so just implementation quirk for this particular library, the idea is sound.

It's in the ULID spec.

jasonwatkinspdx 3 days ago | parent | prev [-]

> I hope that's not literally incrementing a sequence. Because it would lead to trivial neighbor ID guessing attacks.

It is and it does.

Also the ULID spec suggests you use a CSPRNG, but doesn't mandate that or provide specific advice on appropriate algorithms. So in practice people may reach for whatever hash function is convenient in their project, which may just be FNV or similar with considerably weaker randomness too.

cpburns2009 3 days ago | parent | prev [-]

If you really need lock-free generation, you can use an alternate generator that uses new random bits for every submillisecond id. That's what the `ulid-py` library for Python does by default instead of incrementing the random bits.
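
As a rough illustration of the stateless approach (this is a sketch of the idea, not `ulid-py`'s actual code; `random_ulid_int` is a made-up name and it returns a 128-bit integer rather than an encoded string):

```python
import secrets
import time


def random_ulid_int() -> int:
    """128-bit ULID-shaped integer: 48-bit ms timestamp + 80 fresh random bits."""
    ts_ms = time.time_ns() // 1_000_000
    # No shared state between calls, so nothing needs a lock.
    return (ts_ms << 80) | secrets.randbits(80)
```

The trade-off is that IDs created within the same millisecond are no longer ordered relative to each other.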

sedatk 3 days ago | parent [-]

Yes, the problem is that this mode is supported and required per the spec. So, a developer must know the pros/cons of this mode. It requires them to correctly assess the consequences. It's quite easy to shoot themselves in the foot, especially when a solid alternative like UUIDv7 exists.

unscaled 3 days ago | parent | prev | next [-]

The monotonic behavior is not the default, but I would also be happier if it was removed from the spec or at least marked with all the appropriate warning signs on all the libraries implementing it.

But I don't think UUIDv7 solves the issue by "having less quirks". Just like you'd have to be careful to use the non-monotonic version of ULID, you'd have to be careful to use the right version of UUID. You also have to hope that all of your UUID consumers (which would almost invariably try to parse or validate the UUID, even if they do nothing with it) support UUIDv7 or don't throw on an unknown version.

sedatk 3 days ago | parent [-]

UUIDv7 is the closest to ULID as both are timestamp based, and UUIDv7 has fewer quirks than ULID, no question about it.

I agree that picking UUID variant requires caution, but when someone has already picked ULID, UUIDv7 is easily a superior alternative.

skeledrew 3 days ago | parent | prev | next [-]

Actually dived into this a bit just a couple days ago. It's very nearly impossible for there to be a conflict since the timestamp resolves at the microsecond level, and if it's among threads, then there's global state that, should it somehow be hit 2+ times in the same microsecond, ensures detection, and the random portion is incremented.

listenallyall 3 days ago | parent | prev | next [-]

Under what circumstances is it prone to conflicts? On separate threads/hosts/processes, IDs created within the same millisecond would be differentiated by the 80 bits of randomness (more than UUIDv7).

jasonwatkinspdx 3 days ago | parent | next [-]

No, ULID has a "monotonic" feature, where if it detects the same millisecond timestamp in back to back calls, it just increments the 80 bit "random" portion. This means it has convoying behavior. If two machines are generating ids independently and happen to choose initial random positions near each other, the probability of collision is much higher than the basic birthday bound.

I think this "sort of monotonic but not really" is the worst of both to be honest. It tempts you to treat it like an invariant when it isn't.

If you want monotonicity with independent generation, just use a composite key that's a Lamport clock and a random nonce. Or if you want to be even more snazzy, use Hybrid Logical Clocks or similar.
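
A bare-bones sketch of the composite key idea, assuming a 64-bit nonce and a made-up `LamportKeyGen` class. A real system would also need to persist the counter and guard it if it's shared between threads.

```python
import secrets
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LamportKeyGen:
    """Hypothetical per-node generator of (lamport_counter, nonce) keys."""

    counter: int = 0

    def next_key(self) -> Tuple[int, int]:
        # Local event: tick the logical clock, then attach a random nonce
        # so two nodes with equal counters still get distinct keys.
        self.counter += 1
        return (self.counter, secrets.randbits(64))

    def observe(self, remote_counter: int) -> None:
        # After seeing another node's key, never fall behind its counter.
        self.counter = max(self.counter, remote_counter)


node_a, node_b = LamportKeyGen(), LamportKeyGen()
first = node_a.next_key()
node_b.observe(first[0])
second = node_b.next_key()  # sorts after `first` because its counter is larger
```

Keys compare as tuples, so anything generated after observing a key is guaranteed to sort after it, and the nonce only breaks ties between unrelated nodes.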

Dylan16807 3 days ago | parent | next [-]

> If two machines are generating ids independently and happen to choose initial random positions near each other, the probability of collision is much higher than the basic birthday bound.

But the chance of the initial random positions being near each other is very very low.

If you pick a billion random numbers in an 80 bit space, the chance you have a collision is one in a million. (2^80 / (2^30)^2)

If you pick a thousand random starting points and generate a million sequential numbers each, the chance your starting points are sufficiently close to each other to cause an overlap is one in a trillion. ((2^80 / 2^20) / (2^10)^2)

In that one in a trillion case, you'll likely end up with half a million collisions, which might matter to you. But if you care about 0 collisions versus 1+ collisions, pick the monotonic version.
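
Spelling out that arithmetic as a quick check. These are order-of-magnitude estimates with constant factors dropped, and the variable names are just for this example.

```python
SPACE = 2**80

# Case 1: a billion (~2^30) independent random draws.
# Rough birthday estimate: odds of any collision ~ 1 in SPACE / n^2.
n = 2**30
one_in_random = SPACE // n**2  # 2^20, about a million

# Case 2: a thousand (~2^10) random starting points, each followed by
# a million (~2^20) sequential IDs. Two runs overlap only when their starts
# land within ~2^20 of each other, so the space shrinks to 2^80 / 2^20 slots.
starts, run = 2**10, 2**20
one_in_sequential = (SPACE // run) // starts**2  # 2^40, about a trillion

print(one_in_random, one_in_sequential)  # 1048576 1099511627776
```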

jasonwatkinspdx 3 days ago | parent [-]

Right, but the point is there's no reason to accept this limitation. Likewise why hardcode millisecond scale timestamps in a world where billions of inserts per second are practical on a single server?

Or if what you want is monotonic distributed timestamps, again, HLC is how you do that properly.

So you're embracing these weird limitations for no real benefit.

And as you can see in the rest of this comment thread, a lot of people simply do not even know about this behavior and are assuming the 80 bit portion is always random. Which is my whole point: having an invariant that isn't really an invariant is just a bad way to go, fundamentally.

Edit: just to reply to the below since I can't do so directly, I understand the arithmetic here. What I'm saying is there's zero reason to choose this weird scheme vs something that's just the full birthday bound and you never think about it again.

As another comment points out: just consider neighbor guessing attacks. This 80 bit random but not random field is a foot gun to anyone that just assumes they can treat it as truly random.

listenallyall 3 days ago | parent | next [-]

It's not a "limitation". He's saying there is a much, much, much lower chance of having any collisions with monotonic ULIDs: one in a trillion vs. one in a million.

Dylan16807 3 days ago | parent | prev [-]

> why hardcode millisecond scale timestamps in a world where billions of inserts per second are practical on a single server?

> Or if what you want is monotonic distributed timestamps, again, HLC is how you do that properly.

Why not just 64 bit timestamps stapled to a random number? You can be collision proof and monotonic without doing anything fancy.

> this weird scheme vs something that's just the full birthday bound and you never think about it again

But the weird scheme gives you better odds than the birthday bound.

listenallyall 3 days ago | parent | prev [-]

I don't think this holds up. Define "near each other" in an 80-bit random space. Further, the likelihood of a potential conflict is offset by the fact that you have far fewer "initial random positions" instead of every single element defining its own random position. And the extra random bits (over UUIDv7) reduce conflict possibilities by orders of magnitude.

I concede I'm no mathematician and I could be wrong here, but your analysis feels similar to assuming 10-11-12-13-14-15 is less likely to be a winning lottery ticket because the odds against consecutive numbers are so massive.

jasonwatkinspdx 3 days ago | parent [-]

No, the calculation is straightforward and I'm not making the fallacious assumption you say there at the end about a magical lottery ticket number.

My basic point is that the probability of collision is higher than the birthday bound, there's no need for this, and as comments in this thread make clear, people do not understand that this limitation even exists in the specification.

listenallyall 3 days ago | parent [-]

> the calculation is straightforward

Ok then, make it easy - your requirement is to independently pick 4 numbers from the range 0 to 9, without resulting in any duplicates. Which is more likely to be successful:

- pick 4 random digits independently

- pick a random digit, which is automatically followed by the next digit as pick #2 (i.e. if you pick 5, then 6 will automatically be your second digit; if you pick 9, 0 will be your second digit). Then pick once more on the same terms.

The math here is easy: in scenario 1 you have a 0.9 x 0.8 x 0.7 = 0.504 likelihood of success. In scenario 2 it's simply 0.7.
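
If you'd rather not trust the mental math, both scenarios are small enough to enumerate exhaustively. A quick sketch (variable names are just for this example):

```python
from fractions import Fraction
from itertools import product

DIGITS = range(10)

# Scenario 1: four independent random digits; success = all four distinct.
ok1 = sum(len(set(p)) == 4 for p in product(DIGITS, repeat=4))
p_independent = Fraction(ok1, 10**4)

# Scenario 2: two random picks, each followed by its successor (mod 10);
# success = the resulting four digits are all distinct.
ok2 = sum(
    len({a, (a + 1) % 10, b, (b + 1) % 10}) == 4
    for a, b in product(DIGITS, repeat=2)
)
p_sequential = Fraction(ok2, 10**2)

print(float(p_independent), float(p_sequential))  # 0.504 0.7
```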

sedatk 3 days ago | parent | prev [-]

See my sibling comment.

marifjeren 3 days ago | parent | prev | next [-]

> If you need a sequential ID, just use an integer

Are monotonic/sequential ULIDs as easily enumerated as integers? It's the ease of enumerability that keeps a lot of folks away from using sequential integers as IDs

sedatk 3 days ago | parent [-]

You mean someone who wants to attack your system might be discouraged by Base32 encoding?
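
To show how little work that is, here's a sketch that guesses the neighbor of an ID from a generator that incremented within the same millisecond. The alphabet is Crockford Base32, which ULID uses; the helper names and the sample value are just for illustration.

```python
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # Crockford Base32


def decode(ulid: str) -> int:
    """Turn a 26-character ULID string into its 128-bit integer value."""
    value = 0
    for ch in ulid.upper():
        value = (value << 5) | ALPHABET.index(ch)
    return value


def encode(value: int, length: int = 26) -> str:
    """Turn an integer back into a fixed-width Crockford Base32 string."""
    chars = []
    for _ in range(length):
        chars.append(ALPHABET[value & 31])
        value >>= 5
    return "".join(reversed(chars))


def next_neighbor(ulid: str) -> str:
    # If the generator incremented, the next ID is this value plus one.
    return encode(decode(ulid) + 1)


print(next_neighbor("01ARZ3NDEKTSV4RRFFQ69G5FAV"))  # 01ARZ3NDEKTSV4RRFFQ69G5FAW
```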

marifjeren 3 days ago | parent [-]

Sorry, I'm not familiar with the ULID spec. You seem to be, hence my asking. Are you saying monotonic/sequential ULIDs are just (or just as easily enumerated as) Base32-encoded integers?

Oh and yeah, I guess I do think lots of script / AI kiddies would be discouraged by, or fail to see an opportunity when presented with, something that does not look like the numbers they saw in school.

N_Lens 3 days ago | parent | prev [-]

ULID's initial segment is generated from the timestamp, with a random suffix at the end. The kind of collision you're concerned about is not an issue at all across threads, processes, or hosts.

sedatk 3 days ago | parent [-]

Not if the same ULID generator instance is used across threads.

N_Lens 3 days ago | parent [-]

That depends on whether the specific implementation of the generator instance is thread-safe. It's highly implausible that anyone would use the same generator instance across different threads/processes/hosts, because there's no benefit at all and only additional downsides.

sedatk 3 days ago | parent [-]

To implement a thread-safe sequential increment, you need locking. When you use locking, then it becomes a “non-universal” ID generator with arbitrary performance impact.

Either it’s collision-prone or locking. Both are problematic in their own way.

It’s footguns all over while UUIDv7 simply exists.

N_Lens 3 days ago | parent [-]

There is practically no need to have a thread-safe ULID generator that would be shared across threads/processes/hosts - a non-scenario that I cannot envision occurring in practice.