BobbyTables2 2 hours ago

Damn, I’ve never really had to deal with Unicode all that much.

Was already bad enough that instead of bytes, we have to worry about code points. Now even that isn’t enough?

It would have been expensive, but all characters should have been fixed size 64bit values.

usrnm 2 hours ago | parent | next [-]

> It would have been expensive, but all characters should have been fixed size 64bit values

You're making the same mistake that numerous people made before you: thinking that it's as simple as using arrays of large enough numbers. First they thought that two bytes per symbol would be enough, then four. Spoiler alert: it wasn't. And eight won't work either.

201984 28 minutes ago | parent | next [-]

Why wouldn't 8 be enough? Surely 18,446,744,073,709,551,616 characters is enough for every writing system in the world.

usrnm 8 minutes ago | parent [-]

Because that's not how Unicode works. It's not simply a table mapping numbers to all possible symbols.
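
To make that concrete, here is a minimal Python sketch (the variable names and the `code_points` helper are mine, just for illustration). It shows that one user-perceived character can be a sequence of several code points, so a wider fixed-size code unit still would not give one unit per character:

```python
import unicodedata

# Two spellings of the same visible character.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

# One visible emoji built from three emoji joined by ZERO WIDTH JOINER (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"


def code_points(s):
    """Hypothetical helper: list the code points of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]


# len() on a Python str counts code points, not user-perceived characters.
print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False, although they render the same
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

print(code_points(family))              # five code points for one visible glyph
print(len(family.encode("utf-8")))      # 18 bytes in UTF-8
```

Even if every code point were stored in 64 bits, `family` would still be five units long. Splitting text into user-perceived characters needs the grapheme cluster rules, which the Python standard library does not implement.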

bombcar an hour ago | parent | prev [-]

UnicodeV6 - 128 bits per character!

chuckadams 2 hours ago | parent | prev [-]

> It would have been expensive, but all characters should have been fixed size 64bit values.

It would have been a non-starter, and then we'd all be dealing with Shift-JIS, BIG5, and FSM knows how many different codepages to this day. UTF-8 is about as elegant as it gets, though Java and JS still managed to fuck that up too (they both encode every codepoint outside the BMP as surrogate pairs in UTF-8)

chrismorgan an hour ago | parent | next [-]

> Java and JS […] both encode every codepoint outside the BMP as surrogate pairs in UTF-8

I can’t comment on Java, but JS I know reasonably well and I can’t think of any place it uses CESU-8.

dasyatidprime an hour ago | parent | prev [-]

That's called CESU-8. https://www.unicode.org/reports/tr26/tr26-4.html
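
For anyone who wants to see the difference in bytes, a rough Python sketch follows. `cesu8_encode` is a hypothetical helper written for this comment, not a standard codec (Python does not ship one). It encodes each UTF-16 code unit separately with the UTF-8 bit layout, which is my reading of what CESU-8 does for supplementary characters:

```python
def cesu8_encode(text):
    """Sketch of CESU-8: encode each UTF-16 code unit on its own.

    Assumes text holds no lone surrogates (encoding to UTF-16 would raise).
    """
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets surrogate code units through the UTF-8 encoder,
        # producing a 3-byte sequence for each half of a surrogate pair.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


s = "\U0001F600"  # a code point outside the BMP

print(s.encode("utf-8").hex(" "))   # f0 9f 98 80
print(cesu8_encode(s).hex(" "))     # ed a0 bd ed b8 80
```

Text that stays inside the BMP comes out byte-for-byte identical in both, so the difference only shows up for code points above U+FFFF: 4 bytes in UTF-8 versus 6 in CESU-8.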