csande17 4 days ago

Unicode really is an impossibly bottomless well of trivia and bad decisions. As another example, the article's RFC warns against allowing legacy ASCII control characters on the grounds that they can be confusing to display to humans, but says nothing about the explicit directional override characters that https://www.unicode.org/reports/tr9/#Explicit_Directional_Ov... suggests should "be avoided wherever possible, because of security concerns".
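
For reference, the set in question is small and fixed. A rough Python sketch (constant and helper names are mine) that simply strips the explicit directional formatting characters, i.e. the overrides quoted above plus the related embeddings and isolates, from untrusted input:

    BIDI_FORMATTING = {
        "\u202A",  # LEFT-TO-RIGHT EMBEDDING (LRE)
        "\u202B",  # RIGHT-TO-LEFT EMBEDDING (RLE)
        "\u202C",  # POP DIRECTIONAL FORMATTING (PDF)
        "\u202D",  # LEFT-TO-RIGHT OVERRIDE (LRO)
        "\u202E",  # RIGHT-TO-LEFT OVERRIDE (RLO)
        "\u2066",  # LEFT-TO-RIGHT ISOLATE (LRI)
        "\u2067",  # RIGHT-TO-LEFT ISOLATE (RLI)
        "\u2068",  # FIRST STRONG ISOLATE (FSI)
        "\u2069",  # POP DIRECTIONAL ISOLATE (PDI)
    }

    def strip_bidi_formatting(text: str) -> str:
        # Drop any directional formatting character, keep everything else.
        return "".join(ch for ch in text if ch not in BIDI_FORMATTING)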

weinzierl 4 days ago | parent | next [-]

I wouldn’t be so harsh. I think the Unicode Consortium not only started with good intentions but also did excellent work for the first decade or so.

I just think they got distracted when the problems got harder, and instead of tackling them head-on, they now waste a lot of their resources on busywork - good intentions notwithstanding. Sure, it’s more fun standardizing sparkling disco balls than dealing with real-world pain points. The fact that OpenType is a good and powerful standard which masks some of Unicode’s shortcomings doesn’t really help.

It’s not too late, and I hope they will find their way back to their original mission and be braver in solving long-standing issues.

zahlman 4 days ago | parent | next [-]

A big part of the problem is that the reaction to early updates was so bad that they promised they would never un-assign or re-assign a code point ever again, making it impossible for them to actually correct any mistakes (not even typos in the official standard names given to characters).

The versioning is actually almost completely backwards by semver reasoning; 1.1 should have been 2.0, 2.0 should have been 3.0 and we should still be on 3.n now (since they have since kept the promise not to remove anything).

socalgal2 4 days ago | parent | prev | next [-]

What could be better? Human languages are complex.

weinzierl 4 days ago | parent | next [-]

Yes, exactly, human languages are complex and in my opinion Unicode used to be on a good track to tackle these complexities. I just think that nowadays they are not doing enough to help people around the world solve these problems.

pas 4 days ago | parent [-]

can you describe a few examples? what are you missing? or maybe are you aware of something they rejected that would be useful?

weinzierl 4 days ago | parent [-]

The elephant in the room is Han Unification but there are plenty of other issues. Here is one of my favourites from another thread just two days ago.

https://news.ycombinator.com/item?id=44971254

This is the rejected proposal.

https://www.unicode.org/L2/L2003/03215-n2593-umlaut-trema.pd...

If you read the thread above, you will find more examples from other people.

pas 3 days ago | parent [-]

thanks! very interesting!

ah, and now I understand what the hell people mean when they put dots on "coordinate"! (but they are obviously wrong, they should use the flying point from Catalan :)

... hm, so this issue is easily more than 20 years old. and since then there's no solution (or the German libraries consider the problem "solved" and ... no one else is making proposals to the WG about this nowadays)?

also, technically - since there are already more than 150K allocated code points - adding a different combining mark seems the correct way to go, right?

or it's now universally accepted that people who want to type ambigüité need to remember to type U+034F before the ü? (... or, of course it's up to their editor/typesetter software to offer this distinction)
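
(a quick toy check in python, with sequences chosen by me, at least suggests the CGJ spelling stays distinct under normalization:)

    import unicodedata

    # "ü" two ways: plain combining diaeresis vs. U+034F COMBINING GRAPHEME
    # JOINER inserted before it. CGJ is default-ignorable, so both usually
    # render the same, but NFC keeps them distinct because the CGJ blocks
    # composition of u + U+0308.
    plain  = "ambigu\u0308ite\u0301"        # u + combining diaeresis
    tagged = "ambigu\u034f\u0308ite\u0301"  # u + CGJ + combining diaeresis

    print(unicodedata.normalize("NFC", plain))   # ambigüité, fully composed
    print(unicodedata.normalize("NFC", tagged))  # CGJ sequence is preserved
    print(unicodedata.normalize("NFC", plain) ==
          unicodedata.normalize("NFC", tagged))  # False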

regarding the Han unification, is there some kind of effort to "fix" that? (adding language-start language-end markers perhaps? or virtual code points for languages, to avoid the need for searching strings for the begin-end markers?)

4 days ago | parent | prev | next [-]
[deleted]
pas 4 days ago | parent | prev [-]

sure, but they have both human and machine stuff in the same "universe" - again, sure, it made sense, but maybe it would also make sense to have a parser that helps recover the "human stuff" from the "machine gibberish" (i.e. filter out the presentation and control stuff). but of course some in-band logic makes sense after all, for the combinations (diacritics, emoji skin color, and so on).
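
roughly this kind of thing (very rough python sketch, the category choices are mine):

    import unicodedata

    # keep the "human stuff": drop control (Cc) and format (Cf) characters,
    # keep letters and combining marks so diacritics still work.
    # caveat: this also drops ZWJ (U+200D), which emoji sequences rely on,
    # so a real filter would need an allow-list.
    def human_stuff(text: str) -> str:
        return "".join(
            ch for ch in text
            if unicodedata.category(ch) not in ("Cc", "Cf")
        )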

yk 4 days ago | parent | prev [-]

I would. The original sin of Unicode is really their manifold idea: at that point they stopped trying to write a string standard and started to become a kind of general description of what string standards should look like, with the hope that string standards which more or less conform to this description are interoperable, provided you remember which direction "string".decode() and "string".encode() go.
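
For the record, in Python 3 the direction is:

    data = "naïve".encode("utf-8")  # str -> bytes: b'na\xc3\xafve'
    text = data.decode("utf-8")     # bytes -> str: 'naïve'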

estebank 4 days ago | parent | prev | next [-]

The security concerns are those of "Trojan source", where the displayed text doesn't correspond to the bytes on the wire.[1]

I don't think a wire protocol should necessarily restrict them, for the sake of compatibility with the existing text corpora out there, but it's a fair observation.

1: https://trojansource.codes/

yencabulator 4 days ago | parent [-]

The enforcement is an app-level issue, depending on the semantics of the field. I agree it doesn't belong in the low-level transport protocol.

The rules for "username", "display name", "biography", "email address", "email body" and "contents of uploaded file with name foo.txt" are not all going to be the same.
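
A sketch of what that can look like in application code (the field names and rules here are invented for illustration, not taken from any spec):

    FIELD_RULES = {
        "username":     lambda s: s.isascii() and s.isidentifier(),
        "display_name": lambda s: "\u202E" not in s,  # e.g. no RLO tricks
        "email_body":   lambda s: True,               # accept; render carefully
    }

    def validate(field: str, value: str) -> bool:
        # Unknown fields default to "accept"; tighten per your semantics.
        return FIELD_RULES.get(field, lambda s: True)(value)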

Waterluvian 4 days ago | parent [-]

Can a regular expression be used to restrict Unicode chars like the ones described?

I’m imagining a listing of regex rules for the various gotchas, and then a validation layer that unions the ones you want.
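
Something like this, roughly (the rule names and character ranges are just examples, not taken from any standard):

    import re

    # One pattern per gotcha; the validator unions whichever rules you opt into.
    RULES = {
        "bidi_controls":   r"[\u202A-\u202E\u2066-\u2069\u200E\u200F\u061C]",
        "legacy_controls": r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]",
        "private_use":     r"[\uE000-\uF8FF]",
    }

    def reject_pattern(*names: str) -> re.Pattern:
        return re.compile("|".join(RULES[n] for n in names))

    suspicious = reject_pattern("bidi_controls", "legacy_controls")
    print(bool(suspicious.search("hello\u202Eworld")))  # True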

fluoridation 3 days ago | parent [-]

Why would you need a regular expression for that? It's just a list of characters.

Waterluvian 3 days ago | parent [-]

There are cases where it’s certain characters coming before or after others that create the issue.
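
A toy example of that kind of sequence rule (my own formulation, not from any spec): flag an embedding/override with no terminating PDF after it on the same line, and likewise an isolate with no closing PDI:

    import re

    UNTERMINATED = re.compile(
        r"[\u202A\u202B\u202D\u202E][^\u202C\n]*$"   # LRE/RLE/LRO/RLO, no PDF
        r"|[\u2066\u2067\u2068][^\u2069\n]*$",       # LRI/RLI/FSI, no PDI
        re.MULTILINE,
    )

    print(bool(UNTERMINATED.search("abc\u202Exyz")))        # True
    print(bool(UNTERMINATED.search("abc\u202Exyz\u202C")))  # False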

arp242 4 days ago | parent | prev [-]

I always thought you kind of need those directional control characters to correctly render bidi text? e.g. if you write something in Hebrew but include a Latin word/name (or the reverse).

dcrazy 4 days ago | parent | next [-]

This is the job of the Bidi Algorithm: https://www.unicode.org/reports/tr9/

Of course, this is an “annex”, not part of the core Unicode spec. So in situations where you can’t rely on the presentation layer’s (correct) implementation of the Bidi algorithm, you can fall back to directional override/embedding characters.
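
A minimal sketch of that fallback (the helper name is mine), using the isolate pair that UAX #9 now recommends over the overrides:

    # Wrap an embedded run in FSI ... PDI so its direction is resolved
    # independently of the surrounding right-to-left text.
    FSI, PDI = "\u2068", "\u2069"

    def isolate(run: str) -> str:
        return f"{FSI}{run}{PDI}"

    hebrew = "קראתי את " + isolate("Alice in Wonderland") + " אתמול"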

acdha 4 days ago | parent [-]

Over the years I’ve run into a few situations where the rules around neutral characters didn’t produce the right result, so we had to use the override characters to force the correct display. It’s completely niche, but very handy when you are mixing quotes within a complex text.

layer8 4 days ago | parent | prev [-]

Read the parent’s link. The characters “to be avoided” are a particular special-purpose subset, not directional control characters in general.