| ▲ | Waterluvian 4 days ago |
| I’m frustrated by things like Unicode where it’s “good” except… you need to know to exclude some of them. Unicode feels like a wild jungle of complexity. An understandable consequence of trying to formalize so many ways to write language. But it really sucks to have to reason about some characters being special compared to others. The only sanity I’ve found is to treat Unicode strings as if they’re some proprietary data unit format. You can accept them, store them, render them, and compare them with each other for (data, not semantic) equality. But you just don’t ever try to reason about their content. Heck I’m not even comfortable trying to concatenate them or anything like that. |
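To make the "data, not semantic" equality point concrete, here is a minimal Python sketch (the café strings and the choice of NFC are just illustrations): two strings that render identically can still differ code point by code point unless you explicitly normalize both sides.

    import unicodedata

    precomposed = "caf\u00e9"    # "café" with U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed = "cafe\u0301"    # "café" as "e" + U+0301 COMBINING ACUTE ACCENT

    print(precomposed == decomposed)   # False: different code point sequences
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))   # True: equal after normalizing both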
|
| ▲ | csande17 4 days ago | parent | next [-] |
| Unicode really is an impossibly bottomless well of trivia and bad decisions. As another example, the article's RFC warns against allowing legacy ASCII control characters on the grounds that they can be confusing to display to humans, but says nothing about the Explicit Directional Overrides characters that https://www.unicode.org/reports/tr9/#Explicit_Directional_Ov... suggests should "be avoided wherever possible, because of security concerns". |
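For anyone who wants to screen for those characters, a minimal sketch; the set below is one reading of the explicit directional formatting characters listed in UAX #9, so check it against the report before relying on it.

    # Explicit directional formatting characters (embeddings, overrides, isolates)
    # per UAX #9: LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
    BIDI_CONTROLS = {
        "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
        "\u2066", "\u2067", "\u2068", "\u2069",
    }

    def has_bidi_controls(text: str) -> bool:
        return any(ch in BIDI_CONTROLS for ch in text)

    print(has_bidi_controls("abc\u202edef"))  # True: contains U+202E RIGHT-TO-LEFT OVERRIDE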
| |
| ▲ | weinzierl 4 days ago | parent | next [-] | | I wouldn’t be so harsh. I think the Unicode Consortium not only started with good intentions but also did excellent work for the first decade or so. I just think they got distracted when the problems got harder, and instead of tackling them head-on, they now waste a lot of their resources on busywork - good intentions notwithstanding. Sure, it’s more fun standardizing sparkling disco balls than dealing with real-world pain points. That OpenType is a good and powerful standard which masks some of Unicode’s shortcomings doesn’t really help. It’s not too late, and I hope they will find their way back to their original mission and be braver in solving long-standing issues. | | |
| ▲ | zahlman 4 days ago | parent | next [-] | | A big part of the problem is that the reaction to early updates was so bad that they promised they would never un-assign or re-assign a code point ever again, making it impossible for them to actually correct any mistakes (not even typos in the official standard names given to characters). The versioning is actually almost completely backwards by semver reasoning; 1.1 should have been 2.0, 2.0 should have been 3.0 and we should still be on 3.n now (since they have since kept the promise not to remove anything). | |
| ▲ | socalgal2 4 days ago | parent | prev | next [-] | | What could be better? Human languages are complex | | |
| ▲ | weinzierl 4 days ago | parent | next [-] | | Yes, exactly, human languages are complex, and in my opinion Unicode used to be on a good track to tackle these complexities. I just think that nowadays they are not doing enough to help people around the world solve these problems. | | |
| ▲ | pas 4 days ago | parent [-] | | can you describe a few examples? what are you missing? or maybe are you aware of something they rejected that would be useful? | | |
| ▲ | weinzierl 4 days ago | parent [-] | | The elephant in the room is Han Unification, but there are plenty of other issues. Here is one of my favourites, from another thread just two days ago: https://news.ycombinator.com/item?id=44971254 This is the rejected proposal: https://www.unicode.org/L2/L2003/03215-n2593-umlaut-trema.pd... If you read the thread above, you will find more examples from other people. | | |
| ▲ | pas 3 days ago | parent [-] | | thanks! very interesting! ah, and now I understand what the hell people mean when they put dots on "coordinate"! (but they are obviously wrong; they should use the flying point from Catalan :) ... hm, so this issue is easily more than 20 years old. and since then there's been no solution (or the German libraries consider the problem "solved" and ... no one else is making proposals to the WG about this nowadays)? also, technically - since there are already more than 150K allocated code points - adding a different combining mark seems the correct way to go, right? or is it now universally accepted that people who want to type ambigüité need to remember to type U+034F before the ü? (... or, of course, it's up to their editor/typesetter software to offer this distinction) regarding the Han unification, is there some kind of effort to "fix" that? (adding language-start language-end markers perhaps? or virtual code points for languages, to avoid the need for searching strings for the begin-end markers?)
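For what it's worth, my reading of the U+034F workaround is that the COMBINING GRAPHEME JOINER sits between the base letter and the combining diaeresis, and because it blocks canonical composition, the two spellings survive normalization as distinct sequences. A quick Python check of that reading (not a statement of what the libraries actually standardized on):

    import unicodedata

    plain = "u\u0308"             # u + COMBINING DIAERESIS
    with_cgj = "u\u034f\u0308"    # u + COMBINING GRAPHEME JOINER + COMBINING DIAERESIS

    print(unicodedata.normalize("NFC", plain))           # composes to the single code point "ü" (U+00FC)
    print(len(unicodedata.normalize("NFC", with_cgj)))   # 3: the CGJ blocks composition
    print(unicodedata.normalize("NFC", plain) ==
          unicodedata.normalize("NFC", with_cgj))        # False: still distinguishable after normalization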
|
|
| |
| ▲ | 4 days ago | parent | prev | next [-] | | [deleted] | |
| ▲ | pas 4 days ago | parent | prev [-] | | sure, but they have both human and machine stuff in the same "universe". again, sure, that made sense, but maybe it would also make sense to have a parser that helps recover the "human stuff" from the "machine gibberish" (i.e. filter out the presentation and control stuff). then again, some in-band logic does make sense after all, for the combinations (diacritics, emoji skin color, and so on). |
| |
| ▲ | yk 4 days ago | parent | prev [-] | | I would. The original sin of Unicode is really their manifold idea: at that point they stopped trying to write a string standard and instead became a kinda general description of how string standards should look, with the hope that string standards which more or less conform to this description are interoperable, provided you remember which direction "string".decode() and "string".encode() go. |
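(In Python 3 terms, the direction is at least fixed by the types: str.encode() goes to bytes, and bytes.decode() comes back to str.)

    s = "naïve"
    b = s.encode("utf-8")      # str -> bytes: b'na\xc3\xafve'
    print(b.decode("utf-8"))   # bytes -> str: 'naïve'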
| |
| ▲ | estebank 4 days ago | parent | prev | next [-] | | The security concerns are those of "Trojan source", where the displayed text doesn't correspond to the bytes on the wire.[1] I don't think a wire protocol should necessarily restrict them, for the sake of compatibility with existing text corpus out there, but a fair observation. 1: https://trojansource.codes/ | | |
| ▲ | yencabulator 4 days ago | parent [-] | | The enforcement is an app-level issue, depending on the semantics of the field. I agree it doesn't belong in the low-level transport protocol. The rules for "username", "display name", "biography", "email address", "email body" and "contents of uploaded file with name foo.txt" are not all going to be the same. | | |
| ▲ | Waterluvian 4 days ago | parent [-] | | Can a regular expression be used to restrict Unicode chars like the ones described? I’m imagining a listing of regex rules for the various gotchas, and then a validation-level use that unions the ones you want. | | |
| ▲ | fluoridation 3 days ago | parent [-] | | Why would you need a regular expression for that? It's just a list of characters. | | |
| ▲ | Waterluvian 3 days ago | parent [-] | | There are cases where certain characters coming before or after others is what creates the issue. |
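A rough sketch of both points in Python: a plain character class handles the "list of characters" case, and an anchored or lookaround pattern handles simple before/after constraints. The ranges are illustrative examples, not a vetted deny-list.

    import re

    # "List of characters" case: explicit bidi embeddings/overrides/isolates.
    FORBIDDEN = re.compile(r"[\u202a-\u202e\u2066-\u2069]")

    # Context case: a combining mark at the very start of the string, with no base to attach to.
    ORPHAN_MARK = re.compile(r"\A[\u0300-\u036f]")

    def validate(text: str) -> bool:
        return not (FORBIDDEN.search(text) or ORPHAN_MARK.search(text))

    print(validate("hello"))           # True
    print(validate("abc\u202edef"))    # False: contains an RLO
    print(validate("\u0301abc"))       # False: combining acute accent with nothing before it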
|
|
|
| |
| ▲ | arp242 4 days ago | parent | prev [-] | | I always thought you kind of need those directional control characters to correctly render bidi text? e.g. if you write something in Hebrew but include a Latin word/name (or the reverse). | | |
| ▲ | dcrazy 4 days ago | parent | next [-] | | This is the job of the Bidi Algorithm: https://www.unicode.org/reports/tr9/ Of course, this is an “annex”, not part of the core Unicode spec. So in situations where you can’t rely on the presentation layer’s (correct) implementation of the Bidi algorithm, you can fall back to directional override/embedding characters. | | |
| ▲ | acdha 4 days ago | parent [-] | | Over the years I’ve run into a few situations where the rules around neutral characters didn’t produce the right result and so we had to use the override characters to force the correct display. It’s completely a niche but very handy when you are mixing quotes within a complex text. |
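As a sketch of what forcing the issue can look like (using the isolate characters that TR9 now recommends over the overrides; whether it renders correctly still depends on the presentation layer):

    # Wrap an opposite-direction fragment in FSI ... PDI so it doesn't disturb the
    # surrounding bidi run. U+2068 FIRST STRONG ISOLATE, U+2069 POP DIRECTIONAL ISOLATE.
    FSI, PDI = "\u2068", "\u2069"

    def isolate(fragment: str) -> str:
        return FSI + fragment + PDI

    # Illustrative only: a Hebrew sentence quoting a Latin-script name.
    text = "שלום " + isolate("Latin Name") + " להתראות"
    print(text)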
| |
| ▲ | layer8 4 days ago | parent | prev [-] | | Read the parent’s link. The characters “to be avoided” are a particular special-purpose subset, not directional control characters in general. |
|
|
|
| ▲ | eviks 4 days ago | parent | prev | next [-] |
Indeed, though a lot of that complexity, like surrogates and control codes, isn't due to attempts to write language; that's just awful design preserved for posterity
|
| ▲ | Etheryte 4 days ago | parent | prev | next [-] |
| As a simple example off the top of my head, if the first string ends in an orphaned emoji modifier and the second one starts with a modifiable emoji, you're already going to have trouble. It's only downhill from there with more exotic stuff. |
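A concrete way to see the concatenation hazard, using a base emoji at the end of one string and a skin-tone modifier at the start of the next; counting grapheme clusters here relies on the third-party regex package, which is an assumption on my part.

    import regex  # third-party; its \X matches extended grapheme clusters

    s1 = "rate: \U0001F44D"   # ends with U+1F44D THUMBS UP SIGN
    s2 = "\U0001F3FD ok"      # starts with U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4

    def grapheme_count(s):
        return len(regex.findall(r"\X", s))

    print(grapheme_count(s1) + grapheme_count(s2))  # counted separately: 11
    print(grapheme_count(s1 + s2))                  # 10: the modifier fuses onto the thumbs-up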
| |
| ▲ | kps 4 days ago | parent [-] | | Unicode combining/modifying/joining characters should have been prefix rather than suffix/infix, in blocks by arity. | | |
| ▲ | layer8 4 days ago | parent | next [-] | | One benefit of the suffix convention is that strings sort more usefully that way by default, without requiring special handling for those characters. Unicode 1.0 also explains: “The convention used by the Unicode standard is consistent with the logical order of other non-spacing marks in Semitic and Indic scripts, the great majority of which follow the base characters with respect to which they are positioned. To avoid the complication of defining and implementing non-spacing marks on both sides of base characters, the Unicode standard specifies that all non-spacing marks must follow their base characters. This convention conforms to the way modern font technology handles the rendering of non-spacing graphical forms, so that mapping from character store to font rendering is simplified.” | | |
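A small illustration of the sorting point with a plain code-point sort (the hypothetical prefix form is simulated by hand):

    # Suffix convention (what Unicode actually does): "a" + combining acute keeps
    # the accented string in the "a" neighborhood, before anything starting with "b".
    print(sorted(["ab", "a\u0301b", "bz"]))   # accented string sorts between "ab" and "bz"

    # Hypothetical prefix convention: the combining mark comes first, so the same
    # text would sort after every plain-ASCII string, far from the other "a" words.
    print(sorted(["ab", "\u0301ab", "bz"]))   # accented string sorts last, after "bz"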
| ▲ | kps 4 days ago | parent [-] | | Sorting is a good point. On the other hand, prefix combining characters would have vastly simplified keyboard handling, since that's exactly what typewriter dead keys are. | | |
| ▲ | layer8 4 days ago | parent | next [-] | | Keyboard input handling at that level generally isn’t character-based, and instead requires looking at scancodes and modifier keys, and sometimes also distinguishing between keyup and keydown events. You generally also don’t want to produce different Unicode sequences depending on whether you have an “é” key you can press or have to use a dead-key “’”. | | |
| ▲ | kps 4 days ago | parent [-] | | Depends on the system. X11/Wayland do it at a higher level where you have `<dead_acute> <e> : eacute` and keysyms are effectively a superset of Unicode with prefix combiners. (This can lead to weirdness since the choice of Compose rules is orthogonal to the choice of keyboard layout.) | | |
| ▲ | layer8 4 days ago | parent [-] | | I guess your conception is that one could then define <dead_acute> : <combining_acute_accent>
instead and use it for arbitrary letters. However, that would fail in locales using a non-Unicode encoding such as iso-8859-1 that only contains the combined character. Unless you have the input system post-process the mapped input again to normalize it to e.g. NFC before passing it on to the application, in which case the combination has to be reparsed anyway. So I don’t see what would be gained with regard to ease of parsing. If you want to define such a key, you can probably still do it, you’ll just have to press it in the opposite order and use backspace if you want to cancel it. The fact that dead keys happen to be prefix is in principle arbitrary; they could as well be suffix. On physical typewriters, suffix was more customary I think, i.e. you’d backspace over the character you want to accent and type the accent on top of it. To obtain just the accent, you combine it with Space either way. | | |
| ▲ | moefh 3 days ago | parent [-] | | > On physical typewriters, suffix was more customary I think Why would anyone type like that? Instead of pressing two keys (the accent key followed by the letter key), you'd need to press four (letter, backspace, accent, space bar) for no reason. | | |
| ▲ | kps 3 days ago | parent [-] | | The mechanical typewriter dead key worked by omitting the linkage that advances the carriage. That established the method of pressing the dead key and then the accompanying letter. |
|
|
|
| |
| ▲ | dcrazy 4 days ago | parent | prev [-] | | Not all input methods use dead keys to emit combining characters. |
|
| |
| ▲ | zahlman 4 days ago | parent | prev [-] | | They should have at least all used a single system. Instead, we have:
* European-style combining characters, as well as precomposed versions for some arbitrary subset of legal combinations, and nothing preventing you from stacking them arbitrarily (as in Zalgo text) or on illogical base characters (who knows what your font renderer will do if you ask to put a cedilla on a kanji? It might even work!)
* Jamo for Hangul that are three pseudo-characters representing the parts of a larger character, that have to be in order (and who knows what you're supposed to do with an invalid jamo sequence)
* Emoji that are produced by applying a "variation selector" to a normal character
* Emoji that are just single characters — including ones that used to be normal characters and were retconned to now require the variation selector to get the original appearance
* Some subset of emoji that can have a skin-tone modifier applied as a direct suffix
* Some other subset of emoji that are formed by combining other emoji, which requires a zero-width-joiner in between (because they'd also be valid separately), which might be rendered as the base components anyway if no joined glyph is available
* National flags that use a pair of abstract characters used to spell a country code; neither can be said to be the base vs the modifier (this lets them say that they never removed or changed the meaning of a "character" while still allowing for countries to change their country codes, national flags or existence status)
* Other flags that use a base flag character, followed by "tag letter" characters that were originally intended for a completely different purpose that never panned out; and also there was temporary disagreement about which base character should be used
* Other other flags that are vendor-specific but basically work like emoji with ZWJ sequences
And surely more that I've forgotten about or not learned about yet. |
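For the curious, most of the emoji-side mechanisms in that list can be reproduced in a few lines of Python; the code points below are the standard ones to the best of my knowledge, but verify before reuse.

    VS16 = "\ufe0f"   # VARIATION SELECTOR-16: request emoji presentation
    ZWJ = "\u200d"    # ZERO WIDTH JOINER

    snowman_text = "\u2603"            # U+2603 SNOWMAN, text presentation by default
    snowman_emoji = "\u2603" + VS16    # same character, emoji presentation requested

    waving = "\U0001F44B"                    # WAVING HAND SIGN
    waving_medium = waving + "\U0001F3FD"    # direct-suffix skin-tone modifier

    technologist = "\U0001F469" + ZWJ + "\U0001F4BB"   # WOMAN + ZWJ + PERSONAL COMPUTER

    flag_ca = "\U0001F1E8" + "\U0001F1E6"    # REGIONAL INDICATOR C + A spells the Canadian flag

    print(snowman_text, snowman_emoji, waving_medium, technologist, flag_ca)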
|
|
|
| ▲ | ivanjermakov 4 days ago | parent | prev [-] |
| Unicode sucks, but it sucks less than every other encoding standard. |