The “invisible symbols” are necessary to correctly represent human language. For instance, one of the most infamous Unicode control characters — the right-to-left override — is required to correctly encode mixed Latin and Hebrew text [1], which are both scripts that you mentioned. Besides, ASCII has control characters as well.

The “colorful icons” are not part of Unicode. Emoji are just characters like any other. There is a convention that applications should display them as little coloured images, but this convention has evolved on its own.

If you say that Unicode is too expansive, you would have to make a decision to exclude certain types of human communication from being encodable. In my opinion, including everything without discrimination is much preferable here.

[1]: https://en.wikipedia.org/wiki/Right-to-left_mark#Example_of_...

wruza 2 years ago | parent | next [-]

Copy this󠀠󠀼󠀼󠀼󠀠󠁉󠁳󠀠󠁴󠁨󠁩󠁳󠀠󠁮󠁥󠁣󠁥󠁳󠁳󠁡󠁲󠁹󠀠󠁴󠁯󠀠󠁣󠁯󠁲󠁲󠁥󠁣󠁴󠁬󠁹󠀠󠁲󠁥󠁰󠁲󠁥󠁳󠁥󠁮󠁴󠀠󠁨󠁵󠁭󠁡󠁮󠀠󠁬󠁡󠁮󠁧󠁵󠁡󠁧󠁥󠀿󠀠󠀾󠀾󠀾 sentence into this site and click Decode. (YMMW)

https://embracethered.com/blog/ascii-smuggler.html

hnuser123456 2 years ago | parent | next [-]

Wow. I did not expect that you could just hide arbitrary data inside totally normal-looking strings like that. If I select up to "Copy thi" and decode, there's no hidden string, but if I hold Shift+Right Arrow to select just one more character, the "s" in "this", the hidden string comes along.

	Izkata 2 years ago | parent [-]
		Based on vim's word wrapping (which shows blank spaces instead of completely hiding them), they're being rendered at the end of the line. So if that is accurate, it kind of makes sense for UI-based interactions to be off by one character.

n2d4 2 years ago | parent | prev [-]

> Is this necessary to correctly represent human language?

Yes! As soon as you have any invisible characters (e.g. RTL or LTR marks, which are required to represent human language), you will be able to encode any data you want.
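To make that claim concrete, here is a minimal sketch of how two invisible direction marks are already enough to carry arbitrary bytes. The mapping of LRM to bit 0 and RLM to bit 1, and the helper names `hide` and `reveal`, are assumptions made up for this illustration, not an established scheme.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK, used here as bit 0 (arbitrary choice)
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, used here as bit 1 (arbitrary choice)


def hide(payload: bytes) -> str:
    """Turn each payload bit into one invisible direction mark."""
    bits = "".join(f"{byte:08b}" for byte in payload)
    return "".join(RLM if bit == "1" else LRM for bit in bits)


def reveal(text: str) -> bytes:
    """Collect the direction marks in text and reassemble the bytes."""
    bits = "".join("1" if ch == RLM else "0" for ch in text if ch in (LRM, RLM))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


# The visible text is unchanged; the payload rides along between the words.
carrier = "hello" + hide(b"hi") + " world"
print(reveal(carrier))
```

This costs eight invisible characters per byte, so it is far less compact than other approaches, but it shows that removing any one class of invisible character does not remove the channel.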

wruza 2 years ago | parent [-]

How many direction marks can we see in this hidden text?

n2d4 2 years ago | parent [-]

None. It's tag characters instead, which are used in emoji tag sequences (for example, some flags). But there's no difference! Either you can smuggle text in Unicode, or you can't. It's quite binary, and you don't gain advantages from having "fewer ways" to smuggle text, but you certainly gain advantages from having emojis in your character set.
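As a rough sketch of how this kind of hidden text works: the Unicode Tags block mirrors printable ASCII at an offset of U+E0000, so each ASCII character has an invisible counterpart. The helper names `to_tags` and `from_tags` below are hypothetical, and this is not necessarily how the linked decoder is implemented.

```python
TAG_BASE = 0xE0000  # the Tags block; U+E0020..U+E007E mirror printable ASCII


def to_tags(ascii_text: str) -> str:
    """Map printable ASCII characters to their invisible tag counterparts."""
    return "".join(
        chr(TAG_BASE + ord(ch)) for ch in ascii_text if 0x20 <= ord(ch) <= 0x7E
    )


def from_tags(text: str) -> str:
    """Recover any tag-encoded ASCII hidden in text, ignoring everything else."""
    return "".join(
        chr(ord(ch) - TAG_BASE) for ch in text if 0xE0020 <= ord(ch) <= 0xE007E
    )


visible = "Copy this sentence"
# Insert the hidden payload right after the word "this".
stego = visible[:9] + to_tags("hidden") + visible[9:]
print(from_tags(stego))
```

Filtering on the range U+E0020 to U+E007E is also a simple way to detect or strip such payloads from input.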

	wruza 2 years ago | parent [-]
		This yes-man attitude is honestly unnerving. We make things worse, because they were worse! Sort of not, but were anyway. That’s an advantage, nothing to see here! Instead of praising the advantages of going insane, let us rather make (or at least strive for) a charset that makes subj work in practice, not on paper.

bawolff 2 years ago | parent | prev | next [-]

> one of the most infamous Unicode control characters — the right-to-left override

You are linking to an RLM, not an RLO. Those are different characters. RLO is generally not needed and is more special-purpose. RLM causes far fewer problems than RLO.

Really though, I feel like the newer "first strong isolate" character is much better designed and easier to understand than most of the other RTL characters.
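For reference, the three characters mentioned here can be told apart with Python's standard `unicodedata` module. This is only a lookup sketch; the `describe` helper is a name invented for the example.

```python
import unicodedata


def describe(code_point: int) -> tuple:
    """Return (U+XXXX label, character name, bidi class) for a code point."""
    ch = chr(code_point)
    return (
        f"U+{code_point:04X}",
        unicodedata.name(ch),
        unicodedata.bidirectional(ch),
    )


# RLM, RLO and FSI, in the order they come up in the comment above.
for cp in (0x200F, 0x202E, 0x2068):
    print(*describe(cp))
```

The bidi class in the last column is what distinguishes them in practice: a mark behaves like an ordinary strong right-to-left character, while an override forces the direction of the text that follows it.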

n2d4 2 years ago | parent | prev | next [-]

Granted, technically speaking emojis are not part of the "Unicode Standard", but they are standardized by the Unicode Consortium and constitute "Unicode Technical Standard #51": https://www.unicode.org/reports/tr51/

Y_Y 2 years ago | parent | prev | next [-]

I'm happy to discriminate against those damn ancient Sumerians and anyone still using goddamn Linear B.

Analemma_ 2 years ago | parent | next [-]

Sure, but removing those wouldn't make Unicode any simpler; they're just character sets. The GP is complaining about things like combining characters and diacritic modifiers, which make Unicode "ugly" but are necessary if you want to represent real languages used by billions of people.

wruza 2 years ago | parent | next [-]

I’m actually complaining about more “advanced” features like hiding text (see my comment above) or zalgoing it.

And of course endless variations of skin color and gender of three people in a pictogram of a family or something, which is purely a product of a specific subculture that doesn’t have anything in common with text/charset.

If Unicode cared about characters, which happen to be an evolving but finite set, it would simply include them all, together with exactly two direction specifiers. Instead it created a language/format/tag system within itself to build characters, most of which make zero sense to anyone in the world except grapheme linguists, if that job title even exists.

It will eventually overengineer itself into a set bigger than the set of all real characters, if it hasn't already.

The practicality and implications of such a system are clearly demonstrated by the $subj.

	int_19h 2 years ago | parent [-]
		"Zalgoing" text is just piling up combining marks, but there are plenty of real-world languages that require more than one combining mark per character to be properly spelled. Vietnamese is a rather extreme example.
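A small sketch of the Vietnamese case, using Python's standard `unicodedata` module: decomposing the single letter ệ (U+1EC7) with NFD yields a base letter plus two combining marks. The `combining_marks` helper is a name made up for this example.

```python
import unicodedata


def combining_marks(text: str) -> list:
    """Names of all combining marks in the NFD decomposition of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]


# U+1EC7, LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
marks = combining_marks("\u1ec7")
print(marks)

# "Zalgo" text uses the same mechanism, just with far more marks per letter.
zalgo = "e" + "\u0301" * 20
print(len(combining_marks(zalgo)))
```

Any rule that caps the number of combining marks per base character therefore has to allow at least two to keep ordinary Vietnamese text intact.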

Y_Y 2 years ago | parent | prev | next [-]

You're right, of course. The point I was glibly making was that Unicode has a lot of stuff in it, and you're not necessarily stomping on someone's ability to communicate by removing part of it.

I'm also concerned by having to normalize representations that use combining characters etc., but I will add that there are assumptions that you can break just by including weird charsets.

For example, the space character in Ogham, U+1680, is considered whitespace but may not be invisible, ultimately because of the mechanics of writing something that's like the branches coming off a tree, though carved around a large stone. That might be annoying to think about when you're designing a login page.
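A quick way to see the whitespace classification, sketched with Python's built-in string methods and the standard `unicodedata` module. The sample string `user\u1680name` is invented for illustration.

```python
import unicodedata

OGHAM_SPACE = "\u1680"

print(unicodedata.name(OGHAM_SPACE))      # official character name
print(unicodedata.category(OGHAM_SPACE))  # general category
print(OGHAM_SPACE.isspace())              # whether Python treats it as whitespace

# Whitespace-based splitting and stripping treat it like an ordinary space,
# even though many fonts draw it as a visible horizontal stroke.
parts = "user\u1680name".split()
stripped = ("\u1680" + "name" + "\u1680").strip()
print(parts, stripped)
```

So a login form that trims or splits on whitespace will silently act on a character the user may be able to see on screen.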

Aeolun 2 years ago | parent | prev [-]

I mean, we can just make the languages simpler? We can also remove all the hundred different ways to pronounce English sounds. All elementary students will thank you for it xD

scripturial 2 years ago | parent [-]

You can make a language simpler but old books still exist. I guess if we burn all old books and disallow a means to print these old books again, people would be happy?

Aeolun 2 years ago | parent [-]

Reprint them with new spelling? We have 500-year-old books that are unreadable. 99.99% of all books published will not be relevant to anyone who isn’t consuming them right at that moment anyway.

Lovers can read The Lord of the Rings in the ‘original’ spelling.

	int_19h 2 years ago | parent [-]
		The point is that you still want the universal encoding to be able to represent such texts.

gwervc 2 years ago | parent | prev [-]

People who should use Sumerian characters don't even use them, sadly. First, probably out of habit with their transcription, but also because missing character variants mean a lot of text couldn't be accurately represented. Also, I'm downvoting you for discriminating against me.

	Y_Y 2 years ago | parent [-]
		I know you're being funny, but that's sort of the point. There's an important "use-mention" distinction when it comes to historical character sets. You surely could try to communicate in authentic Unicode Akkadian (𒀝𒅗𒁺𒌑(𒌝)), but what's much more likely is that you really just want to refer to characters or short strings thereof while communicating anything else in a modern living language like English. I don't want to stop someone from trying to revive the language for fun or profit, but I think there's an important distinction between cases of primarily historical interest like that, and cases that are awkward but genuine like Inuktut.

account42 2 years ago | parent | prev [-]

> The “colorful icons” are not part of Unicode. Emoji are just characters like any other. There is a convention that applications should display them as little coloured images, but this convention has evolved on its own.

Ok now you're just full of shit and you know it.