wruza 10 months ago

I'll say it again: this is the consequence of Unicode trying to be a mix of html and docx, instead of a charset. It went too far for an average Joe DevGuy to understand how to deal with it, so he just selects a subset he can handle and bans everything else. HN does that too - special symbols simply get removed.

Unicode screwed itself up completely. We wanted a common charset for things like latin, extlatin, cjk, cyrillic, hebrew, etc. And we got it, for a while. Shortly after, it focused on becoming a complex file format with colorful icons and invisible symbols, which is not manageable without forcibly cutting out all that bs.

meew0 10 months ago | parent | next [-]

The “invisible symbols” are necessary to correctly represent human language. For instance, one of the most infamous Unicode control characters — the right-to-left override — is required to correctly encode mixed Latin and Hebrew text [1], which are both scripts that you mentioned. Besides, ASCII has control characters as well.

The “colorful icons” are not part of Unicode. Emoji are just characters like any other. There is a convention that applications should display them as little coloured images, but this convention has evolved on its own.

If you say that Unicode is too expansive, you would have to make a decision to exclude certain types of human communication from being encodable. In my opinion, including everything without discrimination is much preferable here.

[1]: https://en.wikipedia.org/wiki/Right-to-left_mark#Example_of_...

wruza 10 months ago | parent | next [-]

Copy this󠀠󠀼󠀼󠀼󠀠󠁉󠁳󠀠󠁴󠁨󠁩󠁳󠀠󠁮󠁥󠁣󠁥󠁳󠁳󠁡󠁲󠁹󠀠󠁴󠁯󠀠󠁣󠁯󠁲󠁲󠁥󠁣󠁴󠁬󠁹󠀠󠁲󠁥󠁰󠁲󠁥󠁳󠁥󠁮󠁴󠀠󠁨󠁵󠁭󠁡󠁮󠀠󠁬󠁡󠁮󠁧󠁵󠁡󠁧󠁥󠀿󠀠󠀾󠀾󠀾 sentence into this site and click Decode. (YMMW)

https://embracethered.com/blog/ascii-smuggler.html
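
For anyone curious how the trick works: a minimal sketch in Python (my own reconstruction of the mechanism the site describes, not its actual code) maps each ASCII character into the invisible Unicode tag block at U+E0000..U+E007F and back.

    # Hide ASCII data in invisible tag characters (U+E0000..U+E007F).
    def smuggle(visible: str, hidden: str) -> str:
        tags = "".join(chr(0xE0000 + ord(c)) for c in hidden)
        return visible + tags

    def reveal(text: str) -> str:
        # Recover anything sitting in the tag block.
        return "".join(chr(ord(c) - 0xE0000) for c in text
                       if 0xE0000 <= ord(c) <= 0xE007F)

    s = smuggle("Copy this", "<<< hidden payload >>>")
    print(len("Copy this"), len(s))  # 9 31 -- the string silently grew
    print(reveal(s))                 # <<< hidden payload >>>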

hnuser123456 10 months ago | parent | next [-]

Wow. Did not expect that you can just hide arbitrary data inside totally normal-looking strings like that. If I select up to "Copy thi" and decode, there's no hidden string, but if I hold shift+right arrow to select just "one more character", the "s" in "this", the hidden string comes along.

Izkata 10 months ago | parent [-]

Based on vim's word wrapping (which shows blank spaces instead of completely hiding them), they're being rendered at the end of the line. So if that is accurate, it kind of makes sense that UI-based interactions would be off by one character.

n2d4 10 months ago | parent | prev [-]

> Is this necessary to correctly represent human language?

Yes! As soon as you have any invisible characters (eg. RTL or LTR marks, which are required to represent human language), you will be able to encode any data you want.
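
A toy sketch of that claim (nothing standardized, purely illustrative): with just the two invisible direction marks you can already smuggle arbitrary bits.

    LRM, RLM = "\u200e", "\u200f"  # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK

    def hide_bits(data: bytes) -> str:
        bits = "".join(f"{b:08b}" for b in data)
        return "".join(RLM if bit == "1" else LRM for bit in bits)

    payload = hide_bits(b"hi")
    print(len(payload))  # 16 invisible marks encoding two bytes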

wruza 10 months ago | parent [-]

How many direction marks can we see in this hidden text?

n2d4 10 months ago | parent [-]

None — it's tag characters instead, which are used to represent emojis. But there's no difference! Either you can smuggle text in Unicode, or you can't. It's quite binary, and you don't gain advantages from having "fewer ways" to smuggle text, but you certainly gain advantages from having emojis in your characterset.
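
For reference, the intended use of tag characters is emoji tag sequences; e.g. the flag of Scotland is a black flag followed by invisible tags spelling "gbsct" (a quick sketch in Python):

    # WAVING BLACK FLAG + tag characters "gbsct" + CANCEL TAG
    scotland = "\U0001F3F4" + "".join(chr(0xE0000 + ord(c)) for c in "gbsct") + "\U000E007F"
    print(scotland)       # renders as one flag where supported
    print(len(scotland))  # 7 -- one visible glyph, seven code points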

wruza 10 months ago | parent [-]

This yesman attitude is honestly unnerving.

We make things worse because they were already worse! Well, sort of not, but they were anyway. That's an advantage, nothing to see here!

Instead of praising the advantages of going insane, let's instead make (or at least strive for) a charset that makes things like the subject of this post work in practice, not on paper.

bawolff 10 months ago | parent | prev | next [-]

> one of the most infamous Unicode control characters — the right-to-left override

You are linking to an RLM, not an RLO. Those are different characters. RLO is generally not needed and more special-purpose. RLM causes far fewer problems than RLO.

Really though, I feel like the newer "first strong isolate" character is much better designed and easier to understand than most of the other RTL characters.
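
For reference, the characters under discussion, as Python's unicodedata reports them:

    import unicodedata
    for ch in "\u200F\u202E\u2068\u2069":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+200F RIGHT-TO-LEFT MARK      (RLM: invisible hint, no override)
    # U+202E RIGHT-TO-LEFT OVERRIDE  (RLO: forces direction until popped)
    # U+2068 FIRST STRONG ISOLATE    (FSI: isolated run, direction auto-detected)
    # U+2069 POP DIRECTIONAL ISOLATE (PDI: closes the isolate)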

n2d4 10 months ago | parent | prev | next [-]

Granted, technically speaking emojis are not part of the "Unicode Standard", but they are standardized by the Unicode Consortium and constitute "Unicode Technical Standard #51": https://www.unicode.org/reports/tr51/

Y_Y 10 months ago | parent | prev | next [-]

I'm happy to discriminate against those damn ancient Sumerians and anyone still using goddamn Linear B.

Analemma_ 10 months ago | parent | next [-]

Sure, but removing those wouldn't make Unicode any simpler, they're just character sets. The GP is complaining about things like combining characters and diacritic modifiers, which make Unicode "ugly" but are necessary if you want to represent real languages used by billions of people.

wruza 10 months ago | parent | next [-]

I’m actually complaining about more “advanced” features like hiding text (see my comment above) or zalgoing it.

And of course the endless variations of skin color and gender for the three people in a pictogram of a family or something, which is purely a product of a specific subculture and doesn't have anything in common with a text/charset.

If Unicode cared about characters, which happen to be an evolving but finite set, it would simply include them all, together with exactly two direction specifiers. Instead it created a language/format/tag system within itself to build characters, most of which make zero sense to anyone in the world except for grapheme linguists, if that job title even exists.

It will eventually overengineer itself into a set bigger than the set of all real characters, if it hasn't already.

The practicality and implications of such a system are clearly demonstrated by the subject of this post.

int_19h 10 months ago | parent [-]

"Zalgoing" text is just piling up combining marks, but there are plenty of real-world languages that require more than one combining mark per character to be properly spelled. Vietnamese is a rather extreme example.

Y_Y 10 months ago | parent | prev | next [-]

You're right, of course. The point I was glibly making was that Unicode has a lot of stuff in it, and you're not necessarily stomping on someone's ability to communicate by removing part of it.

I'm also concerned by having to normalize representations that use combining characters, etc., but I will add that there are assumptions you can break just by including weird charsets.

For example, the space character in Ogham, U+1680, is considered whitespace but may not be invisible, ultimately because of the mechanics of writing a script that looks like branches coming off a tree, carved around the edge of a large stone. That might be annoying to think about when you're designing a login page.
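
You can check that directly (a quick Python illustration; the rendering, of course, depends on the font):

    # U+1680 OGHAM SPACE MARK counts as whitespace but is typically drawn as a visible stroke.
    print("\u1680".isspace())   # True
    print("a\u1680b".split())   # ['a', 'b'] -- default split() treats it as a space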

Aeolun 10 months ago | parent | prev [-]

I mean, we can just make the languages simpler? We can also remove all the hundred different ways to pronounce English sounds. All elementary students will thank you for it xD

scripturial 10 months ago | parent [-]

You can make a language simpler, but old books still exist. I guess if we burned all old books and disallowed any means of printing them again, people would be happy?

Aeolun 10 months ago | parent [-]

Reprint them with new spelling? We have 500-year-old books that are unreadable. 99.99% of all books published will not be relevant to anyone that isn't consuming them right at that moment anyway.

Lovers can read The Lord of the Rings in the ‘original’ spelling.

int_19h 10 months ago | parent [-]

The point is that you still want the universal encoding to be able to represent such texts.

gwervc 10 months ago | parent | prev [-]

People who should use Sumerian characters don't even use them, sadly. Probably first because of habit with their transcription, but also because missing variants of characters mean a lot of text couldn't be accurately represented. Also, I'm downvoting you for discriminating against me.

Y_Y 10 months ago | parent [-]

I know you're being funny, but that's sort of the point. There's an important "use-mention" distinction when it comes to historical character sets. You surely could try to communicate in authentic Unicode Akkadian (𒀝𒅗𒁺𒌑(𒌝)), but what's much more likely is that you really just want to refer to characters or short strings thereof while communicating everything else in a modern living language like English. I don't want to stop someone from trying to revive the language for fun or profit, but I think there's an important distinction between cases of primarily historical interest like that, and cases that are awkward but genuine like Inuktut.

account42 10 months ago | parent | prev [-]

> The “colorful icons” are not part of Unicode. Emoji are just characters like any other. There is a convention that applications should display them as little coloured images, but this convention has evolved on its own.

Ok now you're just full of shit and you know it.

n2d4 10 months ago | parent | prev | next [-]

> and invisible symbols

Invisible symbols were in Unicode before Unicode was even a thing (ASCII already has a few). I also don't think emojis are the reason why devs add checks like in the OP, it's much more likely that they just don't want to deal with character encoding hell.

As much as devs like to hate on emojis, they're widely adopted in the real world. Emojis are the closest thing we have to a universal language. Having them in the character encoding standard ensures that they are really universal, and supported by every platform; a loss for everyone who's trying to count the number of glyphs in a string, but a win for everyone else.

account42 10 months ago | parent [-]

> Emojis are the closest thing we have to a universal language.

What meaning does U+1F52B have? What about U+1F346?

A set of glyphs does not make a language.

jrochkind1 10 months ago | parent | prev | next [-]

Unicode has metadata on each character that would allow software to easily strip out or normalize emojis and "decorative" characters.

It might have edge-case problems -- but the characters in the OP's name would not be included.

Also, stripping out emojis may not actually be required or the right solution. If security is the concern, Unicode also has recommended processes and algorithms for dealing with that.

https://www.unicode.org/reports/tr39/
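
As a rough sketch of what that per-character metadata makes possible (far cruder than the TR39 machinery, and only an illustration of the idea): general categories alone let you drop format characters and symbols while keeping letters, marks, digits, punctuation and spaces.

    import unicodedata

    def strip_decorative(text: str) -> str:
        # Keep Letters, Marks, Numbers, Punctuation, Separators;
        # drop Cf (tag chars, bidi controls, ZWJ...), So (most emoji), etc.
        keep = ("L", "M", "N", "P", "Z")
        return "".join(ch for ch in text if unicodedata.category(ch)[0] in keep)

    print(strip_decorative("Zoë \U0001F44D\u202E!"))   # -> "Zoë !"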

We need better support, in more platforms and languages, for the Unicode functions developers actually need.

Global human language is complicated as a domain. Legacy issues in actually existing data add to the complexity. Unicode does a pretty good job at it. It's actually pretty amazing how well it does, including not just the character set and encoding, but algorithms for various kinds of normalizing, sorting, and indexing, under various localizations, etc.

It needs better support in the environments more developers are working in, with raised-to-the-top standard solutions for identified common use cases and problems, that can be implemented simply by calling a performance-optimized library function.

(And, if we really want to argue about emojis, they seem to be extremely popular, and have literally affected global culture, because people want to use them? Blaming emojis seems like blaming the user! Unicode's support for them actually supports interoperability and vendor-neutral standards for a thing that is wildly popular? But I actually don't think any of the problems or complexity we are talking about, including the OP's complaint, can or should be laid at the feet of emojis)

10 months ago | parent [-]
[deleted]
kristopolous 10 months ago | parent | prev | next [-]

There's no argument here.

We could say it's only for scripts and alphabets, OK. It includes many undeciphered writing systems from antiquity with only a small handful of extant samples.

Should we keep those character sets, which are very likely never to be used, but exclude the extremely popular emojis?

Exclude both? Why? Aren't computers capable enough?

I used to be on the anti emoji bandwagon but really, it's all indefensible. Unicode is characters of communication at an extremely inclusive level.

I'm sure some day it will also have primitive shapes and you can construct your own alphabet using them + directional modifiers, akin to a generalizable Hangul, in effect becoming some kind of wacky version of SVG that people will abuse in an ASCII-art renaissance.

So be it. Sounds great.

simonh 10 months ago | parent | next [-]

No, no, no, no, no… So then we’d get ‘the same’ character with potentially infinite different encodings. Lovely.

Unicode is a coding system, not a glyph system or font.

kristopolous 10 months ago | parent | next [-]

Fonts are already in there, and proto-glyphs are too, as generalized diacritics. There's also a large variety of generic shapes, lines, arrows, circles and boxes in both filled and unfilled varieties. Lines even have different weights. The absurdity of a custom alphabet can already be partially actualized. Formalism is merely the final step.

This conversation was had 20 years ago and your (and my) position lost. Might as well embrace the inevitable instead of insisting on the impossible.

Whether you agree with it or not won't actually affect unicode's outcome, only your own.

simonh 10 months ago | parent | next [-]

Unicode does not specify any fonts. Many fonts are defined to be consistent with the Unicode standard, but they are nevertheless emphatically not part of Unicode.

How symbols including diacritics are drawn and displayed is not a concern for Unicode, different fonts can interpret 'filled circle' or the weight of a glyph as they like, just as with emoji. By convention they generally adopt common representations but not entirely. For example try using the box drawing characters from several different fonts together. Some work, many don't.

kristopolous 10 months ago | parent [-]

You can say things like the different "styles" that exploit Unicode on a myriad of websites such as https://qaz.wtf/u/convert.cgi?text=Hello are not technically "fonts" but it's a distinction without a meaningful difference. You have script, fraktur, bold, monospace, italic...

simonh 10 months ago | parent [-]

Fraktur is interesting because it’s more a writing style, verging on a character set in its own right. However, Unicode doesn’t directly support all of its ligatures and such.

None of this is in any way justification for turning Unicode into something like SVG. Even the pseudo-drawing capabilities it does have are largely for legacy reasons.

kristopolous 10 months ago | parent [-]

Fraktur at one point was genuinely a different script

You can find texts from at least the late 1500s to the early 1900s that will switch to a Fraktur style when quoting or using German.

ANSI escape codes even accommodate it: SGR code 20: https://en.m.wikipedia.org/wiki/ANSI_escape_code#Select_Grap...

Don't ask me why, I only work here.

See also https://en.wikipedia.org/wiki/Antiqua%E2%80%93Fraktur_disput...

I also don't find any of my predictions defensible as much as I believe they're inevitable. Again I've got no agency here.

jrochkind1 10 months ago | parent | prev [-]

The diacritics are there because they were in legacy encodings, and it was decided at some point that encodings should be round-trippable between legacy encodings and unicode.

The fact that hardly anyone cares any longer about converting to any legacy non-Unicode encoding is, of course, a testament to the success of Unicode, a success that required not only technical excellence but figuring out what would actually work for people to adopt in practice. It worked. It's adopted.

I have no idea if the diacritics choice was the right one or not, but I guarantee that if it had been made differently, people would be complaining about how things aren't round-trippable from some common legacy encoding to Unicode and back, and counting that against it.

I think some combining diacritics are also necessary for some non-latin scripts, where it is (or was) infeasible to have a codepoint for every possible combination.

The choices in Unicode are not random. The fact that it has become universal (so many attempted standards have not) is a pretty good testament to its success at balancing a bunch of competing values and goals.

int_19h 10 months ago | parent [-]

It's the other way around - precombined characters (with diacritics) are there because they were in legacy encodings. But, assuming that by "generalized diacritics" OP means Unicode combining characters like U+0301, there's nothing legacy about them; on the contrary, the intent is to prefer them over precombined variants, which is why new precombined glyphs aren't added.
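
A small illustration of the distinction (Python standard library):

    import unicodedata

    precomposed = "\u00e9"    # é as one legacy-compatible code point
    combining   = "e\u0301"   # e + COMBINING ACUTE ACCENT (U+0301)
    print(precomposed == combining)                                 # False
    print(unicodedata.normalize("NFC", combining) == precomposed)   # True
    print(unicodedata.normalize("NFD", precomposed) == combining)   # True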

jrochkind1 10 months ago | parent [-]

Ah, right, thanks!

numpad0 10 months ago | parent | prev [-]

macOS already encodes Japanese filenames differently than Windows/Linux do, and I'm sure someone here mentioned the same situation in Korean.

Unicode is already a non-deterministic mess.

simonh 10 months ago | parent [-]

And that justifies making it an even more complete mess, in new and dramatically worse ways?

riwsky 10 months ago | parent | prev [-]

Like how phonetic alphabets save space compared to ideograms by just “write the word how it sounds”, the little SVG-icode would just “write the letter how it’s drawn”

kristopolous 10 months ago | parent | next [-]

Right. Semantic iconography need not be universal or even formal to be real.

Think of all the symbols notetakers invent; ideographs without even phonology assigned to them.

Being as dynamic and flexible as human expression is hard.

Emojis have even taken on this property naturally. The high-five is also the praying hands, for instance. Culturally specific semantics get assigned to the variety of shapes, such as the eggplant and the peach.

Insisting that this shouldn't happen is a losing battle against how humans construct written language. Good luck with that.

10 months ago | parent | prev [-]
[deleted]
bawolff 10 months ago | parent | prev | next [-]

There are no emoji in this guy's name.

Unicode has made some mistakes, but having all the symbols necessary for this guy's name is not one of them.

asddubs 10 months ago | parent | prev | next [-]

>We wanted a common charset for things like latin, extlatin, cjk, cyrillic, hebrew, etc. And we got it, for a while.

We didn't even get that, because slightly different-looking characters from Japanese and Chinese (and other languages) got merged into the same Unicode character due to having the same origin, meaning you have to pick a font based on the language context for text to display correctly.

tadfisher 10 months ago | parent [-]

They are the same character, though. They do not use the same glyph in different language contexts, but Unicode is a character encoding, not a font standard.
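
To make the disagreement concrete: 直 is one of the commonly cited examples; it's drawn noticeably differently by Chinese and Japanese fonts, yet it is a single code point carrying no language information, so which regional form you see depends entirely on the font or surrounding language tagging.

    ch = "直"
    print(f"U+{ord(ch):04X}")   # U+76F4, whichever language the text is in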

numpad0 10 months ago | parent | next [-]

They're not. Readers native in one version can't read the other, and there are more than a handful that got duplicated in multiple forms, so they're just not the same, only similar.

You know, the obvious presumption underlying Han unification is that the CJK languages must form a continuous dialect continuum, as if villagers living in the middle of the East China Sea between Shanghai, Nagasaki, and Gwangju spoke half-Chinese-Japanese-Korean, and the technical distinctions only exist because of rivalry or something.

Alas, people don't really erect houses on the surface of an ocean, and the CJK languages are complete isolates from each other with no known shared ancestry, so "it's gotta be all the same" thinking really doesn't work.

I know it's not very intuitive to think that Chinese and Japanese have ZERO syntactic similarity or mutual intelligibility despite the relatively tiny mental share they occupy, but it's just how things are.

tadfisher 10 months ago | parent [-]

You're making the same mistake: the languages are different, but the script is the same (or trivially derived from the Han script). The Ideographic Research Group was well aware of this, having consisted of native speakers of the languages in question.

numpad0 10 months ago | parent [-]

That's not a "mistake", that's the reality. They don't exchange, and they're not the same. "Same or trivially derived" is just a completely false statement that exists solely to justify Han unification, or maybe something that made sense in the 80s; it doesn't make literal sense.

tadfisher 10 months ago | parent [-]

> "Same or trivially derived" is just a completely false statement

You'd have to ignore a lot of reality to believe this. It's even in the names of the writing systems: Kanji, Hanja, Chữ Hán. Of course they don't exchange, because they don't carry the same meaning, just as the word "chat" means completely different things in French and English. But it is literally the same script, albeit with numerous stylistic differences and simplified forms.

numpad0 10 months ago | parent [-]

CJK native speakers can't read or write the other "trivially derived" versions of Hanzi. I don't understand why this has to be reiterated ad infinitum.

We, as native Japanese speakers, can't actually read Simplified Chinese, just like French speakers can't exactly read Cyrillic, only recognize some of it. Therefore those are different alphabet sets. Simple as that.

The "trivially derived different styles" justification assumes that to be false, that native users of all 3 major styles of Hanzi can write, at least read, the other two styles without issues. That is not true.

Итъс а реал проблем то бе cонстантлй пресентед wитҳ чараcтерс тҳат И жуст cанът реад он тҳе гроунд тҳат тҳейъре "саме".

I hope you don't get offended by the line before this, because that's "same" latin, isn't it?

Muromec 10 months ago | parent | prev | next [-]

Yes, but the same is true for overlapping characters in Cyrillic and Latin. A and А are the same glyph, so are т,к,і and t,k,i and you can even see the difference between some of those.
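
You can see the non-unification directly from the code points (quick Python check):

    import unicodedata
    for ch in "AА":   # Latin A followed by Cyrillic А
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+0041 LATIN CAPITAL LETTER A
    # U+0410 CYRILLIC CAPITAL LETTER A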

tadfisher 10 months ago | parent [-]

The duplication there is mostly to remain compatible or trivially transformable with existing encodings. Ironically, the two versions of your example "A" do look different on my device (Android), with a slightly lower x-height for the Cyrillic version.

numpad0 10 months ago | parent [-]

The irony is you calling it irony. CJK "the same or trivially derived" characters are nowhere close to that, yet they're given the same code points. CJK unified ideographs are just broken.

kmeisthax 10 months ago | parent | prev | next [-]

So when are we getting UniPhoenician?

lmm 10 months ago | parent | prev | next [-]

This is a bullshit argument that never gets applied to any other live language. The characters are different, people who actually use them in daily life recognise them as conveying different things. If a thumbs up with a different skin tone is a different character then a different pattern of lines is definitely a different character.

Dylan16807 10 months ago | parent [-]

> If a thumbs up with a different skin tone is a different character

Is it? The skin tone modifier is serving the same purpose as a variant selector for the CJK codepoint would be.
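
Both mechanisms look similar at the code-point level; a small sketch (the skin-tone modifier fuses with the base emoji, while variation selectors pick a presentation):

    thumbs_up      = "\U0001F44D"               # 👍
    with_skin_tone = thumbs_up + "\U0001F3FD"   # + EMOJI MODIFIER FITZPATRICK TYPE-4
    text_snowman   = "\u2603\uFE0E"             # snowman + VS15: text presentation
    emoji_snowman  = "\u2603\uFE0F"             # snowman + VS16: emoji presentation
    print(len(with_skin_tone))                  # 2 code points, one visible glyph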

lmm 10 months ago | parent [-]

The underlying implementation mechanism is not the issue. If unicode had actual support for Japanese characters so that when one e.g. converted text from Shift-JIS (in the default, supported way) one could be confident that one's characters would not change into different characters, I wouldn't be complaining, whether the implementation mechanism involved variant selectors or otherwise.

Dylan16807 10 months ago | parent [-]

Okay, that's fair. The support for the selectors is very half-assed and there's no other good mechanism.

asddubs 10 months ago | parent | prev [-]

It doesn't matter to me what bullshit theoretical semantics excuse there is; for practical purposes it means that UTF-8 is insufficient for displaying every human language, especially if you want Chinese and Japanese in the same document/context without switching fonts (like, say, a website).

virexene 10 months ago | parent | prev | next [-]

In what way is Unicode similar to HTML, docx, or a file format? The only features I can think of that are even remotely similar to what you're describing are emoji modifiers.

And no, this webpage is not the result of "carefully cutting out the complicated stuff from Unicode". I'm pretty sure it's just the result of not supporting Unicode in any meaningful way.

mason_mpls 10 months ago | parent | prev | next [-]

This frustration seems unnecessary; Unicode isn't more complicated than time, and we have far more than enough processing power to handle its most absurd manifestations.

We just need good libraries, which is a lot less work than inventing yet another system.

arka2147483647 10 months ago | parent [-]

The limiting factor is not compute power, but the time and understanding of a random dev somewhere.

Time also is not well understood by most programmers. Most just seem to convert it to epoch and pretend that it is continuous.

numpad0 10 months ago | parent | prev | next [-]

IMO, the sin of Unicode is that they didn't just pick local language authorities and give them standardized concepts like lines and characters, and start-of-language and end-of-language markers.

Lots of Unicode issues come from handling languages that the code is not expecting, and code currently has no means to select or report quirk support.

I suppose they didn't like getting national borders involved in technical standardization, but that's just unavoidable. It is getting involved anyway, and these problems are popping up anyway.

kmeisthax 10 months ago | parent | next [-]

This doesn't self-synchronize. Removing an arbitrary byte from the text stream (e.g. SOL / EOL) will change the meaning of codepoints far away from the site of the corruption.

What it sounds like you want is an easy way for English-language programmers to skip or strip non-ASCII text without having to reference any actual Unicode documentation. Which is a Unicode non-goal, obviously. And also very bad software engineering practice.

I'm also not sure what you're getting at with national borders and language authorities, but both of those were absolutely involved with Unicode and still are.

kevin_thibedeau 10 months ago | parent | prev | next [-]

> start-of-language and end-of-language markers

Unicode used to have language tagging, but it's been (mostly) deprecated:

https://en.wikipedia.org/wiki/Tags_(Unicode_block)

https://www.unicode.org/reports/tr7/tr7-1.html

anonymoushn 10 months ago | parent [-]

The lack of such markers prevents Unicode from encoding strings of mixed Japanese and Chinese text correctly. Or in the case of a piece of software that must accept both Chinese and Japanese names for different people, Unicode is insufficient for encoding the written forms of the names.

numpad0 10 months ago | parent [-]

Just in case this has to be said: the reason this hasn't been a problem in the past is because you could solve this problem by picking a team and completely breaking support for the others.

With rapidly improving single-image i18n in OSes and apps, "breaking support for the others" slowly became a non-ideal or hardly viable solution, and the problem surfaced.

layer8 10 months ago | parent | prev [-]

I’m working with Word documents in different languages, and few people take the care to properly tag every piece of text with the correct language. What you’re proposing wouldn’t work very well in practice.

The other historical background is that when Unicode was designed, many national character sets and encodings existed, and Unicode’s purpose was to serve as a common superset of those, as otherwise you’d need markers when switching between encodings. So the existing encodings needed to be easily convertible to Unicode (and back), without markers, for Unicode to have any chance of being adopted. This was the value proposition of Unicode, to get rid of the case distinctions between national character sets as much as possible. As a sibling comment notes, originally there were also optional language markers, which however nobody used.

throwaway290 10 months ago | parent | prev | next [-]

I bet the complex file format thing probably started at CJK. They wanted to compose Hangul and later someone had a bright idea to do the same to change the look of emojis.

Don't worry, AI is the new hotness. All they need is to unpack prompts into arbitrary images, and finally Unicode is truly unicode; all our problems will be solved forever.

Muromec 10 months ago | parent | prev | next [-]

>so he just selects a subset he can handle and bans everything else.

Yes? And the problem is?

wruza 10 months ago | parent | next [-]

The problem is the scale at which it happens and the lack of ready-to-go methods in most runtimes/libs. No one and nothing is ready for Unicode complexity out of the box, and there's little interest in unscrewing it by oneself, because it looks like an absurd minefield and likely is one, from the perspective of an average developer. So they get defensive by default, which results in things like the subject of this post.

throwaway290 10 months ago | parent | prev [-]

The next guy with a different subset? :)

Muromec 10 months ago | parent [-]

The subset is mostly defined by the jurisdiction you operate in, which usually defines a process to map names from one subset to another and is also in the business of keeping the log of said operation. The problem is not operating in a subset, but defining it wrong and not being aware there are multiple of those.

If different parts of your system operate in different jurisdictions (or interface which other systems that do), you have to pick multiple subsets and ask user to provide input for each of them.

You just can't put anything other than ASCII into either a payment card or a PNR, the minimum-length rules will differ between the two, and you can't put ASCII into a government database that explicitly rejects all ASCII letters.

throwaway290 10 months ago | parent [-]

HN does not accept emoji because of jurisdiction huh?

Muromec 10 months ago | parent [-]

That depends on what political philosophy you follow -- they either do, or are wrong and mean.

throwaway290 10 months ago | parent [-]

I was being sarcastic.

As the top comment said, if Unicode were not a joke and the epitome of feature creep, this would be a non-issue.

jojobas 10 months ago | parent | prev [-]

But hey, multiocular o!

https://en.wikipedia.org/wiki/Cyrillic_O_variants#Multiocula...

(TL;DR a bored scribe's doodle has a code point)

account42 10 months ago | parent [-]

Huh, wasn't aware of this update.

> The character was proposed for inclusion into Unicode in 2007 and incorporated as character U+A66E in Unicode version 5.1 (2008). The representative glyph had seven eyes and sat on the baseline. However, in 2021, following a tweet highlighting the character, it came to linguist Michael Everson's attention that the character in the 1429 manuscript was actually made up of ten eyes. After a 2022 proposal to change the character to reflect this, it was updated later that year for Unicode 15.0 to have ten eyes and to extend below the baseline. However, not all fonts support the ten-eyed variant as of November 2024.

So not only did they arbitrarily add a non-character (while arbitrarily combining other real characters), but they didn't even get the glyph right, and then changed it to actually match the arbitrary doodle.