wruza 18 hours ago

I'll say it again: this is the consequence of Unicode trying to be a mix of HTML and DOCX instead of a charset. It went too far for an average Joe DevGuy to understand how to deal with it, so he just selects a subset he can handle and bans everything else. HN does that too - special symbols simply get removed.

Unicode screwed itself up completely. We wanted a common charset for things like Latin, extended Latin, CJK, Cyrillic, Hebrew, etc. And we got it, for a while. Shortly after, it focused on becoming a complex file format with colorful icons and invisible symbols, which is not manageable without cutting out all that bs by force.

meew0 17 hours ago | parent | next [-]

The “invisible symbols” are necessary to correctly represent human language. For instance, one of the most infamous Unicode control characters — the right-to-left override — is required to correctly encode mixed Latin and Hebrew text [1], which are both scripts that you mentioned. Besides, ASCII has control characters as well.
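
For the curious, the "invisible but meaningful" part is visible in the character metadata. A minimal Python sketch (stdlib unicodedata; U+200F is the right-to-left mark from the linked example):

    import unicodedata

    # An invisible strong-RTL character still participates in the
    # bidirectional algorithm, which is how it fixes display order.
    rlm = "\u200F"  # RIGHT-TO-LEFT MARK
    for ch in "a\u05D0" + rlm:  # Latin a, Hebrew alef, RLM
        print(hex(ord(ch)), unicodedata.bidirectional(ch), unicodedata.category(ch))
    # 0x61   L  Ll  (strong left-to-right)
    # 0x5d0  R  Lo  (strong right-to-left)
    # 0x200f R  Cf  (invisible format character, still strong RTL)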

The “colorful icons” are not part of Unicode. Emoji are just characters like any other. There is a convention that applications should display them as little coloured images, but this convention has evolved on its own.

If you say that Unicode is too expansive, you would have to make a decision to exclude certain types of human communication from being encodable. In my opinion, including everything without discrimination is much preferable here.

[1]: https://en.wikipedia.org/wiki/Right-to-left_mark#Example_of_...

wruza 16 hours ago | parent | next [-]

Copy this󠀠󠀼󠀼󠀼󠀠󠁉󠁳󠀠󠁴󠁨󠁩󠁳󠀠󠁮󠁥󠁣󠁥󠁳󠁳󠁡󠁲󠁹󠀠󠁴󠁯󠀠󠁣󠁯󠁲󠁲󠁥󠁣󠁴󠁬󠁹󠀠󠁲󠁥󠁰󠁲󠁥󠁳󠁥󠁮󠁴󠀠󠁨󠁵󠁭󠁡󠁮󠀠󠁬󠁡󠁮󠁧󠁵󠁡󠁧󠁥󠀿󠀠󠀾󠀾󠀾 sentence into this site and click Decode. (YMMV)

https://embracethered.com/blog/ascii-smuggler.html

n2d4 14 hours ago | parent | next [-]

> Is this necessary to correctly represent human language?

Yes! As soon as you have any invisible characters (e.g. RTL or LTR marks, which are required to represent human language), you will be able to encode any data you want.
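
To make that concrete, here's a hypothetical sketch: any two invisible characters form a binary channel. (Python, using the LRM/RLM marks as the 0/1 alphabet; nothing here is specific to those two characters.)

    LRM, RLM = "\u200E", "\u200F"  # invisible direction marks

    def hide(payload: bytes) -> str:
        bits = "".join(f"{b:08b}" for b in payload)
        return "".join(RLM if bit == "1" else LRM for bit in bits)

    def reveal(s: str) -> bytes:
        bits = "".join("1" if ch == RLM else "0"
                       for ch in s if ch in (LRM, RLM))
        return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

    carrier = "innocent text" + hide(b"hi")
    print(reveal(carrier))  # b'hi'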

wruza 7 hours ago | parent [-]

How many direction marks can we see in this hidden text?

n2d4 3 hours ago | parent [-]

None — it's tag characters instead, which are used to represent emojis. But there's no difference! Either you can smuggle text in Unicode, or you can't. It's quite binary, and you don't gain advantages from having "fewer ways" to smuggle text, but you certainly gain advantages from having emojis in your character set.
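
The linked smuggler boils down to something like this sketch: Unicode tag characters U+E0020..U+E007E shadow printable ASCII, and most renderers draw none of them.

    TAG = 0xE0000  # tag block: U+E0020..U+E007E mirrors ASCII 0x20..0x7E

    def smuggle(msg: str) -> str:
        return "".join(chr(TAG + ord(c)) for c in msg)  # invisible in most UIs

    def unsmuggle(s: str) -> str:
        return "".join(chr(ord(c) - TAG) for c in s
                       if 0xE0020 <= ord(c) <= 0xE007E)

    carrier = "Copy this" + smuggle("<<< hidden payload >>>")
    print(carrier)             # looks like: Copy this
    print(unsmuggle(carrier))  # <<< hidden payload >>>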

hnuser123456 15 hours ago | parent | prev [-]

Wow. Did not expect you could just hide arbitrary data inside totally normal-looking strings like that. If I select up to "Copy thi" and decode, there's no hidden string, but just holding shift+right arrow to select "one more character", the "s" in "this", the hidden string comes along.

Izkata an hour ago | parent [-]

Based on vim's word wrapping (which shows blank spaces instead of completely hiding them), they're being rendered at the end of the line. So if that is accurate, it kind of makes sense for UI-based interactions to be off by one character.

bawolff 17 hours ago | parent | prev | next [-]

> one of the most infamous Unicode control characters — the right-to-left override

You are linking to an RLM, not an RLO. Those are different characters. RLO is generally not needed and more special-purpose. RLM causes far fewer problems than RLO.

Really though, I feel like the newer "first strong isolate" character is much better designed and easier to understand than most of the other RTL characters.

n2d4 17 hours ago | parent | prev | next [-]

Granted, technically speaking emojis are not part of the "Unicode Standard", but they are standardized by the Unicode Consortium and constitute "Unicode Technical Standard #51": https://www.unicode.org/reports/tr51/

Y_Y 17 hours ago | parent | prev [-]

I'm happy to discriminate against those damn ancient Sumerians and anyone still using goddamn Linear B.

Analemma_ 16 hours ago | parent | next [-]

Sure, but removing those wouldn't make Unicode any simpler, they're just character sets. The GP is complaining about things like combining characters and diacritic modifiers, which make Unicode "ugly" but are necessary if you want to represent real languages used by billions of people.

wruza 6 hours ago | parent | next [-]

I’m actually complaining about more “advanced” features like hiding text (see my comment above) or zalgoing it.

And of course endless variations of skin color and gender of three people in a pictogram of a family or something, which is purely a product of a specific subculture that doesn’t have anything in common with text/charset.

If Unicode cared about characters, which form an evolving but finite set, it would simply include them all, together with exactly two direction specifiers. Instead it created a language/format/tag system within itself to build characters, most of which make zero sense to anyone in the world except for grapheme linguists, if that job title even exists.

It will eventually overengineer itself into a set bigger than the set of all real characters, if not yet.

The practicality and implications of such a system are clearly demonstrated by the $subj.

Y_Y 16 hours ago | parent | prev | next [-]

You're right, of course. The point I was glibly making was that Unicode has a lot of stuff in it, and you're not necessarily stomping on someone's ability to communicate by removing part of it.

I'm also concerned by having to normalize representations that use combining characters etc., but I will add that there are assumptions you can break just by including weird charsets.

For example, the space character in Ogham, U+1680, is considered whitespace but may not be invisible, ultimately because of the mechanics of writing something that's like the branches coming off a tree, though carved around a large stone. That might be annoying to think about when you're designing a login page.
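
You can check that property directly; a quick Python illustration:

    ogham_space = "\u1680"  # OGHAM SPACE MARK: whitespace, often drawn as a visible stem
    print(ogham_space.isspace())           # True
    print(("user" + ogham_space).strip())  # 'user': stripped like any whitespace
    print(f"[{ogham_space}]")              # may render with a visible mark inside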

Aeolun 15 hours ago | parent | prev [-]

I mean, we can just make the languages simpler? We can also remove all the hundred different ways to pronounce English sounds. All elementary students will thank you for it xD

scripturial 15 hours ago | parent [-]

You can make a language simpler, but old books still exist. I guess if we burned all old books and disallowed any means of printing them again, people would be happy?

Aeolun 5 hours ago | parent [-]

Reprint them with new spelling? We have 500-year-old books that are unreadable. 99.99% of all books published will not be relevant to anyone that isn't consuming them right at that moment anyway.

Lovers can read The Lord of the Rings in the 'original' spelling.

gwervc 14 hours ago | parent | prev [-]

People who should use Sumerian characters don't even use them, sadly. First probably because of habit with their transcription, but also because missing variants of characters mean a lot of text couldn't be accurately represented. Also, I'm downvoting you for discriminating against me.

Y_Y 4 hours ago | parent [-]

I know you're being funny, but that's sort of the point. There's an important "use-mention" distinction when it comes to historical character sets. You surely could try to communicate in authentic Unicode Akkadian (𒀝𒅗𒁺𒌑(𒌝)), but what's much more likely is that you really just want to refer to characters or short strings thereof while communicating everything else in a modern living language like English. I don't want to stop someone from trying to revive the language for fun or profit, but I think there's an important distinction between cases of primarily historical interest like that, and cases that are awkward but genuine like Inuktut.

n2d4 17 hours ago | parent | prev | next [-]

> and invisible symbols

Invisible symbols were in Unicode before Unicode was even a thing (ASCII already has a few). I also don't think emojis are the reason why devs add checks like in the OP; it's much more likely that they just don't want to deal with character encoding hell.

As much as devs like to hate on emojis, they're widely adopted in the real world. Emojis are the closest thing we have to a universal language. Having them in the character encoding standard ensures that they are really universal, and supported by every platform; a loss for everyone who's trying to count the number of glyphs in a string, but a win for everyone else.
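
The glyph-counting pain is real, though. A quick Python illustration with a ZWJ family emoji (the grapheme-cluster count uses the third-party regex module, since stdlib re has no \X):

    import regex  # pip install regex

    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man + ZWJ + woman + ZWJ + girl
    print(len(family))                        # 5 code points
    print(len(family.encode("utf-8")))        # 18 bytes
    print(len(regex.findall(r"\X", family)))  # 1 grapheme cluster, i.e. "one emoji"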

jrochkind1 16 hours ago | parent | prev | next [-]

Unicode has metadata on each character that would allow software to easily strip out or normalize emojis and "decorative" characters.
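
For instance, a rough Python sketch using just the general-category metadata (the actual Emoji properties live in UTS #51 data files, which the stdlib doesn't expose, so this is the blunt version):

    import unicodedata

    def strip_invisibles(s: str) -> str:
        # Cf = format chars (tag chars, RLO/RLM, ZWJ...), Co = private use,
        # Cn = unassigned. Note: dropping ZWJ also breaks emoji sequences.
        return "".join(ch for ch in s
                       if unicodedata.category(ch) not in {"Cf", "Co", "Cn"})

    print(strip_invisibles("Copy this\U000E0048\U000E0049 sentence"))
    # 'Copy this sentence': the tag characters are gone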

It might have edge case problems -- but the characters in the OP's name would not be included.

Also, stripping out emojis may not actually be required or the right solution. If security is the concern, Unicode also has recommended processes and algorithms for dealing with that.

https://www.unicode.org/reports/tr39/

We need better support, in more platforms and languages, for the Unicode functions developers actually need.

Global human language is complicated as a domain. Legacy issues in actually existing data add to the complexity. Unicode does a pretty good job at it; it's actually pretty amazing how well it does, covering a lot more than just the character set and encoding: algorithms for various kinds of normalizing, sorting, and indexing under various localizations, etc.

It needs better support in the environments more developers are working in, with raised-to-the-top standard solutions for identified common use cases and problems, that can be implemented simply by calling a performance-optimized library function.

(And if we really want to argue about emojis: they seem to be extremely popular, and have literally affected global culture, because people want to use them? Blaming emojis seems like blaming the user! Unicode's support for them actually provides interoperability and vendor-neutral standards for a thing that is wildly popular? But I actually don't think any of the problems or complexity we are talking about, including the OP's complaint, can or should be laid at the feet of emojis.)

2 hours ago | parent [-]
[deleted]
kristopolous 17 hours ago | parent | prev | next [-]

There's no argument here.

We could say it's only for scripts and alphabets, OK. But it includes many undeciphered writing systems from antiquity with only a small handful of extant samples.

Should we keep those character sets, which will very likely never be used, but exclude the extremely popular emojis?

Exclude both? Why? Aren't computers capable enough?

I used to be on the anti-emoji bandwagon, but really, that position is indefensible. Unicode is characters of communication at an extremely inclusive level.

I'm sure some day it will also have primitive shapes, and you'll be able to construct your own alphabet using them plus directional modifiers, akin to a generalizable Hangul, in effect becoming some kind of wacky version of SVG that people will abuse in an ASCII art renaissance.

So be it. Sounds great.

simonh 16 hours ago | parent | next [-]

No, no, no, no, no… So then we’d get ‘the same’ character with potentially infinite different encodings. Lovely.

Unicode is a coding system, not a glyph system or font.

kristopolous 16 hours ago | parent | next [-]

Fonts are already in there, and proto-glyphs are too, as generalized diacritics. There's also a large variety of generic shapes, lines, arrows, circles and boxes in both filled and unfilled varieties. Lines even have different weights. The absurdity of a custom alphabet can already be partially actualized. Formalism is merely the final step.

This conversation was had 20 years ago and your (and my) position lost. Might as well embrace the inevitable instead of insisting on the impossible.

Whether you agree with it or not won't actually affect unicode's outcome, only your own.

simonh 6 hours ago | parent [-]

Unicode does not specify any fonts. Though many fonts are defined to be consistent with the Unicode standard, they are nevertheless emphatically not part of Unicode.

How symbols, including diacritics, are drawn and displayed is not a concern for Unicode; different fonts can interpret 'filled circle' or the weight of a glyph as they like, just as with emoji. By convention they generally adopt common representations, but not entirely. For example, try using the box-drawing characters from several different fonts together. Some work, many don't.

numpad0 8 hours ago | parent | prev [-]

macOS already encodes Japanese filenames differently from what Windows/Linux do, and I'm sure someone mentioned the same situation in Korean here.

Unicode is already a non-deterministic mess.
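
(The filename difference is mostly NFC vs NFD normalization: older macOS filesystems stored decomposed forms. A quick Python illustration:)

    import unicodedata

    ga_nfc = "\u304C"                              # Japanese が, precomposed
    ga_nfd = unicodedata.normalize("NFD", ga_nfc)  # か + combining voiced-sound mark
    print(ga_nfc == ga_nfd)                        # False: different code points
    print([hex(ord(c)) for c in ga_nfd])           # ['0x304b', '0x3099']
    print(unicodedata.normalize("NFC", ga_nfd) == ga_nfc)  # True once renormalized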

simonh 6 hours ago | parent [-]

And that justifies making it an even more complete mess, in new and dramatically worse ways?

riwsky 17 hours ago | parent | prev [-]

Like how phonetic alphabets save space compared to ideograms by just “write the word how it sounds”, the little SVG-icode would just “write the letter how it’s drawn”

kristopolous 15 hours ago | parent | next [-]

Right. Semantic iconography need not be universal or even formal to be real.

Think of all the symbols notetakers invent; ideographs without even phonology assigned to them.

Being as dynamic and flexible as human expression is hard.

Emojis have even taken on this property naturally. The high-five is also the praying hands, for instance. Culturally specific semantics get assigned to the variety of shapes, such as the eggplant and the peach.

Insisting that this shouldn't happen is a losing battle against how humans construct written language. Good luck with that.

16 hours ago | parent | prev [-]
[deleted]
bawolff 17 hours ago | parent | prev | next [-]

There are no emoji in this guy's name.

Unicode has made some mistakes, but having all the symbols necessary for this guy's name is not one of them.

mason_mpls 17 hours ago | parent | prev | next [-]

This frustration seems unnecessary; Unicode isn't more complicated than time, and we have far more than enough processing power to handle its most absurd manifestations.

We just need good libraries, which is a lot less work than inventing yet another system.

arka2147483647 17 hours ago | parent [-]

The limiting factor is not compute power, but the time and understanding of a random dev somewhere.

Time also is not well understood by most programmers. Most just seem to convert it to an epoch timestamp and pretend that it is continuous.

virexene 17 hours ago | parent | prev | next [-]

In what way is Unicode similar to HTML, DOCX, or a file format? The only features I can think of that are even remotely similar to what you're describing are emoji modifiers.

And no, this webpage is not the result of "carefully cutting out the complicated stuff from Unicode". I'm pretty sure it's just the result of not supporting Unicode in any meaningful way.

asddubs 17 hours ago | parent | prev | next [-]

>We wanted a common charset for things like latin, extlatin, cjk, cyrillic, hebrew, etc. And we got it, for a while.

We didn't even get that, because slightly different-looking characters from Japanese and Chinese (and other languages) got merged into the same character in Unicode due to having the same origin, meaning you have to use a font based on the language context for it to display correctly.

tadfisher 17 hours ago | parent [-]

They are the same character, though. They do not use the same glyph in different language contexts, but Unicode is a character encoding, not a font standard.

numpad0 16 hours ago | parent | next [-]

They're not. Readers native in one version can't read the other, and there are more than a handful that got duplicated in multiple forms, so they're just not the same, only similar.

You know, the obvious presumption underlying Han Unification is that the CJK languages must form a continuous dialect continuum, as if villagers living in the middle of the East China Sea between Shanghai, Nagasaki, and Gwangju would speak half-Chinese-Japanese-Korean, and the technical distinctions only exist because of rivalry or something.

Alas, people don't really erect houses on the surface of an ocean, and the CJK languages are each complete isolates with no known shared ancestry, so "it's gotta be all the same" thinking really doesn't work.

I know it's not very intuitive that Chinese and Japanese have ZERO syntactic similarity or mutual intelligibility, despite the relatively tiny mental share they occupy, but it's just how things are.

tadfisher 14 hours ago | parent [-]

You're making the same mistake: the languages are different, but the script is the same (or trivially derived from the Han script). The Ideographic Research Group was well aware of this, having consisted of native speakers of the languages in question.

numpad0 9 hours ago | parent [-]

That's not a "mistake", that's the reality. They don't interchange, and they're not the same. "Same or trivially derived" is just a completely false statement that exists solely to justify Han Unification, or maybe something that made sense in the 80s; it doesn't make literal sense.

Muromec 16 hours ago | parent | prev | next [-]

Yes, but the same is true for overlapping characters in Cyrillic and Latin. A and А are the same glyph, and so are т, к, і and t, k, i, and you can even see the difference between some of those.
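
(Easy to verify that they're distinct code points even when the glyphs match; a quick Python check:)

    import unicodedata

    pairs = [("A", "\u0410"), ("t", "\u0442"), ("k", "\u043A"), ("i", "\u0456")]
    for latin, cyrillic in pairs:
        print(latin == cyrillic, unicodedata.name(latin), "/", unicodedata.name(cyrillic))
    # False LATIN CAPITAL LETTER A / CYRILLIC CAPITAL LETTER A
    # False LATIN SMALL LETTER T / CYRILLIC SMALL LETTER TE
    # ...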

tadfisher 14 hours ago | parent [-]

The duplication there is mostly to remain compatible with, or trivially transformable from, existing encodings. Ironically, the two versions of your example "A" do look different on my device (Android), with a slightly lower x-height for the Cyrillic version.

numpad0 7 hours ago | parent [-]

The irony is you calling it irony. CJK "same or trivially derived" characters are nowhere close to that, yet are given the same code points. CJK Unified Ideographs is just broken.

kmeisthax 16 hours ago | parent | prev | next [-]

So when are we getting UniPhoenician?

lmm 13 hours ago | parent | prev | next [-]

This is a bullshit argument that never gets applied to any other living language. The characters are different; people who actually use them in daily life recognise them as conveying different things. If a thumbs-up with a different skin tone is a different character, then a different pattern of lines is definitely a different character.

Dylan16807 11 hours ago | parent [-]

> If a thumbs up with a different skin tone is a different character

Is it? The skin tone modifier is serving the same purpose as a variation selector for a CJK codepoint would.
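
In code the two mechanisms do look alike; a sketch, with the caveat that whether the CJK variant actually renders differently depends on font/IVD support (葛 is a commonly cited example):

    thumbs_dark = "\U0001F44D\U0001F3FF"  # thumbs up + dark skin tone modifier
    katsu_ivs   = "\u845B\U000E0100"      # 葛 + VARIATION SELECTOR-17 (an IVS)
    print(len(thumbs_dark), len(katsu_ivs))  # 2 2: base character + modifier each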

lmm 9 hours ago | parent [-]

The underlying implementation mechanism is not the issue. If unicode had actual support for Japanese characters so that when one e.g. converted text from Shift-JIS (in the default, supported way) one could be confident that one's characters would not change into different characters, I wouldn't be complaining, whether the implementation mechanism involved variant selectors or otherwise.

Dylan16807 9 hours ago | parent [-]

Okay, that's fair. The support for the selectors is very half-assed and there's no other good mechanism.

asddubs 8 hours ago | parent | prev [-]

It doesn't matter to me what bullshit semantic-theoretical excuse there is; for practical purposes it means that UTF-8 is insufficient for displaying every human language, especially if you want Chinese and Japanese in the same document/context without switching fonts (like, say, a website).

numpad0 16 hours ago | parent | prev | next [-]

IMO, the sin of Unicode is that they didn't just pick local language authorities and give them standardized concepts like lines and characters, and start-of-language and end-of-language markers.

Lots of Unicode issues come from handling languages that the code is not expecting, and code currently has no means to select or report quirk support.

I suppose they didn't like getting national borders involved in technical standardization, but that's just unavoidable. They're getting involved anyway, and these problems are popping up anyway.

kmeisthax 16 hours ago | parent | next [-]

This doesn't self-synchronize. Removing an arbitrary byte from the text stream (e.g. SOL / EOL) will change the meaning of codepoints far away from the site of the corruption.
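
(UTF-8 itself is self-synchronizing, which is exactly what stateful markers would give up. A quick Python demonstration:)

    data = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
    corrupt = data[:1] + data[2:]   # drop one byte mid-sequence
    print(corrupt.decode("utf-8", errors="replace"))
    # 'h�llo': the damage stays local instead of shifting everything after it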

What it sounds like you want is an easy way for English-language programmers to skip or strip non-ASCII text without having to reference any actual Unicode documentation. Which is a Unicode non-goal, obviously. And also very bad software engineering practice.

I'm also not sure what you're getting at with national borders and language authorities, but both of those were absolutely involved with Unicode and still are.

kevin_thibedeau 16 hours ago | parent | prev | next [-]

> start-of-language and end-of-language markers

Unicode used to have language tagging, but it has been (mostly) deprecated:

https://en.wikipedia.org/wiki/Tags_(Unicode_block)

https://www.unicode.org/reports/tr7/tr7-1.html

anonymoushn 15 hours ago | parent [-]

The lack of such markers prevents Unicode from encoding strings of mixed Japanese and Chinese text correctly. Or in the case of a piece of software that must accept both Chinese and Japanese names for different people, Unicode is insufficient for encoding the written forms of the names.

layer8 15 hours ago | parent | prev [-]

I’m working with Word documents in different languages, and few people take the care to properly tag every piece of text with the correct language. What you’re proposing wouldn’t work very well in practice.

The other historical background is that when Unicode was designed, many national character sets and encodings existed, and Unicode’s purpose was to serve as a common superset of those, as otherwise you’d need markers when switching between encodings. So the existing encodings needed to be easily convertible to Unicode (and back), without markers, for Unicode to have any chance of being adopted. This was the value proposition of Unicode, to get rid of the case distinctions between national character sets as much as possible. As a sibling comment notes, originally there were also optional language markers, which however nobody used.

throwaway290 17 hours ago | parent | prev | next [-]

I bet the complex file format thing probably started with CJK. They wanted to compose Hangul, and later someone had the bright idea to do the same to change the look of emojis.

Don't worry, AI is the new hotness. All they need is to unpack prompts into arbitrary images, and finally Unicode will truly be unicode; all our problems will be solved forever.

jojobas 15 hours ago | parent | prev | next [-]

But hey, multiocular o!

https://en.wikipedia.org/wiki/Cyrillic_O_variants#Multiocula...

(TL;DR a bored scribe's doodle has a code point)

Muromec 17 hours ago | parent | prev [-]

>so he just selects a subset he can handle and bans everything else.

Yes? And the problem is?

wruza 14 hours ago | parent | next [-]

The problem is the scale at which it happens and the lack of ready-to-go methods in most runtimes/libs. No one and nothing is ready for Unicode complexity out of the box, and there's little interest in unscrewing it oneself, because it looks like an absurd minefield, and likely is one, from the perspective of an average developer. So they get defensive by default, which results in $subj.

throwaway290 17 hours ago | parent | prev [-]

The next guy with a different subset? :)

Muromec 17 hours ago | parent [-]

The subset is mostly defined by the jurisdiction you operate in, which usually defines a process to map names from one subset to another and is also in the business of keeping a log of said operations. The problem is not operating in a subset, but defining it wrong and not being aware that there are multiple of them.

If different parts of your system operate in different jurisdictions (or interface which other systems that do), you have to pick multiple subsets and ask user to provide input for each of them.

You just can't put anything other than ASCII into either a payment card or a PNR (and the minimum-length rules differ between the two), and you can't put ASCII into the government database that explicitly rejects all ASCII letters.

throwaway290 3 hours ago | parent [-]

HN does not accept emoji because of jurisdiction, huh?

Muromec an hour ago | parent [-]

That depends on what political philosophy you follow -- they either do, or are wrong and mean.