danhau 7 days ago
Are you referring to Unicode? Because UTF-8 is simple and relatively straightforward to parse. Unicode definitely has its faults, but on the whole it's great. I'll take Unicode w/ UTF-8 any day over the mess of encodings we had before it. Needless to say, Unicode is not a good fit for every scenario.
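To make the "straightforward to parse" claim concrete, here is a minimal sketch of a strict single-code-point decoder in C. The name `utf8_decode` and its signature are hypothetical, chosen only for illustration. It assumes the usual well-formedness rules (no overlong forms, no surrogates, nothing above U+10FFFF) and covers the byte-level encoding only, not anything at the Unicode level such as normalization or grapheme clusters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the n bytes at s.
 * Returns the number of bytes consumed (1 to 4), or 0 on malformed input.
 * Rejects overlong forms, surrogates (U+D800..U+DFFF) and values
 * above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    uint32_t cp, min;
    size_t len;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                      /* 0xxxxxxx: ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx: 2-byte sequence */
        len = 2; cp = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx: 3-byte sequence */
        len = 3; cp = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx: 4-byte sequence */
        len = 4; cp = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                           /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                           /* truncated sequence */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte (10xxxxxx) */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min)
        return 0;                           /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                           /* UTF-16 surrogate, not a scalar value */
    if (cp > 0x10FFFF)
        return 0;                           /* beyond the Unicode range */

    *out = cp;
    return len;
}
```

A caller would loop over a buffer, advancing by the returned length and treating 0 as an error (for example by emitting U+FFFD and skipping one byte). How to recover from errors is a policy choice this sketch leaves open.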
xg15 7 days ago
I think GP is really talking about extended grapheme clusters (at least the mention of invisible glyph injection makes me think that). Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with: modifier characters, tags, zero-width joiners with magic emoji combinations, etc. So you need both a copy of the character database and knowledge of how those various invisible characters interact.
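A small C sketch of why clusters are a separate problem from decoding: the family emoji below is usually drawn as one glyph, but it is five code points (three emoji joined by two zero-width joiners) and 18 bytes of UTF-8. The function name `utf8_count_code_points` and the `FAMILY` constant are hypothetical, and the counter assumes its input is already valid UTF-8. It does not attempt cluster segmentation, which needs the rules and property tables from the Unicode text segmentation annex.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string that is
 * assumed to be valid UTF-8. Every byte that is not a continuation byte
 * (10xxxxxx) starts a new code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* U+1F468 MAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL.
 * Renderers that support the sequence show this as a single glyph. */
static const char FAMILY[] =
    "\xF0\x9F\x91\xA8" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA9" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA7";
```

Here `utf8_count_code_points(FAMILY)` gives 5 and `strlen(FAMILY)` gives 18, while a user would say the string contains one character. Neither number can be turned into the cluster count without the character database.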
spyrja 7 days ago
Just as an example of what I am talking about, this is my current UTF-8 parser which I have been using for a few years now.
Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But I was obviously wrong, and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.