gritzko 4 days ago
I specialize in protocol design, unfortunately. A while ago I had to write some Unicode conversion routines from scratch, and I must say I absolutely admire UTF-8. Unicode per se is a dumpster fire, likely for objective reasons. Dealing with multiple Unicode encodings is a minefield. I even wrote an angry write-up back then: https://web.archive.org/web/20231001011301/http://replicated... UTF-8 made it all relatively neat back in the day. There are still ways to throw a wrench into the gears. For example, how do you handle UTF-8 encoded surrogate pairs? But at least one can filter that out as suspicious/malicious behavior.
sedatk 4 days ago
> For example, how do you handle UTF-8 encoded surrogate pairs?

Surrogate pairs aren’t applicable to UTF-8. That part of the Unicode code space is simply invalid in UTF-8 and should be treated as such (as a parsing error, as invalid characters, etc.).
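
A minimal sketch of what "treat it as invalid" could look like, in Python. The helper name `has_encoded_surrogate` is hypothetical, made up for illustration. It relies on the fact that a code point in U+D800..U+DFFF, if pushed through the UTF-8 bit layout, always comes out as the byte 0xED followed by a byte in 0xA0..0xBF, and that pattern never appears in well-formed UTF-8. Python's built-in strict decoder already rejects such input, so the explicit scan is only useful if you want to flag the input as suspicious rather than merely fail to decode it.

```python
def has_encoded_surrogate(data: bytes) -> bool:
    """Return True if data contains a UTF-8-style encoding of U+D800..U+DFFF.

    Hypothetical helper: such code points would encode as ED A0..BF xx.
    0xED is never a continuation byte, so ED followed by A0..BF is
    unambiguous and is never valid UTF-8.
    """
    for i in range(len(data) - 1):
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF:
            return True
    return False


def decode_or_reject(data: bytes) -> str:
    """Decode strictly; raise ValueError with a clearer message on surrogates."""
    if has_encoded_surrogate(data):
        raise ValueError("input contains UTF-8 encoded surrogates")
    # Strict decoding also rejects every other kind of malformed sequence.
    return data.decode("utf-8", errors="strict")


# Build a deliberately malformed input: a surrogate pair forced into UTF-8.
# "surrogatepass" is the standard error handler that permits this.
bad = "\ud83d\ude00".encode("utf-8", "surrogatepass")
good = "\U0001F600".encode("utf-8")  # the same character, encoded properly
```

Here `decode_or_reject(good)` returns the one-character string, while `decode_or_reject(bad)` raises `ValueError`. Whether to reject, replace with U+FFFD, or log and drop is a policy choice for the protocol, not something the encoding decides for you.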
cryptonector 4 days ago
> Unicode per se is a dumpster fire

Maybe as far as emoji go, but otherwise, no, Unicode is not a dumpster fire. Unicode is elegant, and all the things that people complain about in Unicode are actually problems in human scripts.