lyu07282 10 months ago

> The only validation I'd always enforce is some sane length limit, [..]

Venture into the abyss of UTF-8 and behold the madness of multibyte characters. Diacritics dance devilishly upon characters, deceiving your simple count. Think a letter is but a single entity? Fools! Combining characters lurk in the shadows, binding invisibly, elongating the uninitiated's count into chaos. Every attempt to enumerate the true length of a string in UTF-8 conjures a specter of complications. Behold, a single glyph, yet multiple bytes cackle beneath, a multitude of codepoints coalesce in arcane unison. It is beautiful t he final snuffing of the lie s of Man ALL IS LOST ALL I S LOST the pony he comes he comes he comes the ich or permeates all MY FACE MY FACE ᵒh god no NO NOOO O NΘ stop the an * gles are n ot real ZALGΌ IS TOƝȳ THE PO NY HE COMES

hnfong 10 months ago

A nit: it's not UTF-8 or "multibyte" characters that are the main problem. The UTF-8 issue can be trivially resolved by decoding the bytes into Unicode code points. As long as you're fine with the truncated length not always corresponding to what you'd expect for Latin-based alphabets, it should be fine. (FWIW, if you are concerned with the displayed length, you'd need a font and a text layout engine to calculate the width of the rendered text.)
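To make the bytes vs. code points distinction concrete, here is a minimal Python sketch. The sample string and the helper name `truncate_code_points` are made up for illustration; the only assumption is that the input arrives as UTF-8 bytes.

```python
# Sample text: "café 日本", written with escapes so the code points are unambiguous.
text = "caf\u00e9 \u65e5\u672c"
utf8_bytes = text.encode("utf-8")

byte_len = len(utf8_bytes)    # length in UTF-8 bytes
code_point_len = len(text)    # Python str length counts code points


def truncate_code_points(data: bytes, limit: int) -> str:
    """Decode UTF-8 first, then cut on code point boundaries (hypothetical helper)."""
    return data.decode("utf-8")[:limit]


# Cutting the raw bytes can land in the middle of a multibyte sequence.
try:
    utf8_bytes[:4].decode("utf-8")
    byte_cut_ok = True
except UnicodeDecodeError:
    byte_cut_ok = False

safe = truncate_code_points(utf8_bytes, 4)
```

Here the byte-level cut splits the two-byte encoding of "é" and fails to decode, while the code point cut keeps "café" intact. This only fixes the encoding problem; it says nothing about combining sequences.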

The main issue with naïve truncation is that not every code point is a character (and I guess not every character is a glyph?). If you truncate the Unicode code point array at some unfortunate place, such as in the middle of a sequence like https://en.wikipedia.org/wiki/Ideographic_Description_Charac... , you'd just get gibberish or potentially very unintended results (especially if you joined the truncated string with some other string).
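A rough Python sketch of the code point problem, using a combining accent as the simplest case. Both function names are hypothetical, and the second one is only a partial mitigation: it backs the cut up past combining marks using `unicodedata.combining`, which does not amount to full grapheme cluster segmentation and will not handle things like ZWJ emoji sequences or ideographic description sequences.

```python
import unicodedata


def truncate_naive(text: str, limit: int) -> str:
    """Cut at a fixed code point count, wherever that happens to land."""
    return text[:limit]


def truncate_keep_marks(text: str, limit: int) -> str:
    """Cut at or before `limit` without separating a base character
    from the combining marks that follow it (illustrative only)."""
    if len(text) <= limit:
        return text
    end = limit
    # If the first dropped code point is a combining mark, the cut would
    # split a sequence; move the cut back to before its base character.
    while end > 0 and unicodedata.combining(text[end]) != 0:
        end -= 1
    return text[:end]


# "n", "e", COMBINING ACUTE ACCENT, "e": four code points, three visible letters.
sample = "ne\u0301e"

naive = truncate_naive(sample, 2)        # keeps the "e" but drops its accent
careful = truncate_keep_marks(sample, 2) # drops the whole accented letter
```

With the naive cut, the dangling accent would also attach to whatever string you concatenate after the dropped part's neighbour, which is the "joined with some other string" hazard. The second version returns a shorter but self-consistent string.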