spyrja 7 days ago

I really hate to rant on about this. But the gymnastics required to parse UTF-8 correctly are truly insane. Besides that, we now see issues such as invisible-glyph injection attacks and the like cropping up all over the place due to this crappy so-called "standard". Maybe we should just go back to the simplicity of ASCII until we can come up with something better?

danhau 7 days ago | parent | next [-]

Are you referring to Unicode? Because UTF-8 is simple and relatively straightforward to parse.

Unicode definitely has its faults, but on the whole it's great. I'll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.

Needless to say, Unicode is not a good fit for every scenario.

xg15 7 days ago | parent | next [-]

I think GP is really talking about extended grapheme clusters (at least the mention of invisible glyph injection makes me think that).

Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with.

E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.

So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.
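
The family emoji is the classic example: one user-perceived character built from five code points (18 bytes of UTF-8), three emoji glued together by two invisible zero-width joiners. A minimal C snippet just listing the scalar values involved:

    #include <stdio.h>

    int main(void) {
      // U+1F468 MAN + U+200D ZWJ + U+1F469 WOMAN + U+200D ZWJ + U+1F467 GIRL:
      // a single grapheme cluster that a correct segmenter must not split.
      const unsigned family[] = {0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F467};
      for (unsigned i = 0; i < sizeof family / sizeof family[0]; ++i)
        printf("U+%04X\n", family[i]);
      return 0;
    }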

spyrja 7 days ago | parent | prev [-]

Just as an example of what I am talking about, this is my current UTF-8 parser which I have been using for a few years now.

  bool utf_append_plaintext(utf* result, const char* text) {
  #define msk(byte, mask, value) ((byte & mask) == value)
  #define cnt(byte) msk(byte, 0xc0, 0x80)
  #define shf(byte, mask, amount) ((byte & mask) << amount)
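  // msk: does the byte match the value under the mask?
  // cnt: is the byte a continuation byte (10xxxxxx)?
  // shf: mask out the payload bits and shift them into position.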
    utf_clear(result);
    if (text == NULL)
      return false;
    size_t siz = strlen(text);
    uint8_t* nxt = (uint8_t*)text;
    uint8_t* end = nxt + siz;
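    // Skip a leading UTF-8 byte order mark (EF BB BF) if present.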
    if ((siz >= 3) && (nxt[0] == 0xef) && (nxt[1] == 0xbb) && (nxt[2] == 0xbf))
      nxt += 3;
    while (nxt < end) {
      bool aok = false;
      uint32_t cod = 0;
      uint8_t fir = nxt[0];
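      // Dispatch on the lead byte: 0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx.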
      if (msk(fir, 0x80, 0)) {
        cod = fir;
        nxt += 1;
        aok = true;
      } else if ((nxt + 1) < end) {
        uint8_t sec = nxt[1];
        if (msk(fir, 0xe0, 0xc0)) {
          if (cnt(sec)) {
            cod |= shf(fir, 0x1f, 6);
            cod |= shf(sec, 0x3f, 0);
            nxt += 2;
            aok = true;
          }
        } else if ((nxt + 2) < end) {
          uint8_t thi = nxt[2];
          if (msk(fir, 0xf0, 0xe0)) {
            if (cnt(sec) && cnt(thi)) {
              cod |= shf(fir, 0x0f, 12);
              cod |= shf(sec, 0x3f, 6);
              cod |= shf(thi, 0x3f, 0);
              nxt += 3;
              aok = true;
            }
          } else if ((nxt + 3) < end) {
            uint8_t fou = nxt[3];
            if (msk(fir, 0xf8, 0xf0)) {
              if (cnt(sec) && cnt(thi) && cnt(fou)) {
                cod |= shf(fir, 0x07, 18);
                cod |= shf(sec, 0x3f, 12);
                cod |= shf(thi, 0x3f, 6);
                cod |= shf(fou, 0x3f, 0);
                nxt += 4;
                aok = true;
              }
            }
          }
        }
      }
      if (aok)
        utf_push(result, cod);
      else
        return false;
    }
    return true;
  #undef cnt
  #undef msk
  #undef shf
  }
Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.
simonask 7 days ago | parent | next [-]

That's a reasonable implementation in my opinion. It's not that complicated. You're also apparently insisting on three-letter variable names, and are using a very primitive language to boot, so I don't think you're setting yourself up for "maintainability" here.

Here's the implementation in the Rust standard library: https://doc.rust-lang.org/stable/src/core/str/validations.rs...

It even includes an optimized fast path for ASCII, and it works at compile-time as well.
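
The fast path boils down to: while a whole machine word of input has no high bits set, it is pure ASCII and can be skipped without decoding anything. A rough C sketch of the same idea (not the actual Rust code):

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    // Returns the length of the leading all-ASCII prefix of s.
    static size_t skip_ascii(const uint8_t *s, size_t len) {
      size_t i = 0;
      while (i + sizeof(uint64_t) <= len) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);      // unaligned-safe word load
        if (w & 0x8080808080808080ull)    // any byte with the high bit set?
          break;                          // yes: fall back to real decoding
        i += sizeof w;
      }
      while (i < len && s[i] < 0x80)      // finish the tail byte by byte
        i++;
      return i;
    }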

spyrja 7 days ago | parent | next [-]

Well, it is a pretty old codebase; the whole project is written in C. I haven't done any Rust programming yet, but it does seem like a good choice for modern programs. I'll check out the link and see if I can glean any insights into what needs to be done to fix my ancient parser. Thanks!

koakuma-chan 6 days ago | parent | prev [-]

> You're also apparently insisting on three-letter variable names

Why are the arguments not three-letter though? I would feel terrible if that was my code.

spyrja 6 days ago | parent [-]

It's just a convention I use for personal projects. Back when I started coding in C, people often just opted for one- or two-character variable names. I chose three for locally-scoped variables because it was usually enough to identify them in a recognizable fashion. The fixed-width nature of it all also made for less eye-clutter. As for function arguments, the fact that they are fully spelled out makes it easier for API reference purposes. At the end of the day, all that really matters is that you choose a convention and stick with it. For team projects, the conventions should be laid out early on; as long as everyone follows them, the entire project will have a much better sense of consistency.

koakuma-chan 6 days ago | parent [-]

Oh. The thing is, I really like my code formatted symmetrically, aligned evenly, etc. I go as far as adding empty comments to prevent the formatter from removing my custom line breaks. I thought you were the same, in a way ;)

e.g., https://github.com/mayo-dayo/app/blob/0.4/src/middleware.ts

simonask 6 days ago | parent [-]

For God's sake, what's wrong with you. :-)

Just set your editor's line-height.

danhau 2 days ago | parent | prev [-]

I don't know what your code is doing exactly. For comparison, here's my utf8 decoder (for a single codepoint):

    static UnicodeCodepoint utf8_decode(u8 const bytes[static 4], u8 *out_num_consumed) {
        u8 const flipped = ~bytes[0];
        if (flipped == 0) {
            // Because __builtin_clz is UB for value 0.
            // When this happens, the UTF-8 is malformed.
            *out_num_consumed = 1;
            return 0;
        }
        
        u8 const num_ones = __builtin_clz(flipped) & 0x07;
        if (num_ones == 1 || num_ones > 4) {
            // A lone continuation byte, or a lead byte (0xF8..0xFE) that
            // would claim more than 4 bytes and read past the buffer.
            *out_num_consumed = 1;
            return 0;
        }
        u8 const num_bytes_total = num_ones > 1 ? num_ones : 1;
        u8 const main_byte_shift = num_ones + 1;
        UnicodeCodepoint value = bytes[0] & (0xFF >> main_byte_shift);
        
        for (u8 i = 1; i < num_bytes_total; ++i) {
            if (bytes[i] >> 6 != 2) {
                // Not a valid continuation byte.
                *out_num_consumed = i;
                return 0;
            }
            
            value = (value << 6) | (bytes[i] & 0x3F);
        }

        *out_num_consumed = num_bytes_total;
        return value;
    }
guappa 7 days ago | parent | prev | next [-]

Sure, I'll just write my own language all weird and look like an illiterate so that you are not inconvenienced.

kalleboo 7 days ago | parent | prev | next [-]

I think what you meant is we should all go back to the simplicity of Shift-JIS.

eru 7 days ago | parent | prev | next [-]

You could use a standard that always uses e.g. 4 bytes per character, which is much easier to parse than UTF-8.

UTF-8 is so complicated, because it wants to be backwards compatible with ASCII.

account42 7 days ago | parent | next [-]

ASCII compatibility isn't the only advantage of UTF-8 over UCS-4. It also

- requires less memory for most strings, particularly ones that are largely limited to ASCII, like structured text-based formats often are.

- doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.

- doesn't need to care about alignment: If you jump to a random memory position you can find the next and previous UTF-8 characters. This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.
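
That last point (self-synchronization) is easy to demonstrate: continuation bytes always match the bit pattern 10xxxxxx, so from an arbitrary offset you can simply back up over them to the start of the current code point. A sketch:

    #include <stdint.h>
    #include <stddef.h>

    // Given any byte offset into valid UTF-8, return the offset of the
    // lead byte of the code point that position falls inside.
    static size_t utf8_sync_back(const uint8_t *s, size_t pos) {
      while (pos > 0 && (s[pos] & 0xc0) == 0x80)  // 10xxxxxx: continuation
        pos--;
      return pos;
    }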

degamad 7 days ago | parent | prev | next [-]

It's not just the variable byte length that causes an issue; in some ways that's the easiest part of the problem. You also have to deal with code points that modify other code points, rather than being characters themselves. That's a huge part of the problem.

amake 7 days ago | parent | next [-]

That has nothing to do with UTF-8; that's a Unicode issue, and one that's entirely inescapable if you are the Unicode Consortium and your goal is to be compatible with all legacy charsets.

degamad 6 days ago | parent [-]

Yep, that's the point I was making - that choosing fixed 4-byte code points doesn't significantly reduce the complexity of capturing everything that Unicode does.

eru 4 days ago | parent [-]

Thanks for explaining!

bawolff 7 days ago | parent | prev [-]

That goes all the way back to the beginning.

Even ASCII used to use "overstriking", where the backspace character was treated as a joiner character to put accents above letters.

degamad 6 days ago | parent [-]

Agreed, we just conveniently forget about those when speaking about how complex Unicode is.

spyrja 7 days ago | parent | prev [-]

True. But then again, backward compatibility isn't really so hard to achieve with ASCII, because the MSB is always zero. The problem, I think, is that the original motivation which ultimately led to the complications we now see with UTF-8 was a desire to save a few bits here and there, rather than to create a straightforward standard that was easy to parse. I am actually staring at 60+ lines of fairly pristine code I wrote a few years back that ostensibly passed all tests, only to find out that in fact it does not cover all corner cases. (Could have sworn I read the spec correctly, but apparently not!)

eru 7 days ago | parent [-]

You might like https://fsharpforfunandprofit.com/series/property-based-test...

spyrja 6 days ago | parent [-]

In this particular case it was simply a matter of not having enough corner cases defined. I was, however, using property-based testing, doing things like reversing then un-reversing the UTF-8 strings, re-ordering code points, merging strings, etc. for verification. The datasets were in a variety of languages (including emojis), and so I mistakenly thought I had covered all the bases.

But thank you for the link, it's turning out to be a very enjoyable read! There already seem to be a few things I could do better thanks to the article, besides the fact that it codifies a lot of interesting approaches one can take to improve testing in general.
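
For anyone curious, the core round-trip property is tiny; the trap, which I fell into, is that it can pass with flying colors while the decoder still accepts invalid input. A self-contained sketch (enc/dec here are illustrative stand-ins for whatever encoder/decoder pair is under test):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Illustrative encoder standing in for the code under test.
    static int enc(uint32_t c, uint8_t out[4]) {
      if (c < 0x80)  { out[0] = (uint8_t)c; return 1; }
      if (c < 0x800) { out[0] = 0xc0 | (c >> 6);
                       out[1] = 0x80 | (c & 0x3f); return 2; }
      if (c < 0x10000) {
        out[0] = 0xe0 | (c >> 12); out[1] = 0x80 | ((c >> 6) & 0x3f);
        out[2] = 0x80 | (c & 0x3f); return 3;
      }
      out[0] = 0xf0 | (c >> 18); out[1] = 0x80 | ((c >> 12) & 0x3f);
      out[2] = 0x80 | ((c >> 6) & 0x3f); out[3] = 0x80 | (c & 0x3f);
      return 4;
    }

    // Illustrative decoder (the length is taken on trust from the encoder).
    static uint32_t dec(const uint8_t *b, int len) {
      static const uint8_t mask[5] = {0, 0x7f, 0x1f, 0x0f, 0x07};
      uint32_t c = b[0] & mask[len];
      for (int i = 1; i < len; ++i)
        c = (c << 6) | (b[i] & 0x3f);
      return c;
    }

    int main(void) {
      srand(1234);  // fixed seed, so any failure is reproducible
      for (long i = 0; i < 1000000; ++i) {
        uint32_t c;
        do  // random Unicode scalar value, skipping the surrogate range
          c = ((uint32_t)rand() << 16 ^ (uint32_t)rand()) % 0x110000;
        while (c >= 0xd800 && c <= 0xdfff);
        uint8_t buf[4];
        int len = enc(c, buf);
        if (dec(buf, len) != c) {
          printf("round trip failed at U+%04X\n", (unsigned)c);
          return 1;
        }
      }
      printf("all round trips passed\n");
      return 0;
    }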

eru 4 days ago | parent [-]

Nice! Glad to be of service.

Python, of all languages, probably has the best property-based testing library out there with Hypothesis. I sometimes even use it to drive tests for my Haskell and OCaml and Rust code. The author of Hypothesis wrote a few nice articles about why his approach is better (and I agree); however, I can't find them at the moment.

Ekaros 7 days ago | parent | prev [-]

Should have just gone with 32-bit characters and no combinations. Utter simplicity.

guappa 7 days ago | parent | next [-]

That would be extremely wasteful: every single text file would be 4x larger, and I'm sure eventually it would not be enough anyway.

Ekaros 7 days ago | parent [-]

Maybe we should have just replaced ASCII, a horrible encoding where an entire 25% of it is wasted. And maybe we could have gotten a bit more efficiency by having just one case instead of both lower- and uppercase letters, with a modifier before a letter to change its case. That would save a lot of space, as most text could just be lowercase.

guappa 7 days ago | parent [-]

Yeah, that's how ASCII works… there's 1 bit for lower/upper case.

bawolff 7 days ago | parent | prev | next [-]

I think combining characters are a lot simpler than having every single combination ever.

Especially when you start getting into non-Latin-based languages.

amake 7 days ago | parent | prev [-]

What does "no combinations" mean?

Ekaros 7 days ago | parent [-]

Like, say, Ä: it might be either a single precomposed code point, or a combination of ¨ and A. Both are now supported, but if you can have more than two such things going into one character it makes a mess.
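
Concretely, both of these byte sequences render as the same Ä (a tiny C example):

    #include <stdio.h>

    int main(void) {
      // Precomposed: U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS.
      const char precomposed[] = "\xc3\x84";
      // Decomposed: U+0041 'A' followed by U+0308 COMBINING DIAERESIS.
      const char decomposed[] = "A\xcc\x88";
      printf("%s vs %s\n", precomposed, decomposed);  // same glyph on screen
      return 0;
    }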

amake 7 days ago | parent | next [-]

That's fundamental to the mission of Unicode because Unicode is meant to be compatible with all legacy character sets, and those character sets already included combining characters.

So "no combinations" was never going to happen.

int_19h 6 days ago | parent | prev [-]

That quickly explodes if you need more than one diacritic per letter (e.g. Vietnamese often has two, and then there's https://en.wikipedia.org/wiki/International_Phonetic_Alphabe...).