danhau 7 days ago

Are you referring to Unicode? Because UTF-8 is simple and relatively straightforward to parse.

Unicode definitely has its faults, but on the whole it's great. I'll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.

Needless to say, Unicode is not a good fit for every scenario.

xg15 7 days ago | parent | next [-]

I think GP is really talking about extended grapheme clusters (at least the mention of invisible glyph injection makes me think that).

Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with.

E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.

So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.
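
To make the gap between code points and what a reader sees concrete, here is a minimal C sketch. The names `count_code_points` and `family_emoji` are made up for illustration. It only counts code points in a ZWJ emoji sequence by skipping continuation bytes; it does not segment grapheme clusters, which is the part that needs the character database.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Counts code points in well-formed UTF-8 by counting every byte that is
   not a continuation byte (10xxxxxx). Does not validate the input. */
static size_t count_code_points(const char *text) {
    assert(text != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)text; *p != 0; ++p) {
        if ((*p & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* "Family: man, woman, girl" is a single extended grapheme cluster built from
   U+1F468 ZWJ U+1F469 ZWJ U+1F467, spelled out here as UTF-8 bytes. */
static const char family_emoji[] =
    "\xF0\x9F\x91\xA8" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA9" "\xE2\x80\x8D"
    "\xF0\x9F\x91\xA7";
```

For this sequence, `count_code_points(family_emoji)` gives 5 and `strlen(family_emoji)` gives 18, while a renderer that supports the sequence shows one glyph.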

spyrja 7 days ago | parent | prev [-]

Just as an example of what I am talking about, this is my current UTF-8 parser which I have been using for a few years now.

  bool utf_append_plaintext(utf* result, const char* text) {
  #define msk(byte, mask, value) (((byte) & (mask)) == (value))
  #define cnt(byte) msk(byte, 0xc0, 0x80)
  #define shf(byte, mask, amount) (((byte) & (mask)) << (amount))
    utf_clear(result);
    if (text == NULL)
      return false;
    size_t siz = strlen(text);
    uint8_t* nxt = (uint8_t*)text;
    uint8_t* end = nxt + siz;
    if ((siz >= 3) && (nxt[0] == 0xef) && (nxt[1] == 0xbb) && (nxt[2] == 0xbf))
      nxt += 3;
    while (nxt < end) {
      bool aok = false;
      uint32_t cod = 0;
      uint8_t fir = nxt[0];
      if (msk(fir, 0x80, 0)) {
        cod = fir;
        nxt += 1;
        aok = true;
      } else if ((nxt + 1) < end) {
        uint8_t sec = nxt[1];
        if (msk(fir, 0xe0, 0xc0)) {
          if (cnt(sec)) {
            cod |= shf(fir, 0x1f, 6);
            cod |= shf(sec, 0x3f, 0);
            nxt += 2;
            aok = true;
          }
        } else if ((nxt + 2) < end) {
          uint8_t thi = nxt[2];
          if (msk(fir, 0xf0, 0xe0)) {
            if (cnt(sec) && cnt(thi)) {
              cod |= shf(fir, 0x0f, 12);
              cod |= shf(sec, 0x3f, 6);
              cod |= shf(thi, 0x3f, 0);
              nxt += 3;
              aok = true;
            }
          } else if ((nxt + 3) < end) {
            uint8_t fou = nxt[3];
            if (msk(fir, 0xf8, 0xf0)) {
              if (cnt(sec) && cnt(thi) && cnt(fou)) {
                cod |= shf(fir, 0x07, 18);
                cod |= shf(sec, 0x3f, 12);
                cod |= shf(thi, 0x3f, 6);
                cod |= shf(fou, 0x3f, 0);
                nxt += 4;
                aok = true;
              }
            }
          }
        }
      }
      if (aok)
        utf_push(result, cod);
      else
        return false;
    }
    return true;
  #undef cnt
  #undef msk
  #undef shf
  }

Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.

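
For what it's worth, the bit-assembly part above follows the byte layout in RFC 3629. What a decoder of this shape still lets through, unless `utf_push` checks for it, is overlong encodings (such as C0 80 for U+0000), UTF-16 surrogates (U+D800 to U+DFFF), and values above U+10FFFF. A minimal sketch of those three checks, applied after the bits are assembled; `utf8_scalar_ok` and its parameters are hypothetical names, not part of the parser above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: cod is the assembled code point, len the number of
   bytes (1..4) it was decoded from. */
static bool utf8_scalar_ok(uint32_t cod, int len) {
    /* Smallest code point that genuinely needs a sequence of each length. */
    static const uint32_t min_for_len[5] = {0, 0x0, 0x80, 0x800, 0x10000};
    assert(len >= 1 && len <= 4);
    if (cod < min_for_len[len])
        return false; /* overlong, e.g. C0 80 for U+0000 */
    if (cod >= 0xD800 && cod <= 0xDFFF)
        return false; /* UTF-16 surrogate, not a scalar value */
    if (cod > 0x10FFFF)
        return false; /* beyond the Unicode range */
    return true;
}
```

In a parser like the one above, this would be called just before pushing the code point, with the byte count of the branch that produced it.
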
simonask 7 days ago | parent | next [-]

That's a reasonable implementation in my opinion. It's not that complicated. You're also apparently insisting on three-letter variable names, and are using a very primitive language to boot, so I don't think you're setting yourself up for "maintainability" here.

Here's the implementation in the Rust standard library: https://doc.rust-lang.org/stable/src/core/str/validations.rs...

It even includes an optimized fast path for ASCII, and it works at compile-time as well.
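
The fast-path idea carries over to C. A rough sketch, not a port of the Rust code: `ascii_prefix_len` is a hypothetical helper that skips eight bytes at a time while no byte has its top bit set, then falls back to single bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the length of the leading all-ASCII run in buf[0..len). */
static size_t ascii_prefix_len(const unsigned char *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t i = 0;
    /* Word at a time: any byte >= 0x80 has its top bit set. */
    while (len - i >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, buf + i, sizeof word); /* sidesteps unaligned loads */
        if (word & UINT64_C(0x8080808080808080))
            break;
        i += sizeof word;
    }
    /* Finish, or locate the first non-ASCII byte, one byte at a time. */
    while (i < len && buf[i] < 0x80)
        ++i;
    return i;
}
```

A decoder can call this first and only enter the multi-byte logic from the returned offset onward.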

spyrja 7 days ago | parent | next [-]

Well, it is a pretty old codebase, and the whole project is written in C. I haven't done any Rust programming yet, but it does seem like a good choice for modern programs. I'll check out the link and see if I can glean any insights into what needs to be done to fix my ancient parser. Thanks!

koakuma-chan 6 days ago | parent | prev [-]

> You're also apparently insisting on three-letter variable names

Why are the arguments not three-letter though? I would feel terrible if that was my code.

spyrja 6 days ago | parent [-]

It's just a convention I use for personal projects. Back when I started coding in C, people often just opted to go with one- or two-character variable names. I chose three for locally scoped variables because it was usually enough to identify them in a recognizable fashion. The fixed-width nature of it all also made for less eye-clutter. As for function arguments, spelling them out fully made them easier to use for API reference purposes. At the end of the day, all that really matters is that you choose a convention and stick with it. For team projects, conventions should be laid out early on and, as long as everyone follows them, the entire project will have a much better sense of consistency.

koakuma-chan 6 days ago | parent [-]

Oh. The thing is, I really like my code formatted symmetrically, aligned evenly, etc. I go as far as adding empty comments to prevent the formatter from removing my custom line breaks. I thought you were the same, in a way ;)

e.g., https://github.com/mayo-dayo/app/blob/0.4/src/middleware.ts

simonask 6 days ago | parent [-]

For God's sake, what's wrong with you. :-)

Just set your editor's line-height.

danhau 2 days ago | parent | prev [-]

I don't know what your code is doing exactly. For comparison, here's my utf8 decoder (for a single codepoint):

    static UnicodeCodepoint utf8_decode(u8 const bytes[static 4], u8 *out_num_consumed) {
        u8 const flipped = ~bytes[0];
        if (flipped == 0) {
            // Because __builtin_clz is UB for value 0.
            // When this happens, the UTF-8 is malformed.
            *out_num_consumed = 1;
            return 0;
        }
        
        u8 const num_ones = __builtin_clz(flipped) & 0x07;
        u8 const num_bytes_total = num_ones > 1 ? num_ones : 1;
        u8 const main_byte_shift = num_ones + 1;
        UnicodeCodepoint value = bytes[0] & (0xFF >> main_byte_shift);
        
        for (u8 i = 1; i < num_bytes_total; ++i) {
            if (bytes[i] >> 6 != 2) {
                // Not a valid continuation byte.
                *out_num_consumed = i;
                return 0;
            }
            
            value = (value << 6) | (bytes[i] & 0x3F);
        }

        *out_num_consumed = num_bytes_total;
        return value;
    }
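
One thing to watch with the leading-ones approach: a first byte of 0xF8 to 0xFE has five or more leading ones, so `num_bytes_total` becomes 5 to 7 and the loop can read past the four bytes the signature promises. A lone continuation byte (one leading one) is also returned as a value rather than flagged. A small sketch of classifying the lead byte up front, without `__builtin_clz`; `utf8_lead_length` and `utf8_checked_length` are hypothetical names, not part of the decoder above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the sequence length (1..4) announced by a lead byte, or 0 if the
   byte can never start a well-formed UTF-8 sequence. */
static int utf8_lead_length(uint8_t lead) {
    if (lead < 0x80) return 1; /* 0xxxxxxx: ASCII */
    if (lead < 0xC2) return 0; /* continuation byte, or overlong lead C0/C1 */
    if (lead < 0xE0) return 2; /* 110xxxxx */
    if (lead < 0xF0) return 3; /* 1110xxxx */
    if (lead < 0xF5) return 4; /* 11110xxx, up to F4 */
    return 0;                  /* F5..FF never appear in UTF-8 */
}

/* Returns the length of the sequence at bytes[0] if the lead byte is valid,
   enough input is available, and all trailing bytes are continuation bytes;
   otherwise 0. Range checks on the decoded value are left to the caller. */
static int utf8_checked_length(const uint8_t *bytes, size_t avail) {
    assert(bytes != NULL && avail > 0);
    int len = utf8_lead_length(bytes[0]);
    if (len == 0 || (size_t)len > avail)
        return 0;
    for (int i = 1; i < len; ++i) {
        if ((bytes[i] & 0xC0) != 0x80)
            return 0;
    }
    return len;
}
```

Calling something like `utf8_checked_length` before decoding keeps every read inside the available input, at the cost of looking at the bytes twice.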