RustyRussell 4 days ago

Did anyone else find the use of ABNF annoying?

  unicode-assignable =
   %x9 / %xA / %xD /               ; useful controls
   %x20-7E /                       ; exclude C1 controls and DEL
   %xA0-D7FF /                     ; exclude surrogates
   %xE000-FDCF /                   ; exclude FDD0 nonchars
   %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
   %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
   %x30000-3FFFD / %x40000-4FFFD /
   %x50000-5FFFD / %x60000-6FFFD /
   %x70000-7FFFD / %x80000-8FFFD /
   %x90000-9FFFD / %xA0000-AFFFD /
   %xB0000-BFFFD / %xC0000-CFFFD /
   %xD0000-DFFFD / %xE0000-EFFFD /
   %xF0000-FFFFD / %x100000-10FFFD
I mean, just define ranges.

Also, where are the test vectors? Because when I implement this, that's the first thing I have to write, and you could save me a lot of work here. Bonus points if it's in JSON and UTF-8 already, though the invalid UTF-8 in an RFC might really gum things up: hex encode maybe?

timbray 4 days ago | parent [-]

The tests for the Go code at https://github.com/timbray/RFC9839 are in effect test vectors.

RustyRussell 3 days ago | parent [-]

I want to implement this. My code is in C.

How does this help me check my implementation? I guess I could ask ChatGPT to convert your tests to my code, but that seems the long way around.

djoldman 3 days ago | parent [-]

https://github.com/timbray/RFC9839/blob/main/unichars.go

I don't know Go at all but I can pretty quickly understand:

    var unicodeAssignables = []runePair{
     {0x20, 0x7E},       // ASCII
     {0xA, 0xA},         // newline
     {0xA0, 0xD7FF},     // most of the BMP
     {0xE000, 0xFDCF},   // BMP after surrogates
     {0xFDF0, 0xFFFD},   // BMP after noncharacters block
     {0x9, 0x9},         // Tab
     {0xD, 0xD},         // CR
     {0x10000, 0x1FFFD}, // astral planes from here down
     {0x20000, 0x2FFFD},
     {0x30000, 0x3FFFD},
     {0x40000, 0x4FFFD},
     {0x50000, 0x5FFFD},
     {0x60000, 0x6FFFD},
     {0x70000, 0x7FFFD},
     {0x80000, 0x8FFFD},
     {0x90000, 0x9FFFD},
     {0xA0000, 0xAFFFD},
     {0xB0000, 0xBFFFD},
     {0xC0000, 0xCFFFD},
     {0xD0000, 0xDFFFD},
     {0xE0000, 0xEFFFD},
     {0xF0000, 0xFFFFD},
     {0x100000, 0x10FFFD},
    }
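
For anyone porting this, here is a rough sketch of how a table like that could drive a membership check and give a few boundary probes to compare against another implementation. It is not the code from the repo: only the `runePair` name comes from the snippet above, while the field names `lo` and `hi` and the functions `buildAssignables`, `isAssignable`, and `describe` are made up for illustration. The ranges are taken from the ABNF quoted earlier in the thread.

```go
package main

import "fmt"

// runePair is an inclusive code point range. The type name comes from the
// snippet above; the field names lo and hi are assumptions for this sketch.
type runePair struct {
	lo, hi rune
}

// buildAssignables returns the unicode-assignable ranges from the ABNF.
// Planes 1 through 16 all have the same shape, so a loop generates them.
func buildAssignables() []runePair {
	ranges := []runePair{
		{0x9, 0x9},       // tab
		{0xA, 0xA},       // newline
		{0xD, 0xD},       // carriage return
		{0x20, 0x7E},     // printable ASCII
		{0xA0, 0xD7FF},   // BMP up to the surrogates
		{0xE000, 0xFDCF}, // BMP after the surrogates
		{0xFDF0, 0xFFFD}, // BMP after the FDD0 noncharacters
	}
	for plane := rune(1); plane <= 0x10; plane++ {
		base := plane << 16
		ranges = append(ranges, runePair{base, base | 0xFFFD})
	}
	return ranges
}

var assignables = buildAssignables()

// isAssignable reports whether r falls inside any of the ranges.
// A linear scan is enough for a table of 23 entries.
func isAssignable(r rune) bool {
	for _, p := range assignables {
		if r >= p.lo && r <= p.hi {
			return true
		}
	}
	return false
}

// describe formats one probe as hex so it can be diffed against the
// output of an implementation in another language.
func describe(r rune) string {
	return fmt.Sprintf("U+%04X assignable=%v", r, isAssignable(r))
}

func main() {
	// Probe both sides of several range boundaries.
	probes := []rune{0x7E, 0x7F, 0xD7FF, 0xD800, 0xFDCF, 0xFDD0,
		0xFFFD, 0xFFFE, 0x10FFFD, 0x10FFFE}
	for _, r := range probes {
		fmt.Println(describe(r))
	}
}
```

Probing the code point on each side of every range edge is probably the cheapest way to get something like test vectors out of this, and the same loop-plus-table shape should carry over to C with little change.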