| ▲ | zombot 2 days ago | ||||||||||||||||
So many parser combinators operate on bytes assuming ASCII input only. I'd be more interested in a parser combinator lib that has UTF-8 decoding already abstracted away, operating on `wchar_t`, or even polymorphic input stream element types. | |||||||||||||||||
| ▲ | lokeg 2 days ago | parent | next [-] | ||||||||||||||||
Isn't working with the UTF-8 stream sufficient? Especially if you only have ASCII keywords/operators/brackets, I feel an ASCII parser should work with UTF-8 out of the box. | |||||||||||||||||
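A minimal sketch of why this tends to work: in UTF-8, every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter byte cannot occur inside an encoded non-ASCII character. The `split_on_ascii` helper is hypothetical, made up for illustration only.

```python
from typing import List


def split_on_ascii(data: bytes, delim: bytes) -> List[bytes]:
    """Split raw UTF-8 bytes on one ASCII delimiter byte, without decoding."""
    if len(delim) != 1 or delim[0] >= 0x80:
        raise ValueError("delimiter must be a single ASCII byte")
    # Safe because bytes below 0x80 never appear inside a multi-byte sequence.
    return data.split(delim)


raw = "naïve,日本語,ok".encode("utf-8")
fields = split_on_ascii(raw, b",")
# Each field is still valid UTF-8 and can be decoded later, only if needed.
decoded = [f.decode("utf-8") for f in fields]
```

Decoding here is only done to show that the fields come out intact; the splitting itself never looks at code points.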
| ▲ | Joker_vD 2 days ago | parent | prev | next [-] | ||||||||||||||||
I'd rather not. Most of the time, you don't need it, and when you do, it's for a very small part of the input. And `wchar_t` is an abomination (it's UTF-32 on Linux and UTF-16 on Windows, and both are allowed); you probably really want `char32_t`, and again, not for the whole of the input. Streaming such data a single rune/codepoint at a time is probably fine as well for most uses. On the other hand, if your parser combinators process char-by-char, then maintaining a small "is this valid UTF-8 so far" context on the side should be pretty simple, so providing it would be a useful option. But actually decoding? Please don't. | |||||||||||||||||
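A rough sketch of what such a side context could look like, assuming bytes are fed one at a time. The `Utf8Check` class and `is_valid_utf8` function are hypothetical names; the byte ranges follow the well-formed UTF-8 sequences (no overlong forms, no surrogates, nothing above U+10FFFF). It validates only and never builds a code point.

```python
class Utf8Check:
    """Tracks whether the bytes fed so far form a valid UTF-8 prefix."""

    def __init__(self) -> None:
        self.need = 0                   # continuation bytes still expected
        self.lo, self.hi = 0x80, 0xBF   # allowed range for the next continuation byte
        self.ok = True

    def feed(self, b: int) -> bool:
        if not self.ok:
            return False
        if self.need:
            if self.lo <= b <= self.hi:
                self.need -= 1
                self.lo, self.hi = 0x80, 0xBF  # later continuations use the full range
            else:
                self.ok = False
        elif b < 0x80:
            pass                        # plain ASCII
        elif 0xC2 <= b <= 0xDF:         # 2-byte lead (0xC0/0xC1 would be overlong)
            self.need = 1
        elif 0xE0 <= b <= 0xEF:         # 3-byte lead
            self.need = 2
            if b == 0xE0:
                self.lo = 0xA0          # reject overlong encodings
            elif b == 0xED:
                self.hi = 0x9F          # reject UTF-16 surrogates
        elif 0xF0 <= b <= 0xF4:         # 4-byte lead
            self.need = 3
            if b == 0xF0:
                self.lo = 0x90          # reject overlong encodings
            elif b == 0xF4:
                self.hi = 0x8F          # reject values above U+10FFFF
        else:
            self.ok = False
        return self.ok

    def complete(self) -> bool:
        """True if the input so far is valid and does not end mid-sequence."""
        return self.ok and self.need == 0


def is_valid_utf8(data: bytes) -> bool:
    check = Utf8Check()
    for b in data:
        if not check.feed(b):
            return False
    return check.complete()
```

The whole state is three small integers and a flag, so a byte-oriented parser could carry it alongside its position without much cost.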
| ▲ | eska 2 days ago | parent | prev | next [-] | ||||||||||||||||
I’d still use a byte slice for that. Some formats may mix encodings, or have a text header and binary payload. For those cases one would need to use memchr for the first byte, then compare the remaining few bytes. So I don’t think it would have a huge performance impact. | |||||||||||||||||
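Something like the following, as a sketch: scan for the first byte of the marker, then compare the rest in place. `find_marker` is a hypothetical helper; `bytes.find` with a one-byte needle stands in for `memchr` here, and the header/payload layout is made up for the example.

```python
def find_marker(data: bytes, marker: bytes, start: int = 0) -> int:
    """Return the index of `marker` in `data` at or after `start`, or -1."""
    first = marker[:1]
    i = data.find(first, start)          # single-byte scan, memchr-like
    while i != -1:
        if data.startswith(marker, i):   # compare the remaining few bytes
            return i
        i = data.find(first, i + 1)
    return -1


# Text header (UTF-8) followed by a binary payload, separated by a blank line.
header = "Name: café\r\n\r\n".encode("utf-8")
payload = b"\x00\xff\r\x01"
data = header + payload

idx = find_marker(data, b"\r\n\r\n")
body = data[idx + 4:]                    # bytes after the separator, untouched
```

Nothing here decodes the header, so the binary payload never has to pass through a text layer.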
| ▲ | RossBencina 2 days ago | parent | prev [-] | ||||||||||||||||
I'm not familiar with parser combinators. The parser generators that I'm familiar with (YACC, ANTLR3,5) parse a stream of lexemes/tokens, not characters. Is there a reason why combinators don't operate on lexemes? | |||||||||||||||||
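For what it's worth, a toy sketch suggests the combinator idea itself is not tied to characters: if a parser is just a function from (input, position) to a result, the input can be any indexable sequence, including a list of tokens from a separate lexer. The `satisfy`, `seq`, and `many` names below are hypothetical and not taken from any particular library.

```python
def satisfy(pred):
    """Parser that consumes one element if `pred` accepts it."""
    def parse(inp, pos):
        if pos < len(inp) and pred(inp[pos]):
            return inp[pos], pos + 1
        return None                      # failure
    return parse


def seq(*parsers):
    """Run parsers one after another, collecting their results in a list."""
    def parse(inp, pos):
        out = []
        for p in parsers:
            r = p(inp, pos)
            if r is None:
                return None
            value, pos = r
            out.append(value)
        return out, pos
    return parse


def many(p):
    """Apply `p` zero or more times; `p` must consume input when it succeeds."""
    def parse(inp, pos):
        out = []
        r = p(inp, pos)
        while r is not None:
            value, pos = r
            out.append(value)
            r = p(inp, pos)
        return out, pos
    return parse


# Same combinators over characters...
digits = many(satisfy(str.isdigit))
char_result = digits("123abc", 0)

# ...and over (kind, text) tokens produced by some earlier lexing step.
def kind(k):
    return satisfy(lambda tok: tok[0] == k)

tokens = [("NUM", "1"), ("OP", "+"), ("NUM", "2")]
token_result = seq(kind("NUM"), kind("OP"), kind("NUM"))(tokens, 0)
```

Whether a given library exposes this kind of polymorphism is a separate question; the sketch only shows that the technique permits it.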