So many parser combinators operate on bytes assuming ASCII input only. I'd be more interested in a parser combinator lib that has UTF-8 decoding already abstracted away, operating on `wchar_t`, or even polymorphic input stream element types.

lokeg 2 days ago | parent | next [-]

Isn't working with the UTF-8 stream sufficient? Especially if you only have ASCII keywords/operators/brackets, I feel an ASCII parser should work with UTF-8 out of the box.
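
To illustrate why this tends to work: in UTF-8, every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter byte cannot show up in the middle of an encoded non-ASCII character. A minimal Python sketch, assuming the delimiters are single ASCII bytes (`split_on_ascii` is a hypothetical helper, not from any library):

```python
def split_on_ascii(data: bytes, delimiter: bytes) -> list[bytes]:
    """Split raw UTF-8 bytes on a single ASCII delimiter without decoding."""
    # Assumption: the delimiter is exactly one ASCII byte (< 0x80).
    assert len(delimiter) == 1 and delimiter[0] < 0x80
    return data.split(delimiter)


# Non-ASCII text between ASCII brackets and commas.
raw = "[héllo,世界,ok]".encode("utf-8")
inner = raw[1:-1]                     # strip the ASCII brackets bytewise
fields = split_on_ascii(inner, b",")

# Each field is still well-formed UTF-8 and decodes cleanly.
decoded = [f.decode("utf-8") for f in fields]
```

The parser only ever compares bytes; decoding happens afterwards, and only for the fields that need it.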

t-3 2 days ago | parent [-]

Yeah, a parser has no need to understand what a string or glyph is, let alone ASCII or UTF-8. The point is to take a stream of arbitrary data and process it into something that can be reasoned about. Unless you know your input stream is regular in some way, processing it at the finest level of granularity (usually bytes) is probably the only thing to do.

	paulddraper 2 days ago | parent [-]
		Well, it depends on whether you are parsing binary (byte stream) or text (character stream). In practice, lots of text formats (JSON, XML) embed or hint at the character encoding in the format.

Joker_vD 2 days ago | parent | prev | next [-]

I'd rather not. Most of the time, you don't need it, and when you do, it's for a very small part of the input. And `wchar_t` is an abomination (it's UTF-32 on Linux, UTF-16 on Windows, and all of that is allowed); you probably really want `char32_t`, and again, not for the whole of the input; streaming such data a single rune/codepoint at a time is probably fine as well for most uses.

On the other hand, if your parser combinators process char-by-char, then maintaining a small "is this valid UTF-8 so far" context on the side should be pretty simple, so providing it would be a useful option. But actually decoding? Please don't.
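
A rough sketch of what such a side context could look like, in Python. It only tracks how many continuation bytes are still expected and which range the next one must fall in; it never builds code points. The class and function names (`Utf8Validator`, `is_valid_utf8`) are made up for illustration, and the byte ranges follow the usual well-formed UTF-8 table (no overlong forms, no surrogates, nothing above U+10FFFF):

```python
class Utf8Validator:
    """Tracks whether the bytes fed so far form a valid UTF-8 prefix."""

    def __init__(self) -> None:
        self.remaining = 0                 # continuation bytes still expected
        self.low, self.high = 0x80, 0xBF   # allowed range for the next one
        self.ok = True

    def feed(self, byte: int) -> bool:
        """Consume one byte; return False once the stream is invalid."""
        if not self.ok:
            return False
        if self.remaining:
            if self.low <= byte <= self.high:
                self.remaining -= 1
                self.low, self.high = 0x80, 0xBF
            else:
                self.ok = False
        elif byte <= 0x7F:
            pass                           # plain ASCII
        elif 0xC2 <= byte <= 0xDF:
            self.remaining = 1
        elif 0xE0 <= byte <= 0xEF:
            self.remaining = 2
            if byte == 0xE0:
                self.low = 0xA0            # reject overlong encodings
            elif byte == 0xED:
                self.high = 0x9F           # reject UTF-16 surrogates
        elif 0xF0 <= byte <= 0xF4:
            self.remaining = 3
            if byte == 0xF0:
                self.low = 0x90            # reject overlong encodings
            elif byte == 0xF4:
                self.high = 0x8F           # reject values above U+10FFFF
        else:
            self.ok = False                # 0x80..0xC1 and 0xF5..0xFF
        return self.ok

    def at_boundary(self) -> bool:
        """True if valid so far and not in the middle of a sequence."""
        return self.ok and self.remaining == 0


def is_valid_utf8(data: bytes) -> bool:
    validator = Utf8Validator()
    for b in data:
        if not validator.feed(b):
            return False
    return validator.at_boundary()
```

A byte-oriented parser could call `feed` on every byte it consumes and check `at_boundary()` wherever it slices out a string, without ever materializing a decoded character.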

eska 2 days ago | parent | prev | next [-]

I’d still use a byte slice for that. Some formats may mix encodings, or have a text header and binary payload. For those cases one would need to use memchr for the first byte, then compare the remaining few bytes. So I don’t think it would have a huge performance impact.
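
Roughly what that search looks like, sketched in Python rather than C: `bytes.find` with a one-byte needle stands in for `memchr`, and `startswith` at an offset does the comparison of the remaining bytes. `find_marker` and the sample message are invented for illustration:

```python
def find_marker(buf: bytes, marker: bytes, start: int = 0) -> int:
    """Return the index of the first occurrence of marker in buf, or -1.

    Assumption: marker is non-empty.
    """
    first = marker[:1]
    pos = buf.find(first, start)          # plays the role of memchr
    while pos != -1:
        if buf.startswith(marker, pos):   # compare the remaining few bytes
            return pos
        pos = buf.find(first, pos + 1)    # false alarm, keep scanning
    return -1


# A text header followed by a binary payload that is not valid UTF-8.
message = b"Content-Length: 4\r\n\r\n\x00\xff\x10\x80"
split = find_marker(message, b"\r\n\r\n")
header = message[:split].decode("ascii")  # decode only the text part
payload = message[split + 4:]             # keep the payload as raw bytes
```

In real Python code `buf.find(marker)` already does all of this; the loop is spelled out only to mirror the memchr-then-compare approach.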

RossBencina 2 days ago | parent | prev [-]

I'm not familiar with parser combinators. The parser generators that I'm familiar with (YACC, ANTLR3,5) parse a stream of lexemes/tokens, not characters. Is there a reason why combinators don't operate on lexemes?

	Jtsummers 2 days ago | parent | next [-]
		They can. It's just that people often seem to use parser combinators to build both the lexer and the parser, not just the parser, which means dealing with the character stream. If you separate the two steps, parser combinators that deal only with tokens work just fine.
	t-3 2 days ago | parent | prev [-]
		A parser combinator takes parsers as input and produces a new parser. The basic parsers are very simple, but they are combined together to produce more complex parsers.
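
A toy sketch of that idea in Python, not modeled on any particular library. Here a "parser" is assumed to be a function from `(input, position)` to `(value, new_position)`, or `None` on failure; all names (`item`, `seq`, `alt`, `many`) are illustrative. Because the basic parser only compares elements, the same combinators work on characters, bytes, or a list of tokens from a separate lexer:

```python
from typing import Any, Callable, Optional, Sequence, Tuple

# A parser takes (input, position) and returns (value, new_position),
# or None on failure. The input can be any sequence: str, bytes, tokens.
Result = Optional[Tuple[Any, int]]
Parser = Callable[[Sequence, int], Result]


def item(expected: Any) -> Parser:
    """Basic parser: match exactly one element equal to `expected`."""
    def parse(src, pos):
        if pos < len(src) and src[pos] == expected:
            return src[pos], pos + 1
        return None
    return parse


def seq(*parsers: Parser) -> Parser:
    """Combinator: run parsers one after another, collecting their values."""
    def parse(src, pos):
        values = []
        for p in parsers:
            result = p(src, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def alt(*parsers: Parser) -> Parser:
    """Combinator: try each parser in order, return the first success."""
    def parse(src, pos):
        for p in parsers:
            result = p(src, pos)
            if result is not None:
                return result
        return None
    return parse


def many(parser: Parser) -> Parser:
    """Combinator: apply a parser zero or more times."""
    def parse(src, pos):
        values = []
        while True:
            result = parser(src, pos)
            # Stop on failure, or if the parser consumed nothing.
            if result is None or result[1] == pos:
                return values, pos
            value, pos = result
            values.append(value)
    return parse


# Character-level use: one digit followed by zero or more digits.
digit = alt(*(item(c) for c in "0123456789"))
number = seq(digit, many(digit))

# Token-level use: the same combinators over a pre-lexed token list.
tokens = ["LPAREN", "RPAREN"]
parens = seq(item("LPAREN"), item("RPAREN"))
```

Calling `number("42+1", 0)` consumes the two digits and stops at position 2, while `parens(tokens, 0)` consumes both tokens. Nothing in `seq`, `alt`, or `many` knows what the elements are.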