nicoburns 4 days ago

I wonder who it is that likes other kinds of parser. Over the last ~10 years or so I've read several articles on HN arguing that recursive descent parsers are in fact great. And they seem to be both the easiest to get started with and what almost all production-grade systems use. I've seen very little arguing for any other approach.

o11c 4 days ago | parent | next [-]

Recursive descent is fine if you trust that you won't write buggy code. If you implement a generator for it (easy enough), this may be a justifiable thing to trust (though this is not a given). I am assuming you're willing to put up with the insanity of grammar rewriting, one way or another.

LR however is more powerful, though this mostly matters if you don't have access to automatic grammar rewriting for your LL. More significantly, however, there's probably more good tooling for LR (or perhaps: you can assume that if tooling exists, it is good at what it is designed for); one problem with LL being so "simple" is that there's a lot of bad tooling out there.

The important things are 1. that you meaningfully eliminate ambiguities (which is easy to enforce for LR and doable for LL if your tooling is good), and 2. that you keep linear time complexity. Any parser other than LL/LR should be rejected because it fails at least one of these, and often both.

Within the LL and LR families there are actually quite a few members. SLR(1) is strong enough to be interesting but too weak for anything I would call a "language". LALR(1) is probably fine; I have never encountered a useful language that must resort to LR(1) (though note that modern tooling can do an optimistic fallback, avoiding the massive blowups of ancient LR tools). SLL(1) I'm not personally familiar with. X(k), where X is one of {SLL, LL, SLR, LALR, LR} and where k > 1, are not very useful; k=1 suffices. LL(*), however, should be avoided due to backtracking, but in some cases consider whether you can parse token trees first (this is currently poorly represented in the literature; you want to be doing some form of this for error recovery anyway - automated error recovery is a useless lie) and/or defer the partial ambiguity until the AST is built (often better for error messages anyway, independent of using token trees).
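
Roughly, a token-tree pre-pass just matches brackets and nests everything between them, so the later pass can treat each group as one unit and each unmatched bracket becomes a natural error-recovery boundary. Something like this toy sketch (made-up names, string tokens):

    BRACKETS = {"(": ")", "[": "]", "{": "}"}

    def build_token_tree(tokens):
        """Group a flat token list into nested lists by matching brackets."""
        root = []
        stack = [(None, root)]               # (expected closer, children)
        for tok in tokens:
            if tok in BRACKETS:              # opener: start a nested group
                group = [tok]
                stack[-1][1].append(group)
                stack.append((BRACKETS[tok], group))
            elif tok in BRACKETS.values():   # closer: finish the current group
                closer, group = stack.pop()
                if tok != closer:
                    raise SyntaxError(f"mismatched {tok!r}")
                group.append(tok)
            else:
                stack[-1][1].append(tok)
        if len(stack) != 1:
            raise SyntaxError("unclosed bracket")
        return root

    # build_token_tree("f ( a + b )".split()) -> ['f', ['(', 'a', '+', 'b', ')']]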

kerkeslager 4 days ago | parent | next [-]

> Recursive descent is fine if you trust that you won't write buggy code. If you implement a generator for it (easy enough), this may be a justifiable thing to trust (though this is not a given).

The idea that you're going to hand-roll a parser generator, use it to generate a parser, and end up with something less buggy than a hand-rolled recursive descent parser screams "I've never written code outside of an academic context".

pjc50 3 days ago | parent | next [-]

One of the smartest projects I've ever seen was a tool that took the human-readable tables of the HEVC and AV1 specs, used them as input to the OMeta parser generator (https://en.wikipedia.org/wiki/OMeta), and then output both HEVC parsers in a variety of languages and auto-fuzzers for test coverage. It ended up at https://www.graphcore.ai/posts/graphcore-open-sources-argon-...

Personally I've also written a parser-generator for XML in C# to overcome some of the odd limitations of Microsoft's one when used in AOT contexts.

Hand-rolling is easy if the grammar is small. The larger it gets (and video codecs are huge!) the more you want something with automatic consistency.

kerkeslager 3 days ago | parent [-]

Sure, if you need parsers in a dozen languages, then a parser generator might make sense, because you're not writing one parser, you're writing a dozen.

But, the vast majority of parsers I've written didn't have this requirement. I needed to write one parser in one language.

maxbond 4 days ago | parent | prev | next [-]

> [It] screams "I've never written code outside of an academic context".

SQLite, perhaps the most widely deployed software system, takes this approach.

https://sqlite.org/lemon.html

> The Lemon LALR(1) Parser Generator

> The SQL language parser for SQLite is generated using a code-generator program called "Lemon".

> ...

> Lemon was originally written by D. Richard Hipp (also the creator of SQLite) while he was in graduate school at Duke University between 1987 and 1992.

Here is the grammar, if you're curious.

https://github.com/sqlite/sqlite/blob/master/src/parse.y

mpyne 4 days ago | parent | next [-]

SQLite is kind of cheating here; you won't catch me writing my own source control management system either.

But I do think the wider point still holds: there can be real benefit to implementing two properly layered abstractions rather than one broader abstraction where the complexity can span across more of the problem domain.

kerkeslager 3 days ago | parent | prev [-]

Yeah, let me know when you're writing the next SQLite. For your average parser, you're not writing the SQLite parser, you don't have the SQLite parser's problems, and you don't need SQLite's solutions.

lanstin a day ago | parent | next [-]

There is a great Steve Yegge post on how useful ad-hoc transformation of source code is: http://steve-yegge.blogspot.com/2007/06/rich-programmer-food...

The only time I have used this myself was an expat-style transformer for Terraform (HCL). We had a lot of Terraform, and they kept changing the language, so I would build a fixer to make code written for, say, 0.10 work with 0.12, and then again for 0.14. It was very fun and let us keep updating to newer Terraform versions. It's a pretty simple language, except for distinguishing quoted blocks from non-quoted ones.

maxbond 3 days ago | parent | prev [-]

Most people aren't writing something as complex as SQLite, but most people aren't writing parsers either. Those writing parsers are disproportionately writing things like programming languages and language servers that are quite complex.

SQLite isn't some kind of universal template; I'm not saying people should copy it or that recursive descent is a bad choice. But empirically, parser generators are used in real production systems. SQLite is unusual in that they also wrote the parser generator, but otherwise it is in good company. Postgres uses Bison, for example.

Additionally, I think the fact that Lemon started as a personal learning project in grad school (as academic a project as it gets) and evolved into a component of what is probably the most widely deployed software system of all time shows that this distinction between what is academic and what is practical isn't all that meaningful to begin with. What's academic becomes practical when the circumstances are right. Better to evaluate a technique in the context of your problem than to prematurely bin things into artificial categories.

kerkeslager a day ago | parent [-]

> Those writing parsers are disproportionately writing things like programming languages and language servers that are quite complex.

Sure, but adding the complexity of a parser generator doesn't help with that complexity in most cases.

[General purpose] programming languages are a quintessential example. Yes, a compiler or an interpreter is a very complex program. But unless your programming language needs to be parsed in multiple languages, you definitely do not need to generate the parser in many languages like SQLite does. That just adds complexity for no reason.

You can't just say "it's complex, therefore it needs a parser generator" if adding the parser generator doesn't address the complexity in any way.

motorest 4 days ago | parent | prev [-]

> The idea that you're going to hand-roll a parser generator, use it to generate a parser, and end up with something less buggy than a hand-rolled recursive descent parser screams "I've never written code outside of an academic context"

Your comment is quite funny, as hand-rolling a recursive descent parser is the kind of thing that is often accused of being (a) bug-prone and (b) only done in academic environments.

vidarh 3 days ago | parent | next [-]

Having written several parser generators, all my production parsers are hand-written - either pure recursive descent or a combination of recursive descent and operator precedence parsing.

The reason being that the sheer number of parser generators out there largely reflects how we keep desperately looking for a way of writing one that isn't pure pain to use in production.
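
For what it's worth, the combination usually looks something like this toy sketch (nothing resembling production code; a minimal token stream is assumed): one hand-written function per statement form, and expressions drop into a small precedence-climbing loop.

    PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

    class TS:                                 # minimal token stream
        def __init__(self, toks):
            self.toks, self.i = toks, 0
        def peek(self):
            return self.toks[self.i] if self.i < len(self.toks) else None
        def next(self):
            tok = self.peek()
            self.i += 1
            return tok

    # --- recursive descent for statements ---
    def parse_statement(ts):
        if ts.peek() == "return":
            ts.next()
            return ("return", parse_expression(ts, 0))
        if ts.peek() == "while":
            ts.next()
            return ("while", parse_expression(ts, 0), parse_statement(ts))
        return ("expr-stmt", parse_expression(ts, 0))

    # --- operator precedence for expressions ---
    def parse_expression(ts, min_prec):
        left = ts.next()                      # toy atom: any single token
        while ts.peek() in PREC and PREC[ts.peek()] >= min_prec:
            op = ts.next()
            left = (op, left, parse_expression(ts, PREC[op] + 1))  # +1 => left-assoc
        return left

    # parse_statement(TS("return a + b * c".split()))
    #   -> ('return', ('+', 'a', ('*', 'b', 'c')))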

paddim8 4 days ago | parent | prev | next [-]

What? Accused of only being done in academic environments? Never heard that. Academics seem to spend 99% of their time talking about parser generators and LR parsing for some reason while most production compilers have handwritten recursive descent parsers...

jfyi 3 days ago | parent [-]

Same here, recursive descent is what I have run into in the real world.

I'm just happy when parsing isn't being done with some absurdly long regex with no documentation.

nrds 3 days ago | parent | prev [-]

> ambiguities

It's important to note that ambiguities exist in service of parser generators and the restricted formal grammars that drive them. They do not actually exist in the language to be parsed (unless that language is not well-specified, but then all bets are off and it is meaningless to speak of parsing), because they can be eliminated by side-conditions.

For example, one famous ambiguity is the dangling 'else' problem in C. But this isn't an actual ambiguity in the C language: the language has a side-condition which says that 'else' matches to the closest unmatched 'if'. This is completely unambiguous and so a recursive descent parser for C simply doesn't encounter this problem. It is only because parser generators, at least in their most academic form, lack a way to specify this side-condition that their proponents have to come up with a whole theory of "ambiguities". (Shockingly, Wikipedia gets this exactly right in the article on dangling else which I just thought to look up: "The dangling else is a problem in programming of parser generators".)
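
To make that concrete: in a hand-written parser, the side-condition is simply the point where the code checks for 'else', and the innermost active call is the one that sees that token first. A toy Python sketch (single-token conditions, shapes made up):

    def parse_statement(toks, i):
        if i < len(toks) and toks[i] == "if":
            return parse_if(toks, i)
        return ("stmt", toks[i]), i + 1           # toy: any other token is a statement

    def parse_if(toks, i):
        # assumes toks[i:] looks like: 'if' '(' <cond> ')' <statement> ...
        cond = toks[i + 2]
        then_branch, i = parse_statement(toks, i + 4)
        else_branch = None
        if i < len(toks) and toks[i] == "else":   # the innermost active call grabs it
            else_branch, i = parse_statement(toks, i + 1)
        return ("if", cond, then_branch, else_branch), i

    # parse_statement("if ( a ) if ( b ) s1 else s2".split(), 0)[0]
    #   -> ('if', 'a', ('if', 'b', ('stmt', 's1'), ('stmt', 's2')), None)

The 'else' attaches to the inner 'if' not because of any disambiguation machinery, but because that is the call that happens to be looking at the token stream when the 'else' shows up.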

Likewise goes the problem of left-recursion. Opponents of recursive descent always present left-recursion as a gotcha which requires some special handling. Meanwhile actual programmers writing actual recursive descent parsers don't have any idea what these academics are talking about because the language that they're parsing (as it exists in their mind) doesn't feature left-recursion, but instead iteration. Left-recursion is only introduced in service of restricted formal grammars in which recursion is the only available primitive and iteration either doesn't exist or is syntactic sugar for recursion. For the recursive descent user, iteration is a perfectly acceptable primitive. The reason for the discrepancy goes back to side-conditions: iteration requires a side-condition stating how to build the parse tree; parser generators call this "resolving the ambiguity" because they can't express this in their restricted grammar, not because the language was ambiguous.
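
For example, a left-associative chain of subtractions is just a loop in the hand-written parser, and the line that folds the tree to the left is exactly the side-condition a restricted grammar can't state directly. Toy sketch:

    def parse_expr(toks):
        """Parse e.g. ['1', '-', '2', '-', '3'] into a left-associative tree."""
        pos = 0

        def term():
            nonlocal pos
            tok = toks[pos]
            pos += 1
            return tok

        left = term()
        while pos < len(toks) and toks[pos] == "-":
            pos += 1
            left = ("-", left, term())   # fold left: ((1 - 2) - 3), not (1 - (2 - 3))
        return left

    # parse_expr("1 - 2 - 3".split()) -> ('-', ('-', '1', '2'), '3')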

layer8 3 days ago | parent | next [-]

> They do not actually exist in the language to be parsed (unless that language is not well-specified

How do you specify your language “well” when you don’t know if your grammar is unambiguous? Determining whether a grammar is ambiguous is famously undecidable in the general case. So how do you decide, if you don’t restrict your grammar to one of the decidable forms checkable by parser generators? You can add some disambiguation rules, but how do you know they cover all ambiguities?

We use formal systems exactly to make sure that the language is well-defined.

o11c 3 days ago | parent | prev [-]

Dangling "else" isn't actually a problem for parser generators. All you have to do is either:

* use proper rules rather than cramming everything into "statement", or

* specify explicit precedence rules, which is just a shortcut for the above (also skipping useless reductions)

Doing this is ubiquitous with parser generators when dealing with vaguely Algol-like languages, and is no different than the fact that you have to do the same thing for expressions.
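
For reference, the first option is the classic textbook split, roughly like this (yacc/Bison-style sketch, nonterminal names made up):

    /* An 'else' may only follow a statement whose own ifs are all matched,
       so "else binds to the closest unmatched if" is encoded in the rules. */

    statement : matched
              | unmatched
              ;
    matched   : IF '(' expr ')' matched ELSE matched
              | other_statement
              ;
    unmatched : IF '(' expr ')' statement
              | IF '(' expr ')' matched ELSE unmatched
              ;

The precedence-rule shortcut gets you the same parse without the extra nonterminals.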

jasperry 4 days ago | parent | prev | next [-]

But remember that the articles arguing for recursive descent parsers are arguing against the long-dominant paradigm of using LR parsers. Plenty of us still like LR parser generators (see my other comment.)

In between "easiest to get started with" and "what production-grade systems use", there is "easy to actually finish a medium-sized project with." I think LR parsers still defend that middle ground pretty well.

nicoburns 4 days ago | parent [-]

> But remember that the articles arguing for recursive descent parsers are arguing against the long-dominant paradigm of using LR parsers

That was part of my question I think. I wouldn't have been able to tell you that the dominant paradigm being argued against was LR parsers, because I've never come across even one that I'm aware of (I've heard of them, but that's about it). Perhaps it's academia where they're popular?

jasperry 4 days ago | parent [-]

I did learn about LR parser generators first in my Compilers class in college, but I assumed they were generally known about in language development communities.

userbinator 4 days ago | parent | prev | next [-]

> I wonder who it is that likes other kinds of parser.

It seems to be mainly academics and others interested in parsing theory, and those who like complexity for the sake of complexity.

masklinn 3 days ago | parent | prev | next [-]

Pratt parsers are really fun, if slightly mind-bending. Their ability to handle odd associativities, precedences, and arities is basically unmatched, making them really useful to embed inside recursive descent for when you reach expressions. If you need infix and mixfix operators anyway.
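
A minimal sketch of the core idea (illustrative only, names made up): each infix operator carries a pair of binding powers, so left- vs right-associativity and prefix operators become one-line decisions. Mixfix/ternary forms slot into the same loop, though this sketch leaves them out.

    INFIX_BP = {"+": (1, 2), "-": (1, 2), "*": (3, 4), "^": (6, 5)}  # (left, right); '^' right-assoc

    def expr_bp(toks, pos=0, min_bp=0):
        tok = toks[pos]
        pos += 1
        if tok == "-":                       # prefix minus: recurse with a high binding power
            rhs, pos = expr_bp(toks, pos, 7)
            lhs = ("neg", rhs)
        else:
            lhs = tok                        # toy atom: any single token
        while pos < len(toks):
            op = toks[pos]
            if op not in INFIX_BP:
                break
            left_bp, right_bp = INFIX_BP[op]
            if left_bp < min_bp:
                break
            rhs, pos = expr_bp(toks, pos + 1, right_bp)
            lhs = (op, lhs, rhs)
        return lhs, pos

    # expr_bp("1 + 2 * 3".split())[0] -> ('+', '1', ('*', '2', '3'))
    # expr_bp("2 ^ 3 ^ 4".split())[0] -> ('^', '2', ('^', '3', '4'))   (right-associative)
    # expr_bp("1 - - 2".split())[0]   -> ('-', '1', ('neg', '2'))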

lenkite 4 days ago | parent | prev | next [-]

The literature on incremental parsing doesn't appear to have much for recursive descent; everyone appears to use the LR-based tree-sitter approach.

cxr 3 days ago | parent | prev [-]

The post by Laurence Tratt, which this piece is a response to, argues for another approach and is mentioned in the first sentence.