sieve 3 hours ago

Anyone trying to do this... the first thing you do is avoid lex/yacc/bison/antlr. You do not need all this ceremony. A recursive descent parser that uses Pratt parsing will work for a vast majority of cases.
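To make the Pratt parsing claim concrete, here is a minimal sketch of a recursive descent expression parser driven by binding powers. It is only an illustration, not anyone's production code: the `BINARY` table, the `PREFIX_MINUS` value, the tuple-based tree shape and every function name are assumptions chosen for the example.

```python
# Hypothetical binding-power table: (left, right). Higher binds tighter.
# left < right gives left associativity; left > right gives right associativity.
BINARY = {"+": (10, 11), "-": (10, 11), "*": (20, 21), "/": (20, 21), "^": (31, 30)}
PREFIX_MINUS = 25  # unary minus binds tighter than * and /, looser than ^


def tokenize(src):
    """Hand-written scanner: integers, operators and parentheses only."""
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            start = i
            while i < len(src) and src[i].isdigit():
                i += 1
            tokens.append(("num", int(src[start:i])))
        elif ch in "+-*/^()":
            tokens.append((ch, ch))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} at offset {i}")
    tokens.append(("eof", None))
    return tokens


class Parser:
    def __init__(self, src):
        self.tokens = tokenize(src)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0]

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expect(self, kind):
        tok = self.advance()
        if tok[0] != kind:
            raise SyntaxError(f"expected {kind!r}, got {tok[0]!r}")
        return tok

    def parse_expr(self, min_bp=0):
        # Prefix position: a number, a parenthesised expression, or unary minus.
        kind, value = self.advance()
        if kind == "num":
            left = value
        elif kind == "(":
            left = self.parse_expr(0)
            self.expect(")")
        elif kind == "-":
            left = ("neg", self.parse_expr(PREFIX_MINUS))
        else:
            raise SyntaxError(f"unexpected token {kind!r}")

        # Infix loop: keep consuming operators that bind at least as tightly
        # as the caller requires.
        while True:
            op = self.peek()
            if op not in BINARY:
                break
            left_bp, right_bp = BINARY[op]
            if left_bp < min_bp:
                break
            self.advance()
            right = self.parse_expr(right_bp)
            left = (op, left, right)
        return left


def parse(src):
    parser = Parser(src)
    tree = parser.parse_expr()
    parser.expect("eof")
    return tree
```

Calling `parse("1 + 2 * 3")` yields a nested tuple with the multiplication under the addition. Adding an operator is one new row in the table, which is most of the appeal compared to encoding precedence in a grammar file.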

The lexer/parser is never the bottleneck. In fact, you can write those two by hand over a single weekend for a largish language. With LLMs, it takes 15 minutes if you have an unambiguous spec.

The biggest time sink, and the reason you will fail for sure, is the inability to restrict the scope of the project. You start with a limited feature set and produce the entire compiler/vm toolchain. Then you get greedy and fiddle with the type system, adding features that you have never used and probably never will. And now you have to change every single phase from start to end.

I mostly give up at this stage.

wg0 2 hours ago | parent | next [-]

Jonathan Blow wrote his own game engine, and for that he wrote his own programming language.

He went with a straight recursive descent parser and said the same thing.

I think compiler courses teach from yacc, bison etc that's where this whole thing came from, but in practice people discovered that hand-written recursive descent parsers are all you need.

sieve 2 hours ago | parent [-]

> I think compiler courses teach from yacc, bison etc that's where this whole thing came from

Very true. I have a shelf full of books on compiler development and optimization. I have read them selectively, a chapter here, a chapter there. But that shelf is useless for a vast majority of people.

You might find it useful if you are developing a production-level compiler/vm (I cannot make this statement with a straight face while Python rules the world). But a simple and sensible architecture that uses recursive-descent parsing takes you a long way.

Most hobbyist compilers (and even some production ones) are written as a heavy front-end compiling down to C or LLVM. Very few people actually write their own backend.

tehologist an hour ago | parent [-]

Re: bison and yacc. It came from the dragon book, which for the longest time was the way to learn to write languages.

pan69 an hour ago | parent | prev | next [-]

I learned to do this about 2 years ago (pre-LLM). I have been developing software for ~30 years and somehow doing something like this was a major mental obstacle, mostly created by the perception of "the dragon book", as in this topic being full of mystical unobtainable incantations, so I never even dared venture into this space. Silly, I know. However, after diving into this and learning to write a recursive descent parser for a DSL I wanted to write, it felt like I'd acquired a superpower. Totally understand that there are many more layers to all of this, layers that can get very complex, but just learning that first bit...

sieve 14 minutes ago | parent [-]

I wish people would start with Nystrom's https://www.craftinginterpreters.com/ and avoid the dragon etc unless they really, really need it. Almost everything I have learnt about compiler/vm development, I have done so by reading random blogs and articles on various aspects and small tutorials on writing parsers and vms.

Even stuff like Crenshaw's Let's Build a Compiler was more useful to me than all these books that do lexical analysis using regular expressions. I have written lexers and parsers hundreds of times for all kinds of DSLs and config languages and not once have I used regular expressions to scan the text.
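For readers who have only seen regex-based lexers, a hand-written scanner is just a loop over characters with an index. The sketch below assumes a made-up config-style language: the keyword set, the token kinds and the `scan` function name are all hypothetical and not taken from any of the books or tutorials mentioned.

```python
KEYWORDS = {"let", "if", "else"}        # hypothetical keyword set
TWO_CHAR_OPS = ("==", "!=", "<=", ">=")
ONE_CHAR_OPS = "=<>+-*/(){},;"


def scan(src):
    """Return a list of (kind, text, line) tuples without using regular expressions."""
    tokens = []
    i, line, n = 0, 1, len(src)
    while i < n:
        ch = src[i]
        if ch == "\n":
            line += 1
            i += 1
        elif ch in " \t\r":
            i += 1
        elif ch == "#":
            # Line comment: skip up to, but not including, the newline.
            while i < n and src[i] != "\n":
                i += 1
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (src[i].isalnum() or src[i] == "_"):
                i += 1
            word = src[start:i]
            kind = "keyword" if word in KEYWORDS else "ident"
            tokens.append((kind, word, line))
        elif ch.isdigit():
            start = i
            while i < n and src[i].isdigit():
                i += 1
            tokens.append(("int", src[start:i], line))
        elif ch == '"':
            # String literal with no escape sequences, kept on a single line.
            start = i + 1
            i += 1
            while i < n and src[i] not in '"\n':
                i += 1
            if i >= n or src[i] == "\n":
                raise SyntaxError(f"unterminated string on line {line}")
            tokens.append(("string", src[start:i], line))
            i += 1  # skip the closing quote
        elif src[i:i + 2] in TWO_CHAR_OPS:
            # Check two-character operators before single-character ones.
            tokens.append(("op", src[i:i + 2], line))
            i += 2
        elif ch in ONE_CHAR_OPS:
            tokens.append(("op", ch, line))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {ch!r} on line {line}")
    return tokens
```

Each branch peeks at the current character and consumes as much as it needs, so error messages and line tracking fall out of the same loop.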

true_religion 3 hours ago | parent | prev [-]

I wrote a few of these due to an interest in compilers and hardware.

The easiest syntax to copy if you’re looking for a high level language is Smalltalk.

But most of the time, I wouldn’t even use that. Simple imperative languages that look like BASIC work pretty well in most domains. If you simplify the syntax a little, the compiler is very easy to understand, and you can use it when, say, you want users to input code into existing systems.

sieve 2 hours ago | parent [-]

I have written compilers for two families over the years: C and ML. My current preference is Python. I am currently working on a statically typed language that is inspired by Python (minus objects and OOP) that runs on a register VM.
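For anyone unfamiliar with the term, a register VM executes instructions that name their operand and destination registers directly, instead of pushing and popping a stack. The toy interpreter below is a sketch under that definition only. The opcode names, the tuple instruction format and the `run` function are invented for illustration and say nothing about how the language described above is implemented.

```python
def run(program, num_regs=4):
    """Interpret a list of (opcode, *operands) tuples on a tiny register machine."""
    regs = [0] * num_regs
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "loadi":            # loadi dst, imm   -> regs[dst] = imm
            dst, imm = args
            regs[dst] = imm
        elif op == "add":            # add dst, a, b    -> regs[dst] = regs[a] + regs[b]
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
        elif op == "sub":            # sub dst, a, b    -> regs[dst] = regs[a] - regs[b]
            dst, a, b = args
            regs[dst] = regs[a] - regs[b]
        elif op == "mul":            # mul dst, a, b    -> regs[dst] = regs[a] * regs[b]
            dst, a, b = args
            regs[dst] = regs[a] * regs[b]
        elif op == "jnz":            # jnz reg, target  -> jump if regs[reg] != 0
            reg, target = args
            if regs[reg] != 0:
                pc = target
        elif op == "ret":            # ret reg          -> return regs[reg]
            return regs[args[0]]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    raise RuntimeError("program ended without ret")


# Factorial of 5: r0 counts down, r1 accumulates the product, r2 holds the constant 1.
FACTORIAL_5 = [
    ("loadi", 0, 5),
    ("loadi", 1, 1),
    ("loadi", 2, 1),
    ("mul", 1, 1, 0),   # index 3: loop head
    ("sub", 0, 0, 2),
    ("jnz", 0, 3),
    ("ret", 1),
]
```

`run(FACTORIAL_5)` returns 120. A real compiler targeting such a VM also needs register allocation to map program variables onto the fixed register set, which is one of the phases listed in the next paragraph.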

Syntax is a minor issue but something that people are very opinionated about. You could technically build multiple front ends that share the typechecking, CFG validation, optimization, register allocation and byte code emission phases. But it is too much work for what is presently a personal project.