chrismorgan 4 days ago

> How easy it is to parse doesn't matter.

How easy it is to parse does matter, because there’s a definite correlation between how easy it is to parse for the computer and for you. When there are bad corner cases, you either have to learn the rules, or keep on producing erroneous and often-content-destructive formatting.

> How easy it is to extend is largely irrelevant.

If you’re content with stock CommonMark, it is irrelevant to you.

If you want to go beyond that, you’re in for a world of pain and mangled content, content that you often won’t notice is mangled until much later, because there’s generally no meaningful way of sanity-checking stuff.

As soon as you interact with more than one Markdown engine—which is extremely likely to happen: your text editor is probably not using the same parser as your build tool, with the same configuration—it matters a lot. If you have ever tried migrating from one engine to another on anything beyond the basics, you will have encountered problems because of this.

Arainach 4 days ago | parent | next [-]

It's miserable to parse C++ and that's fine, because only a few people have to write a parser while 5 orders of magnitude more have to read and write it. Same thing with markdown - the user experience is what matters.

Edge cases largely don't matter, because again I'm not trying to make a book. I don't care if my table is off by a few pixels. 50% of the time I'm reading markdown it's not even formatted, it's just in raw format in an editor.

chrismorgan 4 days ago | parent | next [-]

If you write C++ in a way that it will misparse, you will normally get a hard error that you have to fix. (Also, the complexity is mostly fairly well encapsulated.)

If you write Markdown in a way that your engine will misparse, you may well not notice—the engine probably doesn’t even have a notion of throwing an error. The fact that you’re normally working with the unformatted text makes it even more likely that you won’t notice. (And the complexity is badly encapsulated, too.)

I have seen errors due to Markdown complexity so often. I just dealt with a blog where some images were completely missing because they’d written <img> tags in order to add some class attributes, and the engine didn’t allow raw HTML, and they didn’t notice they were gone (probably because ![]() images were still there). Markdown is so technically unsound it’s sometimes quite distressing.
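To make the failure mode concrete, here’s a sketch in Python—`strip_raw_html` is hypothetical, not any particular engine’s API, and real engines sanitise far more carefully—showing how an engine configured to disallow raw HTML just deletes the tags, with nothing warning the author:

```python
import re

def strip_raw_html(markdown_text):
    # Hypothetical sketch of an engine with raw HTML disallowed:
    # anything that looks like an inline tag is silently deleted.
    return re.sub(r"<[^>]+>", "", markdown_text)

doc = 'See <img src="a.png" class="wide"> and ![alt](b.png).'
print(strip_raw_html(doc))  # the <img> vanishes; the Markdown image survives
```

The `<img>` written for its class attribute disappears without a trace, while the `![]()` image comes through—exactly the kind of loss you only spot much later.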

We’re not talking about a table being off by a few pixels. We’re talking about words being mangled, text disappearing, much more severe stuff. (Core Markdown doesn’t even have table syntax; tables are an extension.)

yawaramin 4 days ago | parent [-]

You are missing the point. We are talking about not even parsing the Markdown. We are talking about reading it raw. Literally raw-dogging it. At that point it doesn't even matter, we just want a format that's brain-dead simple.

HTML transformation is a bonus on top of that. If we want that we will mandate a specific Markdown engine with a strict parser.

chrismorgan 4 days ago | parent [-]

Actually I think you’re missing the point. “Parsing” is not something that computers alone do; humans do it. You see text and understand it to be text, you see <img> and understand it to be an HTML tag (and hopefully know whether your engine will pass it through, or leave it as text, or strip it), you see **double asterisks** and understand it to be bold or strong emphasis.

If you only care about reading it raw, you don’t bother with Markdown. Some of what you write will be the same as Markdown, but not all—for example, no one would use its ridiculous link or image syntax.

The reason you write with Markdown syntax is because you want to be able to format it (even if you will normally consume it in plain text yourself). And once you’re using Markdown syntax, you need to know the rules, to a greater or lesser extent. You must be able to parse Markdown syntax mentally. If you don’t know the rules well enough, your mental parse will be incorrect, and you’ll produce errors. Errors that won’t be detected by your computer. That’s the hazard with Markdown compared to C++: your mental parser is buggy and incomplete, but with C++ the computer will generally catch your errors while with Markdown it will never catch your errors.

SiempreViernes 4 days ago | parent | next [-]

Read their parsing statement in context:

> Markdown is that it's easy to read, and its second-biggest advantage is that it's easy to write. How easy it is to parse doesn't matter

After saying MD is "easy to read" the meaning of "parsing" is clearly limited to automated parsing by non-humans, and the only reasonable reading is "provided the markup is easy to read for humans, the difficulty in constructing an automated parser is irrelevant".

chrismorgan 4 days ago | parent [-]

Reading is not sufficient. If you want it to produce the appropriate HTML, you must parse too.

When you write a file name a_b_c in one place, and a mathematical expression a*b*c in another place, and you don’t want to use `code formatting`, you need to know Markdown’s rules. Because otherwise, you’ll write a*b*c and get “abc” with the b in italics (the asterisks silently become emphasis), instead of writing a\*b\*c to get the literal a*b*c.
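A toy illustration of that rule—this is a deliberately minimal sketch, not CommonMark’s real delimiter algorithm, which is far subtler:

```python
import re

def render_emphasis(text):
    # Minimal sketch: unescaped *...* becomes <em>...</em>,
    # while a backslash-escaped \* stays a literal asterisk.
    placeholder = "\x00"
    text = text.replace(r"\*", placeholder)   # protect escaped asterisks
    text = re.sub(r"\*([^*]+)\*", r"<em>\1</em>", text)
    return text.replace(placeholder, "*")     # restore literal asterisks

print(render_emphasis("a*b*c"))     # asterisks turn into emphasis markup
print(render_emphasis(r"a\*b\*c"))  # escaped: the asterisks survive as text
```

Whether you get markup or literal asterisks depends entirely on knowing to escape—and the engine will never tell you which one you meant.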

(And those are only the exact rules if you’re using CommonMark. On another engine, it might behave differently.)

If you only want to read, don’t use Markdown. But if you want to process as well, you need to know the processing.

indymike 3 days ago | parent | prev | next [-]

> C++ the computer will generally catch your errors while with Markdown it will never catch your errors.

Conveying meaning at the bitwise operator level is a different thing than applying emphasis to a few words in a sentence with bolding or embedding a hyperlink in a document.

chrismorgan 3 days ago | parent [-]

I’ve frequently seen mistakes in Markdown syntax that lead to content that has at best partially-broken formatting, at worst losing some of the content, sometimes even in ways that aren’t obvious.

Markup versus computer code is of course not exactly the same, but the nature of the mistakes—tokens in places they’re not supposed to be, and such—would generally lead to a syntax error in C++.

yawaramin 3 days ago | parent | prev [-]

No, I am positive you are missing the point.

> no one would use its ridiculous link or image syntax.

And many don't, which is fine! But some do, if they remember the syntax. Markdown is tolerant of that, and ultimately if the file is rendered to HTML Markdown engines know to just turn raw URLs into hyperlinks.

> The reason you write with Markdown syntax is because you want to be able to format it

Maybe sometimes. Not always. That's the point. A lot of the time it's nice that most technical people who write docs in text files all agree on what headings, lists, emphasis etc. should look like in plain text so we don't have to constantly do a dance of negotiating what the markup is. And the bonus on top of that is we can also get a reasonable HTML page out of it.

> If you don’t know the rules well enough, your mental parse will be incorrect, and you’ll produce errors. Errors that won’t be detected by your computer. That’s the hazard with Markdown

I mean, 'hazard'. Kind of an over-the-top way to put it. It's a text file for documentation purposes, not a production system handling money or something. Nobody cares if the Markdown has a few syntactic errors. The point is to convey information to other humans in a reasonably efficient way.

thaumasiotes 4 days ago | parent | prev [-]

> It's miserable to parse C++ and that's fine, because only a few people have to write a parser while 5 orders of magnitude more have to read and write it.

Really? I was under the impression that the fact that it is miserable to parse C++ directly means that it's also miserable to compile C++ - it can't be done quickly - which is something that everyone has to do all the time.

vkazanov 4 days ago | parent [-]

FYI: Parsing and compiling in the programming language sense are orthogonal problems. Both are major challenges in C++ compilers.

thaumasiotes 4 days ago | parent | next [-]

What I've read is that C++'s biggest compiling problem is specifically that the language is difficult to parse. You can't compile without parsing, so no, they're not orthogonal problems. Compiling is a parsing step followed by an emission step.

(And just to be completely clear, I'm not saying that the difficulty of parsing C++ makes it miserable to write a compiler. I'm saying that the difficulty of parsing C++ makes it miserable to run a compiler.)

blenderob 4 days ago | parent | prev [-]

> FYI: Parsing and compiling in the programming language sense are orthogonal problems.

How so? In Ada, Fortran, C, C++, Java, Python, etc. parsing is one of the many phases of compiling. Far from being orthogonal problems, parsing is a sub-problem of compiling.

Pet_Ant 3 days ago | parent [-]

The amount of time consumed by parsing is vanishingly small. It's a lot like how the time x86 CPUs spend decoding instructions is marginal nowadays compared to the speculative-execution and reordering logic.

YACC was called "Yet Another Compiler-Compiler" because back in the day parsing was the bulk of compilation; now it's relatively minimal.

thiht 4 days ago | parent | prev [-]

> there’s a definite correlation between how easy it is to parse for the computer and for you

I’m not sure that’s true tbh. Exhibit A: natural language. Exhibit B: Polish notation.

chrismorgan 4 days ago | parent [-]

I don’t see how either of those exhibits demonstrate your point.

I believe various research has shown that humans and machines parse natural language in rather similar ways. Garden-path sentences <https://en.wikipedia.org/wiki/Garden-path_sentence> are a fun demonstration of how human sentence parsing involves speculation and backtracking.

Polish notation is easy for both to parse; humans only struggle because they’re not so familiar with it.

(By adulthood, human processing biases extremely heavily toward the familiar. Computer parsing has to be implemented from scratch, so there’s not so much concept of familiarity, though libraries can encapsulate elements of parsing.)

saghm 3 days ago | parent | next [-]

> Polish notation is easy for both to parse; humans only struggle because they’re not so familiar with it

I think you're downplaying the significance of this. The lack of familiarity is exactly what I'd argue makes a huge difference in practice, even if theoretically the way our brains parse things isn't that different. We spend so much time reading and writing words that it requires effort to learn how to parse each specific symbol-oriented notation we might want to read. To add to the parent comment's examples, I'll throw in Brainfuck, which is an extremely simple language for a machine to parse that's literally named for how impenetrable it looks to people at first glance.

"Simple if I spend the time to learn it" is not the same as "simple without having to spend time to learn it", and for some things, the fact that the syntax essentially ignores some of the finer details is the main feature rather than a drawback. When everyone I work with can read and write markdown well enough for us not to have major issues, and junior engineers can get up to basically the same level of competence in it without needing a lot of hand holding, it's just not worth the effort for me to try to convince everyone to use RST even if it is better in theory. The total amount of time I've spent dealing with the minor annoyances in markdown in my life is less than the amount of time it would probably take me to convince even one of my coworkers that we should switch all of our READMEs to RST.

thiht 3 days ago | parent | prev [-]

> I don’t see how either of those exhibits demonstrate your point.

Natural language is easy to do for a human and a hard computing problem.

Polish notation is extremely simple to implement, but relatively "hard" for a human, even knowing the rules and how to read it. See: `+ * - 15 6 / 20 4 ^ 2 3 - + 7 8 * 3 2`

chrismorgan 3 days ago | parent | next [-]

> Natural language is easy to do for a human and a hard computing problem.

You ever see someone learning a new language? They struggle hard on more complex sentences.

It’s easy for us because we’ve practised it so much.

> + * - 15 6 / 20 4 ^ 2 3 - + 7 8 * 3 2

To begin with, you’re missing an operator. I’ll assume another leading +.

  + + * - 15 6 / 20 4 ^ 2 3 - + 7 8 * 3 2
Now, if you use infix, you have to have at least some of the parentheses, in this case actually only one pair, given rules of operator precedence, associativity and commutativity:

  (15 - 6) * 20 / 4 + 2 ^ 3 + 7 + 8 - 3 * 2
But you may well just parenthesise everything, it makes solving easier:

  ((((15 - 6) * (20 / 4)) + (2 ^ 3)) + ((7 + 8) - (3 * 2)))
And you know how you go about solving it? Calculating chunks from the inside out, and replacing them with their values:

  (((    9    *     5   ) +    8   ) + (  15    -    6   ))
  ((         45           +    8   ) +          9         )
  (                      53          +          9         )
                                    62
Coming back to Polish notation—you know what? It’s exactly the same:

  (+ (+ (* (- 15 6) (/ 20 4)) (^ 2 3)) (- (+ 7 8) (* 3 2)))
  (+ (+ (* 9 5) 8) (- 15 6))
  (+ (+ 45 8) 9)
  (+ 53 9)
  62
For arithmetic at least, it’s not hard. You’re just not accustomed to it.

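You can mechanise that inside-out reduction in a few lines—a sketch only, assuming binary operators and ^ as exponentiation, as in the example:

```python
def eval_prefix(tokens):
    # Recursive prefix (Polish-notation) evaluation: an operator
    # consumes the next two sub-expressions, mirroring the
    # inside-out reduction shown above.
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b,
           "^": lambda a, b: a ** b}
    tok = next(tokens)
    if tok in ops:
        left = eval_prefix(tokens)
        right = eval_prefix(tokens)
        return ops[tok](left, right)
    return float(tok)

expr = "+ + * - 15 6 / 20 4 ^ 2 3 - + 7 8 * 3 2"
print(eval_prefix(iter(expr.split())))  # 62.0
```
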
Arainach 3 days ago | parent [-]

This is a really weird hill to die on. HP tried hard to make RPN a thing, and even among engineers it eventually lost out to notation that is easier to work with.

People read in one direction - in English left to right. They read faster and comprehend better when they can move in that direction without constantly jumping back and forth.

> (15 - 6) * 20 / 4 + 2 ^ 3 + 7 + 8 - 3 * 2

(15-6)*20/4 can be read as one block left to right

2^3 can be read as one block left to right. Jump back to the operator (count: 1)

7 + 8 continue left to right

3*2 is a block, jump back to operator (count: 2)

So that reads left to right as speakers of most western languages do with only two context shifts. Now let's try RPN:

> + + * - 15 6 / 20 4 ^ 2 3 - + 7 8 * 3 2

ignore, ignore, ignore, ignore.

15, 6, context shift (1)

ignore?

20, context shift (2)

4, context shift (3)

ignore?

2 (wait, am I supposed to use that caret? I'm already confused and I've used RPN calculators before. Counting this as a context shift (4))

3, context shift (5)

two more operators and I don't really understand why any more

basically, RPN makes you context shift every single time you enter a number. It is utter chaos to understand: jumping back and forth and trying to remember what came before and what happens next. Even if you're used to it, it's dramatically worse for humans, and no one cares how much software it takes to parse.

Incidentally from my experience with RPN calculators I'd have expected

15 6 - 20 * 4 / 2 3 ^ + 7 + 8 + 3 2 * -

Though it's not really better, since instead of context shifting after every number you have to context shift after every operator to try to remember what's on the stack.
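For what it's worth, that "remember what's on the stack" step is literally what an RPN evaluator does—a minimal sketch (binary operators only, ^ as power, no error handling):

```python
def eval_rpn(tokens):
    # Stack-based RPN evaluation: every operator pops the two most
    # recent values -- the "jump back" described above -- and pushes
    # the result.
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b,
           "^": lambda a, b: a ** b}
    stack = []
    for tok in tokens:
        if tok in ops:
            b = stack.pop()
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack[0]

print(eval_rpn("15 6 - 20 * 4 / 2 3 ^ + 7 + 8 + 3 2 * -".split()))  # 62.0
```
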

fluidcruft 3 days ago | parent | prev [-]

Polish notation looks like a nightmare for expressing something like a partial differential equation. Even combining fractions looks like it's going to be a nightmare.