lifthrasiir 4 days ago

Granted, reST is more feature-complete and extension-friendly, but it is simply unusable for me because it wasn't designed for agglutinative languages like Korean. Markdown is much better in this case (though CommonMark has an annoying edge case [1]).

[1] https://talk.commonmark.org/t/foo-works-but-foo-fails/2528

thaumasiotes 4 days ago | parent | next [-]

> reST is more feature-complete and extension-friendly, but it is simply unusable for me because it wasn't designed for agglutinative languages like Korean.

How does whether you think of the language as agglutinative affect the usability of reST?

The biggest problem that occurs to me is that there isn't really a conceptual difference between an "agglutinative" language in which you have very long words expressing complex meanings, and an "isolating" language in which the same syllables occur in the same order with the same meaning but are thought of on a Platonic level as being all independent words.

This is because an "agglutinative" language is one in which syntax markers are more or less independent of any other syntax markers that may apply to the same word†, which means it's always possible by definition to consider those markers to be "words" themselves.

Would your problems be solved if you viewed what you had considered "long" Korean words as instead being several short words in a row? What difficulties does agglutination present?

† Compare: https://glossary.sil.org/term/agglutinative-language

> An agglutinative language is a language in which words are made up of a linear sequence of distinct morphemes and each component of meaning is represented by its own morpheme.

https://glossary.sil.org/term/isolating-language

> An isolating language is a language in which almost every word consists of a single morpheme.

lifthrasiir 4 days ago | parent | next [-]

> This is because an "agglutinative" language is one in which syntax markers are more or less independent of any other syntax markers that may apply to the same word†, which means it's always possible by definition to consider those markers to be "words" themselves.

I think SIL's definition is, while robust, not the usual definition because English can be regarded as agglutinative in this definition. This is particularly visible from the statement that most European languages are somewhat fusional [1], which is okay under their definitions but not the usual way we think of English.

In my understanding, analyticity is a spectrum, and highly analytic languages in which most (but not necessarily all) words contain just one morpheme are said to be isolating. Words in agglutinative languages can be, but do not necessarily have to be, analyzed as a main morpheme ("word") with dependent morphemes attached ("affixes"). Polysynthetic languages go further by allowing multiple main morphemes in one word. As languages become more synthetic (as opposed to analytic), the space-separated "word" is less useful [2] and segmentation gets harder and harder. reST's failure to support those languages is all about a bad assumption about segmentation.

[1] https://glossary.sil.org/term/fusional-language

[2] So much that several agglutinative languages---in which space-separated words can still be useful---don't even think about spacing, e.g. Japanese.

thaumasiotes 4 days ago | parent [-]

> I think SIL's definition is, while robust, not the usual definition because English can be regarded as agglutinative in this definition. This is particularly visible from the statement that most European languages are somewhat fusional, which is okay under their definitions but not the usual way we think of English.

Well, in the first place, I don't put much stock in the idea that "the usual way we think of" a language is a good way to determine the characteristics of that language. A good example here would be Finnish, which has a large number of particles that appear to be independent of the words they modify, but which are traditionally referred to as "case markers" by analogy to European languages that have case. Finnish is said to have an extraordinarily large number of cases, but that is because each Finnish preposition is called a "case".

In the second place, you can clearly see fusion in the English verb be. You can see it less clearly in other places - wikipedia's page on analytic languages calls out the third-person singular present verb ending for simultaneously encoding all three of those contrasts.

But I would say you're right in spirit that those are vestigial elements of the language. English verb structure looks very agglutinative to me; the biggest objection (which SIL's definition doesn't mention) would be that auxiliary verbs still inflect.

In particular, this:

> Words in agglutinative languages can be, but [do] not necessarily have to be, analyzed as a main morpheme ("word") with dependent morphemes attached ("affixes").

is actually the standard view of English verbs (except that the auxiliary verbs are not thought of as affixes), still taught in school, but contradicted by syntax classes that say that a dependent element shouldn't control the form of the element from which it depends. And then uncontradicted by practicing linguists who feel that we might as well follow the obvious semantic dependence.

Another objection, which I find more persuasive than "agglutinative particles shouldn't inflect", is that the meaning of a particular English word form isn't necessarily very tightly determined by the form. So in he is painting a picture, the -ing element we see on painting is fundamentally there to agree with the continuous aspect marker is, and it has other meanings in other contexts. In he likes painting pictures, the same element is there to derive a noun from the verb.

And another objection might be that the languages we call agglutinative commonly incorporate subject and object into the interior of the verb, surrounded by other affixes, which isn't done in English unless you want to count phrasal verbs. ;D

I am undisturbed by the ambiguity; you might note that I led with the observation that agglutinative languages aren't well-defined in the first place.

None of this helps to explain why there might be a conflict between Markdown and agglutination, though.

lifthrasiir 4 days ago | parent | next [-]

I'm not here for arguing against linguistic concepts, so let me cover just one thing:

> None of this helps to explain why there might be a conflict between Markdown and agglutination, though.

reST, not Markdown. (Yeah I totally get it though because I made the same mistake in the OP!) Those languages often need to highlight individual morphemes inside space-separated "words", but reST assumes space-separated "word" as a default, hence annoyance.

mattclarkdotnet 4 days ago | parent | prev [-]

It’s amazing anyone can read, speak or write such a language!

thaumasiotes 3 days ago | parent [-]

Huh?

chrismorgan 4 days ago | parent | prev | next [-]

The key here is whether there’s a word separator, not agglutinativity or isolation. The term I find for this on a brief search is scriptio continua <https://en.wikipedia.org/wiki/Scriptio_continua>.

lifthrasiir 4 days ago | parent [-]

Yeah that would be a better way to phrase my opinion. Chinese is highly isolating but doesn't use spacing due to its writing system and therefore is heavily affected by this issue.

mattclarkdotnet 4 days ago | parent | prev [-]

These are descriptive terms though? It’s not like the language actually works that way

do_not_redeem 4 days ago | parent | prev | next [-]

What do you mean not designed for Korean? It's just unicode. If there's some situation where RST isn't parsing inline markup, you can write the role explicitly like this:

  this is **bold** text
  this is :strong:`bold` text

rune-space 4 days ago | parent | next [-]

But you can’t say:

   thisis:strong:`bold`text

Whereas the equivalent is perfectly fine in markdown.

Falsehoods programmers believe about written language: whitespace is used to separate atomic sequences of runes.

thaumasiotes 4 days ago | parent [-]

> Falsehoods programmers believe about written language: whitespace is used to separate atomic sequences of runes.

Really? That isn't just untrue of written language in general. It's untrue of every individual written language in specific. You can't even clearly define what an "atomic sequence of glyphs" is.

matja 4 days ago | parent | next [-]

> You can't even clearly define what an "atomic sequence of glyphs" is.

Kinda. Grapheme cluster breaks are defined in Unicode, but they have all the baggage and edge cases you'd expect from human languages evolving over time, so they can be encoded in as few as a thousand rules: https://github.com/unicode-org/icu/tree/main/icu4c/source/da...
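
To make the "atomic sequence" point concrete, here is a small sketch using only Python's standard `unicodedata` module. It does not do grapheme cluster segmentation at all; it only shows that a single Korean syllable block can be one code point or three depending on normalization, so code points are not a safe stand-in for "atomic" units. The variable names are just for illustration.

```python
import unicodedata

# One user-perceived character: the Hangul syllable block "한" (U+D55C).
syllable = "한"

# NFD splits the precomposed syllable into its conjoining jamo.
decomposed = unicodedata.normalize("NFD", syllable)

# Same perceived character, different code point counts.
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]

print(len(syllable), len(decomposed), code_points)
```

Both forms render as the same single glyph, which is why anything that wants to reason about "characters" needs the Unicode segmentation rules rather than a code point count.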

rune-space 3 days ago | parent | prev [-]

Which makes one wonder why reST puts so much weight on them being divided by whitespace!

lifthrasiir 4 days ago | parent | prev [-]

reST inline syntaxes are pretty much word-based, which doesn't work very well with agglutinative languages. For example if you want to apply a markup to "이 페이지" in "이 페이지는 ..." (lit. This page in This page is ...), you need to do `*이 페이지*\ 는 ...` AFAIK. That would happen every single time affixes are used, and affixes are extremely frequent in such languages.
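
A rough sketch of why that happens, as a toy model in Python. This is not docutils and not the real reST grammar: `toy_rest_emphasis` is a made-up function that assumes only one rule (an emphasis span must start after whitespace or at the start of the text, and end before whitespace, a backslash-space separator, or the end of the text) and ignores the punctuation cases the real rules allow.

```python
import re

# Toy boundary rule (an assumption, much simpler than real reST):
#   - "*" opens only at start of text or after whitespace
#   - "*" closes only before whitespace, a separator, or end of text
_EMPHASIS = re.compile(r"(?<!\S)\*([^\s*](?:[^*]*[^\s*])?)\*(?=\s|\x00|$)")


def toy_rest_emphasis(text: str) -> str:
    # Mark backslash-space as NUL + space so the boundary check still
    # sees whitespace next to the markup.
    marked = text.replace("\\ ", "\x00 ")
    marked = _EMPHASIS.sub(r"<em>\1</em>", marked)
    # The separator itself produces no output.
    return marked.replace("\x00 ", "")


print(toy_rest_emphasis("*이 페이지*는 ..."))     # affix attached: markup not recognised
print(toy_rest_emphasis("*이 페이지*\\ 는 ..."))  # backslash-space: recognised, then dropped
```

Under these assumptions the first call leaves the asterisks in place, and the second one emphasizes "이 페이지" with the affix still directly attached in the output.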

do_not_redeem 4 days ago | parent [-]

Oh I see, you're talking about this:

  thisis\ **bold**\ text
  thisis\ :strong:`bold`\ text

It's possible, but you're right, definitely more awkward than markdown.

chrismorgan 4 days ago | parent | prev [-]

reStructuredText and Markdown both have a bad habit of clevernesses that fall down—just in different areas.

Both do at least some degree of only matching delimiters at word boundaries. I consider that to be a huge mistake.

reStructuredText falls for it, but has a universally-applicable workaround (backslash-space as a separator—note that it is not an escaped space, as you might reasonably expect: it’s special-cased to expand to nothing but a syntax separator).

Markdown falls for it inconsistently, which as a user of languages that divide words with spaces, is honestly worse. Its rules are more nuanced, which is generally a bad thing, because it makes it harder to build the appropriate mental model. It was also wildly underspecified, though that’s mostly settled now. For many years, Stack Overflow used at least two, I think three but I can’t remember where the third would have been, mutually-incompatible engines, and underscores and mid-word formatting were a total mess. Python in particular suffered—for many years, in comments it was impossible to get plain-text (i.e. not `-wrapped code) __init__.

In CommonMark, _abc_ and *abc* both get you an emphasized abc, but a*b*c gets you abc with the b emphasized, while a_b_c gets you a literal a_b_c. That’s an admission of failure in syntax. Hmm… I hadn’t thought of this, but I suppose that makes _ basically untenable in languages with no word separator. Interesting argument against Prettier, which has a badly broken Markdown mode¹, and which insists on _ for emphasis, not *.
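
The asymmetry can be sketched with a toy in Python. This is a deliberately simplified approximation, not the CommonMark delimiter-run algorithm: `toy_commonmark_emphasis` is a hypothetical helper that assumes single-character delimiters only, no nesting, and that the one extra rule for _ is "no word character directly outside the delimiters".

```python
import re

# Asterisk: may open and close mid-word; content must not begin or end
# with whitespace.
_STAR = re.compile(r"\*([^\s*](?:[^*]*[^\s*])?)\*")

# Underscore: same, plus no word character immediately outside either
# delimiter (assumed simplification of the intraword restriction).
_UNDERSCORE = re.compile(r"(?<!\w)_([^\s_](?:[^_]*[^\s_])?)_(?!\w)")


def toy_commonmark_emphasis(text: str) -> str:
    text = _STAR.sub(r"<em>\1</em>", text)
    return _UNDERSCORE.sub(r"<em>\1</em>", text)


for sample in ["*abc*", "_abc_", "a*b*c", "a_b_c", "이*페이지*는", "이_페이지_는"]:
    print(sample, "->", toy_commonmark_emphasis(sample))
```

In this toy, the last two samples show the consequence for text without a word separator: the asterisk form is emphasized, and the underscore form is left as literal underscores.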

In my own lightweight markup language I’ve been steadily making and using for my own stuff for the last five years or so, there’s nothing about word boundaries. a*b*c is abc with the b emphasized, and if a dialect² defined _ as emphasis, a_b_c would be too.

Another example of the cleverness problem in reStructuredText is how hard wrapping is handled. https://docutils.sourceforge.io/docs/ref/rst/restructuredtex... is a good example of how badly wrong this can go. (Markdown has related issues, but a little more constrained. A mid-paragraph line starting with “1. ” or “- ”—both plausible, and the latter certain to occur eventually if you use - as a dash—will start a list.) The solution here is to reject column-based hard-wrapping as a terrible idea. Yes, this is a case where the markup language should tell people “you’re doing it wrong”, because otherwise the markup language will either mangle your content, or become bad; or more likely both.

Meanwhile in Markdown, it tries to be clever around specific HTML tags and just becomes hard to predict.

—⁂—

¹ Prettier’s Markdown formatting is known to mangle content, particularly around underscores and asterisks, and they haven’t done anything about it. The first time I accidentally used it it deleted the rest of a file after some messy bad emphasis stuff from a WYSIWYG HTML → Markdown conversion. That was when I discovered .prettierignore is almost completely broken, too. I came away no longer just unimpressed with some of Prettier’s opinions, but severely unimpressed with the rest of it technically. Why they haven’t disabled it until such things are fixed, I don’t know.

² There’s very little fundamental syntax in it: line break, indent and parsing CSS Counter Styles is about it. The rest is all defined in dialects, for easy extension.