Why Markdown emphasis fails in CJK: A deep dive into CommonMark's delimiter rules

45 points by hongminhee


kana

Personally, I haven't seen any (lite) markup language that has good support for CJK languages.

For example, many of them, including Markdown, reST, Org-mode, and HTML, treat consecutive lines of text as a single paragraph:

This is
the same paragraph.

This is useful if you're more used to keeping your text within the "80-or-so column" limit instead of putting it all on a single line. (And personally I prefer putting one sentence per line so that git diffs are more semantic.) However, the example above works because the parser can simply collapse the line break, along with any surrounding whitespace, into a single space: "This is the same paragraph". But in Chinese and Japanese text, spaces are rare, and the same approach can produce unexpected space-delimited text.
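To make the difference concrete, here is a minimal Python sketch of the two joining strategies. It is not how any of the parsers above are implemented; the function names and the character ranges (a rough approximation of "Chinese or Japanese character") are my own assumptions for illustration.

```python
import re

# Rough heuristic (an assumption, not a standard): CJK punctuation, kana,
# Han ideographs, and fullwidth forms count as "Chinese/Japanese" characters.
_CJ = re.compile(
    r"[\u3000-\u303f\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]"
)


def join_naive(lines):
    """Join wrapped lines the way most markups do: one space per line break."""
    return " ".join(line.strip() for line in lines if line.strip())


def join_cj_aware(lines):
    """Join wrapped lines, omitting the space when a line break sits
    between two Chinese/Japanese characters."""
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Only insert a space if at least one side of the break is not CJ.
        if out and not (_CJ.match(out[-1]) and _CJ.match(line[0])):
            out += " "
        out += line
    return out


print(join_naive(["This is", "the same paragraph."]))
print(join_naive(["这是", "同一段。"]))      # gains a stray space
print(join_cj_aware(["这是", "同一段。"]))   # no space at the break
```

The heuristic only looks at the two characters around the line break, so mixed text such as a Chinese line followed by an English word still gets a space.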

The problem is that all these markups build upon the assumption that words are delimited by spaces, and of course problems show up in a space-deficient environment. In the case of Org-mode, a common workaround is to use zero-width spaces (well, in Emacs it's not that hard to bind a mode-specific key to insert a ZWSP and strip them on export). I also see the developer is looking into this issue ("Chinese (and potentially Japanese) text inline markup in Org mode"), so I hope the situation can get better.
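The "strip them on export" half of that workaround is a one-line replacement. A minimal sketch in Python, assuming the exported output is available as a string; this is not Org-mode's actual exporter, and `strip_zwsp` is a made-up name:

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def strip_zwsp(exported: str) -> str:
    """Remove the zero-width spaces that were only there to help the
    markup parser find emphasis boundaries."""
    return exported.replace(ZWSP, "")


# Hypothetical exported HTML where ZWSPs surrounded the emphasized text.
html = "这是" + ZWSP + "<b>强调</b>" + ZWSP + "文本"
print(strip_zwsp(html))
```

Note that this removes every ZWSP, including any the author put there deliberately, so a real export filter might need to be more selective.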

(Another issue, unrelated to markup, is with italics. With spaces in between, italics mostly do fine in English text. But once you remove the spaces, like 强调文本, it suddenly becomes very crowded, with characters overlapping each other. And I've seen people doing game translation add a space before </i> tags to avoid this. But I guess it's more of a "text rendering hates you" issue.)