Why Markdown emphasis fails in CJK: A deep dive into CommonMark's delimiter rules
45 points by hongminhee
Personally, I haven't seen any (lite) markup language that has good support for CJK languages.
For example, many of those, including Markdown, reST, Org-mode, and HTML, treat consecutive text lines as a whole paragraph:
This is
the same paragraph.
This is useful if you're more used to limiting your texts to the "80-or-so column" limit instead of putting them all on a single line. (And personally I prefer putting one sentence per line so that git diffs are more semantic.) However, the example above works because the parser can simply turn consecutive spaces into a single one: "This is the same paragraph". But in Chinese and Japanese texts, spaces are rare and the same approach can produce unexpected space-delimited text.
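To make the failure concrete, here is a minimal sketch using commonmark.js (assuming the commonmark npm package): the soft line break survives into the HTML, and browsers typically display it as a space.

```ts
// A hedged sketch: commonmark.js keeps the soft line break as a newline inside <p>,
// and browsers typically render that newline as a single space.
import { Parser, HtmlRenderer } from "commonmark";

const render = (src: string) =>
  new HtmlRenderer().render(new Parser().parse(src)).trim();

// English: the injected space lands between two words, so nothing looks wrong.
console.log(render("This is\nthe same paragraph."));
// <p>This is
// the same paragraph.</p>   → displayed as "This is the same paragraph."

// Japanese: the same newline shows up as an unwanted space mid-sentence.
console.log(render("これは\n同じ段落です。"));
// <p>これは
// 同じ段落です。</p>         → displayed as "これは 同じ段落です。"
```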
The problem is that all these markups build upon the assumption that words are delimited by spaces and of course problems show up in a space-deficient environment. In the case of Org-mode, a common workaround is to use zero width space (well, in Emacs it's not that hard to bind a mode-specific key to insert ZWSP and strip them on export), and I see the developer is also looking into this issue (Chinese (and potentially Japanese) text inline markup in Org mode), so I hope the situation can get better.
(Another markup-irrelevant issue is with italics. With spaces in between, italics mostly do fine in English texts. But once you remove the spaces, like 强调文本, it suddenly becomes very crowded with characters overlapping with each other. And I've seen people doing game translation adding space before </i> tags to avoid this. But I guess it's more a "text rendering hates you" issue.)
I don't think italics make sense for CJK in the first place...
That's an interesting issue.
In the old days, we used <i> tags to specify italics, <b> tags to specify bold, and <u> tags to specify underline. The tags were presentational; the semantic implications of those text styles were left to the reader. You would therefore naturally not use <i> when writing a language where italics don't make sense.
However, these days, we use ✨semantic✨ HTML. <i> is replaced with <em>, representing semantic "emphasis"; and <b> is replaced with <strong>, representing semantic "strong emphasis". Underline is removed. This, theoretically, should allow the user agent to represent emphasis in whatever way makes sense in context. But if they all use <em> to mean italics even when writing languages where italics is not used for emphasis, this whole "semantic HTML" thing kinda falls apart.
I don't know how you would solve this. Maybe indicating semantic level of emphasis instead of stylistic italics and bold was the wrong move. Or maybe the overwhelming cultural force of western standard bodies will slowly shift CJK languages to treat italics as emphasis.
And there’s the extra layer of irony that Markdown is based on typewriter and email / usenet conventions for imitating visual typographic styling. Markdown processors convert the visual styling into “semantic” tags (that might or might not correspond to the intended meaning) only for HTML to convert the semantic tags back to typographic styling.
For instance, it’s conventional (but a bit old fashioned) to set borrowed foreign words and phrases in italic, but these italics don’t indicate emphasis. HTML doesn’t provide any markup that matches this usage better than <i>.
See https://developer.mozilla.org/en-US/docs/Web/CSS/Reference/Properties/text-emphasis-position . In theory, you should be using CSS to make <em> into emphasis dots, but in practice, the web just anglicized Japanese typography.
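As a rough illustration of that "in theory" path, here is a hedged sketch (not anyone's production stylesheet) that opts Japanese <em> out of italics and into emphasis dots; the selector and the particular emphasis style are my own choices, not a standard convention.

```ts
// The CSS is the point here; the TypeScript only injects it.
const style = document.createElement("style");
style.textContent = `
  em:lang(ja) {
    font-style: normal;                 /* drop the italic slant */
    text-emphasis: filled sesame;       /* 圏点: a mark over each character */
    text-emphasis-position: over right; /* above the line in horizontal writing */
  }
`;
document.head.append(style);
```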
Does anyone here know what east asian language forums put in their CSS for <em> tags? Do they just not use that tag because it isn't useful for them?
In Chinese emphasis is sometimes done by using a different typeface. e.g. Kai for emphasized, Song for normal text.
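A hedged sketch of what that could look like on the web; the font names are platform guesses ("Kaiti SC" on macOS, "KaiTi" on Windows), not something specified above.

```ts
// Keep Chinese <em> upright and switch to a Kai face instead of italics.
const sheet = new CSSStyleSheet();
sheet.replaceSync(`
  em:lang(zh) {
    font-style: normal;
    font-family: "Kaiti SC", "KaiTi", "STKaiti", serif;
  }
`);
document.adoptedStyleSheets = [...document.adoptedStyleSheets, sheet];
```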
I wonder the same thing. From some googling, it looks like there are typographical conventions for emphasis in East Asian languages:
https://en.wikipedia.org/wiki/Emphasis_mark
How these are used in HTML is unknown to me.
In my experience these are almost never used on the Web. The only place I've seen them are exam papers.
Another markup-irrelevant issue is with italics. With spaces in between, italics mostly do fine in English texts. But once you remove the spaces, like 强调文本, it suddenly becomes very crowded with characters overlapping with each other. And I've seen people doing game translation adding space before </i> tags to avoid this.
I'd say it's a pretty big failure of text rendering if inter-glyph spacing isn't adjusted to account for the italics! Though, a common issue, unfortunately. If I say hey! *hey*!: hey! hey!, then at least on my installation of Firefox, the italicized y overlaps the exclamation mark by about a pixel, while the upright y has a little space after it.
Huh, that's interesting.
When I started playing with Gemini, I was very surprised about how gemtext handles line breaks, which is unlike all the other lightweight markup languages I know.
I thought it helps if you're writing poetry or similar stuff where proper formatting requires <br>. But perhaps it's also better for the problem you mention?
(However, I really like one-sentence-per-line editing. I wonder if there's any equivalent for languages like Chinese and Japanese.)
Multilanguage issues, however, are where gemtext has its limitations. There is no way to annotate which language each chunk of text is in. I don't think anyone does this in the formats that allow it, but annotating languages could help? I would expect there are CSS selectors for language?
I don't think anyone does this in the formats that allow it
Data point of one, but I do it. For instance <i lang="fr" title="French Onion Soup">Soupe à L’Oignon gratinee</i> (edited to add: or did I misunderstand your statement as meaning markup languages other than HTML?).
I would expect there are CSS selectors for language?
Kind of. You can select tags with the lang attribute (p[lang]), an exact language (p[lang="en-GB"]) or a more fuzzy matching (p[lang|="en"] for all dialects of English).
[lang] is not the right tool if you’re changing text styles (e.g. font or colour) based on language. :lang(…) is more generally useful, taking into account things like ancestors’ lang attributes or a content-language header, and also handling subtags nicely (:lang(en) is the spiritual equivalent of [lang|="en"]; the [attr |= value] attribute selector should be considered obsolete). :lang is universally supported (IE 8 was the last browser to implement it).
(Related is the significantly more important :dir(…) which is far more useful than [dir=…] for similar reasons, but which has historically been neglected because Chromium and Safari didn’t implement it until 2023, despite Firefox having had it since 2016, or since 2012 spelled :-moz-dir(…). But by most entities’ feature baseline standards, it’s now safe to rely on, with two years in Chromium and almost three in Safari.)
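A quick way to see the inheritance difference, sketched with Element.matches(); the markup is made up for illustration.

```ts
// :lang() matches the element's computed language, including one inherited from an
// ancestor's lang attribute; [lang|=…] only matches elements that carry the attribute.
document.body.innerHTML = `
  <div lang="fr">
    <p id="inner">This paragraph has no lang attribute of its own.</p>
  </div>`;

const p = document.querySelector<HTMLParagraphElement>("#inner")!;
console.log(p.matches(":lang(fr)"));     // true  – language inherited from the ancestor
console.log(p.matches('[lang|="fr"]'));  // false – no lang attribute on the <p> itself
```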
This is a good example of a markup format having been designed with not all human languages in mind, and making some languages harder to mark up as a result.
Some additional details I found by experimenting with the commonmark.js dingus:
Though the article only mentions ** syntax, the same problem happens with the alternative __ syntax for strong emphasis, as well as with the * or _ delimiters for (normal) emphasis.
There are workarounds for the problem: you can fall back to <strong> tags, or put a space (a hair space also works, but a zero width space does not) between the ** delimiter and the adjacent ordinary character.
You can see these results by rendering this Markdown with the dingus:
Strong emphasis not parsed:
- **마크다운(Markdown)**은
- **마크다운(Markdown)**은 (zero width space)
- __마크다운(Markdown)__은 (`__` delimiter)
- *마크다운(Markdown)*은 (`*` delimiter)
- _마크다운(Markdown)_은 (`_` delimiter)
- このような**[状況](...)は**
- このような**[状況](...)は** (zero width space)
- このような__[状況](...)は__
Emphasis not parsed either:
- このような*[状況](...)は*
- このような_[状況](...)は_
Strong emphasis parsed:
- <strong>마크다운(Markdown)</strong>은 (`<strong>` tags)
- **마크다운(Markdown)** 은 (hair space)
- このような<strong>[状況](...)は</strong>
- このような **[状況](...)は**
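The same checks can be run programmatically; here is a hedged sketch against commonmark.js (assuming the commonmark npm package, the library behind the dingus).

```ts
import { Parser, HtmlRenderer } from "commonmark";

const render = (src: string) =>
  new HtmlRenderer().render(new Parser().parse(src)).trim();

// The closing ** is preceded by ')' and followed by the particle 은,
// so CommonMark refuses to treat it as a closer: the asterisks stay literal.
console.log(render("**마크다운(Markdown)**은"));
// <p>**마크다운(Markdown)**은</p>

// A hair space (U+200A) after the closing ** counts as Unicode whitespace,
// which is enough to let the delimiter close.
console.log(render("**마크다운(Markdown)**\u200A은"));
// <p><strong>마크다운(Markdown)</strong> 은</p>   (hair space before 은)
```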
When I first glanced at your comment I thought you were calling one of the commonmark.js contributors a dingus :)
This type of misbegotten cleverness is unreasonably popular in lightweight markup languages (LMLs). It affects scriptio continua (where you’re not using a word separator) the most, but also other places: it’s perfectly reasonable to want to emphasise part of a word.
Markdown is an inconsistent mess, though at least these days you’re almost always dealing with the basic CommonMark rules, limiting the variation of the madness so that the biggest weirdness remaining is the difference between underscore and asterisk (a_b_c stays literal, while a*b*c italicises the b). Some desired stylings are inexpressible, which is bad. Falling back to HTML is not always permitted, and even when it is it’s ugly.
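A hedged sketch of that asymmetry, again assuming the commonmark npm package:

```ts
import { Parser, HtmlRenderer } from "commonmark";

const render = (src: string) =>
  new HtmlRenderer().render(new Parser().parse(src)).trim();

// Asterisk may open emphasis inside a word; underscore may not.
console.log(render("a*b*c")); // <p>a<em>b</em>c</p>
console.log(render("a_b_c")); // <p>a_b_c</p>
```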
reStructuredText has the same basic problem, but with more consistent rules and a universally-applicable workaround: backslash-space acts as a word boundary, so you can make a word **part**\ ly bold. It’s a weird syntax choice, though, because it’s not an escaped space like any other backslash escape—it doesn’t emit anything.
AsciiDoc defaults to constraining to word boundaries, but lets you double the delimiter to get unconstrained. This is… a choice. AsciiDoc has a lot of esoteric and idiosyncratic syntax. It’s terse and hard to learn, the Perl of LMLs.
I’ve been making and using my own lightweight markup language for five years or so, and I decided very early on on a philosophy of simple rules, even if nuanced rules may often be more convenient.
In my language, nothing cares about word boundaries, and *…* is just emphasis wherever it occurs. An unclosed delimiter will produce something like an HTML parse error (which don’t stop processing by default and have defined behaviour). a * b = c could be expressed in XML as something like a <em> b = c</em><?parse-error "unterminated *"?>.
reStructuredText frames the rationale for these features thus: “The inline markup recognition rules were devised to allow 90% of non-markup uses of *, ` [aside: I don’t think it’s possible to produce <code>`</code> here on Lobsters, which is a related problem in Markdown], _, and | without escaping.” I just think that’s a mistake of a goal. A simple rule is easier to understand and apply. People make errors in both directions all the time, you can’t really stop that, so I reckon simplicity is better. Especially when it’s maximally-flexible, and lets you write less code and parse faster.
A related area: hard-wrapping. At present, every markup language that supports hard-wrapping gets it wrong for all forms of scriptio continua, and also for cases like em dashes at the start or end of lines—and that was the issue over which I decided to make my own LML. They all turn the line break into a space, which is not always appropriate. HTML/CSS leaves scope for a better implementation, but no user agent has done anything better yet.
The cleverness angle also bites LMLs on paragraph boundaries, because of hard-wrapping. Markdown and reStructuredText both handle list markers in crazy but different ways. Consider https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#enumerated-lists:~:text=enumerators.-,For,dude.,-Caution. Too clever. This is also an area where I believe my LML is better than all before (in a more interesting way which I won’t describe here).
Typst seems to handle this rather fine, which makes me happy, as I've already switched to it as my LML of choice.
Only partly, by my standards. Typst docs on strong:
To strongly emphasize content, simply enclose it in stars/asterisks (`*`). Note that this only works at word boundaries. To strongly emphasize part of a word, you have to use the function.
What is a word boundary? They don’t say. You might expect it to match UAX #29 word boundaries, but it doesn’t. Instead, it checks the preceding and succeeding source characters (that is, including Typst syntax characters; somewhat eww, but understandable in its parser design) and disqualifies * or _ if both of those characters are “wordy”. “Wordy” used to mean just char::is_alphanumeric(), but now it additionally excludes the scripts Han, Hiragana, Katakana and Hangul, so characters from those scripts no longer block the delimiter. In other words: it’s a bandaid on a bad definition.
And it still doesn’t support marking up part of a word properly. Well, I suppose you can write something like _part_#[]ly instead of #emph[part]ly if you want.
All this excessive cleverness also only applies to * and _. Not $, not `, not #. It’s inconsistent.
So, it’s the same old half-baked, unnecessary, foolish cleverness, unfortunately.
Oh, silly me. I should've read the docs instead of just trying a bunch of things. Thanks for the correction.
I'm gonna write up a request to change this.
I don’t think it’s possible to produce <code>`</code> here on Lobsters, which is a related problem in Markdown
The Markdown `` ` `` produces that. If you add single spaces inside ` delimiters, CommonMark renderers will strip them. You can also represent space-surrounded code like ‘ spaced ’ by adding two spaces inside the ` delimiters.
CommonMark’s reference implementation supports marking up code surrounded by any number of spaces. However, it seems that the renderer Lobsters uses has a bug where it collapses multiple visible spaces into a single visible space, whether the spaces are at the boundaries of the code or inside it. Example of this bug: two spaces in ( ).
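For reference, a hedged sketch of the space-stripping rule in action (assuming the commonmark npm package):

```ts
import { Parser, HtmlRenderer } from "commonmark";

const render = (src: string) =>
  new HtmlRenderer().render(new Parser().parse(src)).trim();

// One space is stripped from each end of a code span if both ends have one…
console.log(render("`` ` ``"));      // <p><code>`</code></p>
// …so doubling the spaces keeps a visible space on each side of the code.
console.log(render("`  spaced  `")); // <p><code> spaced </code></p>
```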
Ah, thanks. I had quite forgotten about that, though I’m confident I knew it once, because I did delve deep into it all five years ago.
a bug where it collapses multiple visible spaces into a single visible space
That’s actually just a CSS thing: add code { white-space: pre-wrap } or similar, the spaces are there. (I remember being surprised when I first learned that the UA default stylesheets only preserve whitespace for pre (and listing/plaintext/xmp), not for code (and kbd/samp/tt). It’s not unreasonable to want to include a significant leading or trailing space in a code element, or multiple internal spaces.)
You’re right, Lobsters’ Markdown renderer does output multiple spaces correctly. I was misled by Chrome’s dev tools, whose Elements view displays text content with collapsed spaces until you double-click to edit the text. The Elements view shows spaces as collapsed even for text within a pre element that keeps the default style white-space-collapse: preserve, such as this element:
two spaces in ( )
I found that there is a project github.com/tats-u/markdown-cjk-friendly that tries to fix this problem. It includes a suggested patch to the CommonMark specification as well as plugins for some JavaScript Markdown renderers. The patched specification says to treat CJK characters differently from other characters when parsing emphasis.
The Comrak Markdown renderer, which Lobsters uses via Commonmarker, supports that specification as an extension named cjk-friendly-emphasis. Comrak disables that extension by default, and Lobsters does not enable it.
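A hedged sketch of what opting in could look like on the JavaScript side, assuming the markdown-it plugin published by that project (the package name below is taken on trust; check the project's README):

```ts
import MarkdownIt from "markdown-it";
// Package name assumed from the markdown-cjk-friendly project; verify before use.
import cjkFriendly from "markdown-it-cjk-friendly";

const md = new MarkdownIt().use(cjkFriendly);

// With the plugin, the closing ** is accepted even though the particle 은 follows it.
console.log(md.render("**마크다운(Markdown)**은"));
// <p><strong>마크다운(Markdown)</strong>은</p>
```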