The Only Two Markup Languages
75 points by heavyrain266
It's nice to see more discussions of concrete syntax in markup languages! I've refactored most of Typst's current parser, so I have a lot of appreciation for the nuance of markup language syntax. Here are three thoughts I had while reading this:
First, to go along with the broad families, I would argue for including a third family based on Lisp or Tcl (Forth/Shell are similar, but I'll just write out these two). For example, here's how I would respectively render some nested markup: Lisp: (foo (attrib value) '(,(bar 'wrapped) text)); Tcl: [foo -attrib $value "[bar "wrapped"] text"]. The defining difference between these syntaxes and the others is the more explicit quoting of text content and the more implicit use of nodes/variables. While this approaches the explicitness of a normal programming language, I think this approach is underrated for a markup language syntax.
Second, a serious pitfall of the TeX and SGML syntax is the stringly-typed nature of attributes. Under this post's analysis, Typst's syntax would fall under the TeX family: #foo(attrib: value)[#bar[wrapped] text], but a notable difference for Typst is that here value is not just a string, but an expression in Typst's code-mode syntax. This enables far greater programming possibilities in Typst and allows it to embed a lightweight markup language alongside its heavier markup syntax. Having this in HTML would be the equivalent of allowing JavaScript expressions in attributes, like <foo attrib=1 + object.key[3] />. Indeed, JSX deserves mention for allowing this kind of code interpolation in HTML markup already.
Finally, for non-programming or non-text-based use cases (e.g. data transfer, configuration) KDL would be my recommendation for a modern take on the SGML syntax. Ex: foo attrib=value { bar { text "wrapped" }; text "text" }. KDL has actual non-string value types, a generic type-annotation syntax, and a stable standard definition. The example here suffers from being written inline and marking up text specifically, but the explicitness of the language and its kinder syntax make it a worthy successor to XML.
Regarding Lisp, a syntax for attributes I’ve seen sometimes (for example in Janet) is (foo :attrib value :another-attrib value wrapped-items…)
Haven't used Janet, but in my experience that syntax tends to have issues when you want to define some attribute+value pairs programmatically. Suppose (a b ...) is a list, you have a snippet saved-attribs of (:attrib1 x :attrib2 y), and you want to splice it into the attributes of (foo (:bar val :baz val) thing). If both are explicitly lists, then you can always do something spiritually similar to (foo (concat (:bar val :baz val) saved-attribs) thing), or maybe (foo (:bar val :baz val .. saved-attribs) thing) if you want it to be declarative rather than a function call.
If the attributes are some kind of flat expression rather than an explicit list, it's much more of a pain in the butt to splice lists in: you need some equivalent of CL's unquote-splicing (,@) and write (foo :bar val :baz val ,@saved-attribs thing). Looks nice and easy, sure! Then you have some annoying magic to parse it that fails when you accidentally have an odd number of key-value elements; and if you want to build the attribute list out of more complicated parts, you have to revert to something like the first form anyway; and really it would be easier to deal with and validate if you wrote it (foo thing :bar val :baz val ...), but then that gets difficult to read when thing is deeply nested, and on and on and on.
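To make the splicing difference concrete, here's a minimal Python sketch (all names hypothetical) contrasting an explicit attribute list with a flat key-value "plist" in the argument position:

```python
def node(tag, attribs, *children):
    """Explicit-list style: attribs is its own list of (key, value) pairs."""
    return (tag, list(attribs), list(children))

saved_attribs = [("attrib1", "x"), ("attrib2", "y")]

# Splicing is just list concatenation -- no special macro machinery needed:
n = node("foo", [("bar", "val"), ("baz", "val")] + saved_attribs, "thing")

def node_flat(tag, *items):
    """Flat-plist style: attributes and children share one argument list,
    so the parser must pair items up and hope the count comes out even."""
    attribs, rest = [], list(items)
    while len(rest) >= 2 and isinstance(rest[0], str) and rest[0].startswith(":"):
        attribs.append((rest[0][1:], rest[1]))
        rest = rest[2:]
    return (tag, attribs, rest)

# An odd number of key-value elements silently misparses:
# ":baz" has no value, so it falls through into the children.
bad = node_flat("foo", ":bar", "val", ":baz")
```

In the flat form, `bad` comes out as `("foo", [("bar", "val")], [":baz"])`, with the dangling `:baz` quietly demoted to a child, which is exactly the "annoying magic" failure mode described above.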
Elixir deals with it in a half-decent way, but imo it really is more trouble than it's worth. I'll check out Janet for real someday.
Unfortunately, I don't class s-expression/Lisp style stuff as being an "arbitrary proper markup language", at least by default. The main reason is that you cannot just "markup" pre-existing plain text with it that easily; you effectively have to restructure the text completely for it to work. At best you'd need to add a second syntax to make it clear what are the attributes vs what is the plain text.
As another comment states, Janet uses : to prefix attribute-like parameters; here's an example:
[foo :attrib value :another-attrib value `the plain text`]
Another very minor but obvious point: parentheses are a bad choice since they are commonly used within plain text, but if you just swapped them for [] or {}, it would be fine, and most people would still recognize the s-expression nature. And something like backticks to wrap the text might be needed to remove the ambiguity.
This is what I think the entire motivation behind the original markup languages was too: marking up pre-existing plain text documents, rather than starting from the "markup" language and adding text. The latter form is probably closer to a structured text format than a markup language.
To clarify a lot, I am not saying you cannot have different syntaxes that work and have them be used for structured formatting; rather, I don't class them as "arbitrary proper markup languages". And the question always is: are they a good idea any more for the vast majority of people's needs?
Janet has a "markup language" syntax built on top of its usual S-expression syntax called mdz - github source - which I wrote to help author documentation. The syntax emphasizes text, with other internal forms needing to be escaped - it is loosely based on Racket's Scribble, which itself might be based on Scribe, so I would consider it in the TeX family by the original article's classification.
To be clear, the tool is focused on generating HTML, but the general syntax is fairly easy to construct and could be universal.
It is a "proper" markup language, but I'd argue it is not an "arbitrary" one, since you cannot start from plain text and mark it up. Instead you'd call it something like a structured rich text format, rather than what is described in the article.
True, Lisp is not what I would call text markup. But deriving one from them is almost trivial, and has been done many times in many places with a "second syntax", usually just a designated escape character like Tex's backslash or Scribe's at-sign.
I may as well pose this directly (since I commented on this to the parent comment) - do you consider typst a proper arbitrary markup language, and if so do you consider it to be in the tex family?
(It seems to me that "wrapped text" always needs to be properly escaped to enable arbitrary nesting of markup, but markup languages make collisions sufficiently uncommon that you rarely consider escaping of content at all - does the nature of escaping define the language family?)
Typst in my opinion is a great "markup"/structured-text language, but it is neither arbitrary nor proper: its syntax has intrinsic procedural semantic meaning, and it cannot wrap arbitrary plain text. I'd argue we need a better term for such languages, as "markup" isn't the clearest---they're closer to a structured/rich text format.
I am planning on writing a follow-up article which will go into the actual different kinds of structured text languages and the ones people want to use for different purposes. I will mention Typst and how, for the vast majority of people's needs, it replaces TeX/LaTeX/et al.
Emphasis mine:
This does mean I am excluding things like Markdown, troff, IBM’s GML, Wiki, Emacs Org-Mode etc. The reason for this exclusion is because they are neither arbitrary nor proper. They have procedural semantic meaning to their syntax and it cannot be arbitrarily extended.
For example in Markdown, text has a very specific intrinsic semantic meaning. Whilst in SGML/XML, <foo>blah</foo> has no intrinsic semantic meaning without some extra program to enforce that.
I feel like this is a key distinction re: your second point. Once the text has syntax-directed semantic meaning to two human readers, it falls into this second tier.
I happened to read this article and start using Typst within 24 hours of each other.
It really drove home what I loved about #. At first I was confused why sometimes I want #v(2pt) and sometimes v(2pt), but it totally makes sense, and even more so in the context of this article.
I wonder if we should consider typst a third markup language family or not? Obviously a lot of thought was put into why tex is useful and how to make the syntax more modern - and yet typst doesn't "look" like tex.
Is it missing any of the ingredients of a "proper arbitrary markup language"?
Regarding Tcl, there's the doctools markup language, complete with formal syntax definition and generic parser.
Notably, the authors adopted a similar classification as ~gingerBill's and they describe doctools as LaTeX-like, instead of like SGML and similar languages.
I cannot think of a case when overlapping hierarchies like this is desired
Overlapping markup is actually quite common. My preferred example is Bibles, where you have two simultaneous hierarchies: sections/paragraphs and book/chapter/verse.
The three approaches are:
Close your ears and pretend the two hierarchies mesh. This used to be done more often, especially as the KJV was typically set in verse-per-line form with the pilcrow (¶) marking the beginning of a paragraph, and especially because they did mesh in the KJV, but no one serious does it this way any more.
Focus on visual presentation, and just represent the chapter and verse start markers—roughly what paper does. USFM (one of the most common formats for representing Bibles) works this way: \v 34 . But suppose verse 35 starts a new section with a (non-canonical) heading above it. You’ll have to choose heuristics to decide where verse 34 was supposed to end, if you want to retrieve the text for a verse range. But USFM 3 did add the concept of milestones for some purposes.
Record start and end as atoms. OSIS (another common format, focused more on rigour of representation) makes various elements “milestoneable”. You can write <verse osisID="Gen.12.34">…</verse>, but it’s generally discouraged: you should normally use the milestone form instead, <verse osisID="Gen.12.34" sID="v1"/>…<verse eID="v1"/>. You can thus keep that heading out of verse 34.
See also Wikipedia on overlapping markup.
Fun extra fact: in early versions of IE, the DOM actually wasn’t a tree. If I recall correctly, it was still possible as late as IE 6 to produce irregular (overlapping or multi-parented) DOM.
I cannot think of a case when overlapping hierarchies this is desired
Overlapping markup is actually quite common. My preferred example is Bibles, where you have two simultaneous hierarchies: sections/paragraphs and book/chapter/verse.
Or, more simply: pages and paragraphs in OCR'd books.
<page>
<p>It was a dark and stormy night.</p>
<p>"How can I take you anyplace
</page>
<page>
when it's a dark and stormy night?"
he said.</p>
</page>
For extra fun: think how you could mark up the page number "1" that appears in the corner of the page after "anyplace". (If you simply add <page-num>1</page-num> before the first </page>, then you're saying that the page number is part of the second paragraph.)
think how you could mark up the page number "1" that appears in the corner of the page after "anyplace".
From a data perspective, it should be an attribute on the page. Hopefully you’ll be content with only a string, since XML attributes are limited to that.
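To sketch that attribute approach (hypothetical element and attribute names, using Python's standard library XML parser): the page number rides on the page element rather than inside either paragraph, at the cost of always being a string.

```python
import xml.etree.ElementTree as ET

# The page number as an attribute on <page>, outside both paragraphs:
page = ET.fromstring('<page num="1"><p>It was a dark and stormy night.</p></page>')

num = page.get("num")
print(type(num).__name__, num)  # str 1 -- attributes are strings, never ints
```

This keeps the number out of the paragraph hierarchy entirely, which is the point: it belongs to the page, not to the interrupted sentence.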
Oh, I hate that. XD Just as a programmer trying to represent data, I hate that so much. Give me my nicely-nested trees, dammit, or get out!
Or any natural language text, where you have various layers of description like morphology, syntax, phonology, pragmatics, all of which give potentially partially overlapping trees. (And even within one layer like syntax, trees might not be good enough: https://en.wikipedia.org/wiki/Cross-serial_dependencies )
The kind of overlapping hierarchy I was talking about was the syntactical <a><b>text</a></b> example, not the concept of an overlapping hierarchy in general. The example you gave is a great actual use case, BUT all of the solutions seem not to rely on the syntactical "flaw" but instead use standalone "nodes" to mark the beginning and end of the groupings.
Yeah, it seems to me that it’s useful to have some way of marking “ranges” at different granularities, but separate from the actual document structure used for presentation.
I'd say Lisp or rather S-expressions would be a third. Since the article mentions TeX, I'm thinking about Pollen by Matthew Butterick. Here is an example from the docs:
#lang pollen
◊headline{Pollen markup}
◊items{
◊item{You ◊strong{wanted} it — you ◊em{got} it.}
◊item{◊link["https://google.com/search?q=racket"]{search for Racket}}
}
Lisp or rather S-expressions
OK, but how do you do attributes? You could do something like
'('foo ())
'('foo () "wrapped text")
'('foo (('attrib "value")) "wrapped text")
'('foo (('attrib "value")))
But that's not a natural part of the syntax. S-expressions aren't markup for the same reason JSON isn't markup.
OK, but how do you do attributes?
GNU Guile's SXML module uses (@ ...) for attributes.
'(foo)
'(foo "wrapped text")
'(foo (@ (attrib "value")) "wrapped text")
'(foo (@ (attrib "value")))
SXML uses this because @ is not a valid element name, and it alludes to the word "attribute".
The Hiccup syntax for HTML widely used in the Clojure ecosystem looks like
[:span {:attrib "value"} "wrapped text"]
and is introduced like
The first element of the vector is used as the element name. The second element can optionally be a map, in which case it is used to supply the element's attributes. Every other element is considered part of the tag's body.
The example above would produce the following HTML:
<span attrib="value">wrapped text</span>
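Those rules are simple enough to sketch as a toy renderer. This is a hypothetical Python approximation of the Hiccup convention (keywords modeled as ":"-prefixed strings), not the real Clojure library:

```python
from html import escape

def render(form):
    """Render a Hiccup-style form: [tag, attrs?, *body] or a plain string."""
    if isinstance(form, str):
        return escape(form)
    tag, rest = form[0].lstrip(":"), list(form[1:])
    attrs = {}
    # Optional second element: a map of attributes.
    if rest and isinstance(rest[0], dict):
        attrs = rest.pop(0)
    attr_str = "".join(f' {k.lstrip(":")}="{escape(str(v), quote=True)}"'
                       for k, v in attrs.items())
    # Everything else is body, rendered recursively.
    body = "".join(render(child) for child in rest)
    return f"<{tag}{attr_str}>{body}</{tag}>"

print(render([":span", {":attrib": "value"}, "wrapped text"]))
# <span attrib="value">wrapped text</span>
```

Note how the plain text has to sit inside a quoted string in the data structure, which is exactly the "not arbitrary" objection raised below.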
The Hiccup syntax would make it a "proper markup language" but not "arbitrary", to use my definitions. The reason being that you cannot use it to mark up pre-existing plain text. In a way, you have to start with that syntax and then add the text to it, rather than start with the plain text and then mark it up (which is what I believe is the entire motivation behind the original markup languages).
And the use of quotes means you would have to make sure to escape them within the text so that things can work. I'd argue that quotes are very common in some European languages, and thus would require a lot of corrections. One of the benefits of both TeX and SGML is that they "start" with symbols which are not commonly used in everyday text. Backslash has only existed since the dawn of computers and was never seen in actual text before (making it a near perfect choice), and less/greater-than are rarely used outside of mathematical texts, making them a good option too.
Nonetheless, would it be fair to say that truly “arbitrary” text (Unicode strings) must be escaped according to their level of nesting?
And that arbitrariness is a matter of degrees - how commonly human texts require such escaping?
(Actually - is it true that XML/SGML-family grammars don’t rely on the nesting depth for escaping while something like “nested-markup-in-JSON” does?)
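One way to see the JSON side of that question: every time a document is embedded as a JSON string, its quotes and backslashes have to be escaped again, so the escaping compounds with nesting depth, whereas XML's &amp;/&lt; entities escape the same way at every depth. A small demonstration using Python's standard json module:

```python
import json

doc = 'say "hi"'

# One level of embedding: quotes become \" ...
level1 = json.dumps(doc)
print(level1)   # "say \"hi\""

# ...embed that result again and the backslashes themselves get escaped:
level2 = json.dumps(level1)
print(level2)   # "\"say \\\"hi\\\"\""
```

Each additional level roughly doubles the backslashes, which is why "nested-markup-in-JSON" gets unreadable fast.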
Escaped according to their level of nesting? I am not even sure what this means.
I think the point I am trying to make is quite simple. Take some plain text (whatever encoding or language), and can you mark it up to add extra information to it? If you can do that with your new syntax, with a minimal need to escape specific characters, then you have an arbitrary markup language. If that language also supports a general syntax for attributes, then it's also an arbitrary proper markup language.
The level of "nesting" is irrelevant to this point.
I agree that JSON is not markup. It is "object notation". S-Expressions are similarly object notation.
Pollen actually uses X-expressions. There are various ways of expressing attributes in the Lisp universe, we would probably consider them all a family which belongs together. After all, it is customary to create your own syntax in that community.
A former colleague once defined Lisp as "the language where you write your AST yourself". 😉
I would actually probably place Pollen in the TeX family for this discussion. While it does evaluate to S-expressions semantically, syntactically there's very little difference to TeX.
YAML is actually not a superset of JSON. YAML 1.2 says it is, but there are two issues. JSON unicode escapes are UTF-16, while YAML has UTF-8 unicode escapes but not the JSON escape syntax. AFAICT, most YAML parsers do not accept JSON unicode escapes in strings. Additionally, YAML 1.2 seems to require a YAML 1.2 header to be treated as YAML 1.2, which means a JSON doc would not be YAML 1.2.
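The UTF-16 point is easy to see with Python's standard json module: a non-BMP character is escaped as a surrogate pair, which is the exact escape form the comment above says many YAML parsers stumble on (I haven't verified specific YAML parsers, so treat that part as the parent's claim):

```python
import json

# JSON escapes the emoji U+1F600 as a UTF-16 surrogate pair:
s = json.dumps("\U0001F600", ensure_ascii=True)
print(s)  # "\ud83d\ude00"

# Round-tripping through json recovers the original character:
print(json.loads(s) == "\U0001F600")  # True
```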
As I imply in my article, all this tells me is to NEVER use YAML, for goodness sake. YAML is a mistake that needs to be destroyed in the fires that created it.
I'm honestly not exactly sure what the motivation was for the YAML folk to put that in the spec
You could make this comment on about half the shit in YAML TBH.
I will forever be baffled by the continued insistence on using YAML.
I don't know why they did it, but every time I have the misfortune of needing to write a yaml file and yet am able to not write a yaml file I breathe a silent thanks for it.
It is rather nice to just be able to output a JSON file to generate input to a tool that expects YAML (looking at you, kubernetes), since JSON writers are a hell of a lot more common.
The article says:
the TeX family of syntaxes are much easier to parse than the SGML family
It also suggests that \foo{wrapped text} and <foo>wrapped text</foo> are somewhat equivalent.
I suspect many are not aware how TeX parsing works. There is no generic parser for TeX. You cannot parse TeX without interpreting/running it. This is a big difference to SGML/XML/S-Expressions/... where a generic library can produce an AST data structure.
In the above case, TeX will recognize \foo and then execute it. The implementation decides if it consumes another token, i.e. {wrapped text}, or not or maybe only under certain conditions.
You can pretend that a certain command means a certain thing. For example, pandoc does this to convert from LaTeX to Markdown or whatever. If you extend TeX with some custom command, e.g. \foo, then pandoc will not know how many parameters it takes. In contrast, a generic XML parser will still handle anything valid, and it doesn't matter how many attributes there are.
So, TeX syntax might be "easier to parse" when judged visually and subjectively by a human. In the sense of "deterministically recognizing the structure", TeX in general is not easy.
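A toy sketch makes the arity problem concrete (hypothetical macro names; this is a deliberately naive tokenizer, nothing like real TeX): the parser must consult a macro table to know how many brace groups each command consumes, and an unknown macro leaves the structure undecidable.

```python
import re

# Without this table, \frac{1}{2} vs \frac {1}{2}-as-text is ambiguous.
ARITY = {"textbf": 1, "frac": 2}

def parse(src):
    """Naively tokenize into macros, brace groups, and text, then attach
    each macro's arguments according to its known arity."""
    tokens = re.findall(r'\\[A-Za-z]+|\{[^{}]*\}|[^\\{}]+', src)
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("\\"):
            name = tok[1:]
            n = ARITY.get(name)
            if n is None:
                raise ValueError(f"unknown macro \\{name}: arity unknown, cannot parse")
            args = [t[1:-1] for t in tokens[i + 1:i + 1 + n]]
            out.append((name, args))
            i += 1 + n
        else:
            out.append(tok)
            i += 1
    return out

print(parse(r"\frac{1}{2} half"))  # [('frac', ['1', '2']), ' half']

try:
    parse(r"\foo{wrapped text}")
except ValueError as e:
    print(e)  # unknown macro \foo: arity unknown, cannot parse
```

An XML parser has no such table: the grammar alone determines where every element starts and ends, whatever its name.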
Note I am saying the "TeX family" and not actual TeX itself. I know how insane TeX is, and I did not want to get into how context-sensitive its grammar really is. I just wanted to focus on the arbitrary proper markup syntax in isolation rather than the semantics of a specific language in the "TeX family".
I've added this comment as a side/margin note. So thank you for the comment.
I agree that YAML is not a markup language and neither is JSON. But I think it's apples and pears. There is to me a difference between a markup language and a textual and literal representation/encoding of data. YAML, JSON (and EDN) are specifically encodings or expressions of data. To me, those exist in the same solution space as Avro and protobuf. You may use XML to transmit data, but generally markup implies some algorithm or interpreter to take it from the marked-up thing to the artifact. (edit: completed the thought)
I went a bit crazy with CSTML and let the tags just have a JSON object in them as attributes...
<Node { foo: { bar: null, baz: 'value' } }> 'innerText' </>
Plus, no named close tags means no overlapping spans.
Also fun fact, YAML is actually a superset of JSON, such that all valid JSON documents are also valid YAML documents.
This is only true of YAML 1.2 and beyond.
By arbitrary, I mean the grammar specifically, and how it can be used mark arbitrary plain text with information. And by proper, I mean the ability to have standalone nodes, user-definable nodes, nodes with attributes, and the wrapping of plain text.
I think there's a contradiction here. Plain text doesn't have a grammar, we can't "mark it up" unless it does.
I like that the text highlights orange and all the default UX still works.
In a weird way, plain text kind of does have a sort of pseudo-grammar that people assume of it, because of whitespace. Newlines are treated as a special kind of whitespace (multiple newlines usually signify a paragraph break), and general spaces only separate other "things" (other characters).
For many people, the amount of spaces used is just for "alignment" or "stylization" and doesn't have any semantic meaning. This assumption is why an arbitrary (not necessarily proper) markup language can exist.
I don’t think this sufficiently motivates the underlying framing: why should I care if a given markup language meets these definitions of “proper” and “arbitrary”? There might be a good reason, but it’s left unstated.
There appears to be empirical evidence that markup languages can be and are successful without having these qualities (and going further, it seems like markup languages that don’t have these qualities tend to be very popular).
Because that wasn't the motivation of the article in the slightest. I just want to describe a specific set of markup language families and nothing more. It's not meant to be a "motivational" article, rather something to show a distinction which many overlook.
As for your questions, they are good and I can answer them, and explain why I made the distinctions.
"Markup" languages historically were always thought of as marking up plain text. Short answer: the alternatives that you might think are "empirically good" are examples of things which are not "markup" languages but actually something else. Maybe better described as a "structured rich text format" instead, where the markup was not added on later.
Many modern "pseudo-markup" languages are better than the arbitrary proper markup languages because they are domain specific. They are better for the task at hand. But they make many concessions and, as I say, are not actually even "markup" languages over "plain text".
SGML … “wrapping of plain text”? What? Oh wait, they mean “enclosing”. Wrapping is something else entirely.
To wrap something in something else is perfectly normal usage of the word "wrap". You're being pedantic to the point of incorrectness.
Wrap is of Germanic origin and enclose is of Latin origin; they mean the same thing. Please don't be pedantic about something you are wrong about, based only on an idiosyncratic distinction in your own mind.