In Defense of YAML
41 points by nrposner
It's odd to describe YAML as arising to fill a gap in JSON. I have a recollection of YAML coming out of the perl community, which I thought suggests that it might be as old or even older than JSON.
So I went digging to find the actual history, and found quite an interesting post from Ingy dot Net, one of the creators of YAML, which puts its genesis somewhere between November 1999 and January 2001.
I also found another post on the history of JSON, which puts a primordial form of JSON (using a hidden iframe to send a message embedded in a <script> tag from the server to client) at April 2001, though the technique might be older than that. JSON did not become a format in its own right until 2002 when Douglas Crockford published json.org.
So YAML seems to slightly predate JSON, though more charitably it would be fair to suggest that they developed concurrently. It's not accurate to describe it as a reaction to JSON or arising to fill a gap in JSON. Instead, YAML was actually a reaction to XML. JSON's reason for becoming was more practical - literally embedding data in <script> tags - though it too was well placed as a simpler alternative to XML when the pendulum started swinging back, and it is undeniable that this played a large role in its popularity.
Anyway, the article has hints of LLM writing imo. I see that the project it's announcing is vibecoded too.
It's correct to say that YAML was originally a reaction and alternative to XML.
Yes, it most certainly was. XML workflows were considered slow back then because the most reliable tools were written in Java. Since many devs were running < 100 MHz single-core 32-bit processors with < 100 MB RAM, processing XML in Java was really slow. But XML did give you wider characters than 8 bits by default, something completely taken for granted now that could cause huge problems late last century. DKOI (EBCDIC encoding for Russian Cyrillic) to ISO 8859-5? Don't mind if I don't.
YAML didn't really need to exist after XML processing got faster, but here we are. TOML is … ini files that got really drunk.
I got the same whiff of LLM in the YAML complexity criticism and later praise of the newer YAML spec: the complexity criticism cites the 1.2.2 spec being 4 levels deep and long, and then later the 1.2.2 spec is cited to be still sizeable but much less complicated than 1.1.
All plausible sentences, but confused in totality. Also, arguing for TOML as a language with a specification while criticising YAML for having a complex specification is again a little ugh. I guess YAML originally didn't have a spec while TOML did? Who knows.
One additional note on that history: Crockford was originally working with Data-E, a subset of E which only denotes plain old data. E's authors pivoted to working towards making ECMAScript capability-safe instead of pushing their own custom language, which led to ECMAScript adopting many ideas from E, like WeakMaps. JSON is merely ECMAScript-flavored Data-E. Closing out the note, we can see from archived pages on Data-E that they, too, were reacting to XML.
If I were to charitably reinterpret the article, I think it could be fair to say that YAML adoption was a reaction to the limitations of JSON, even if the development of YAML was not. I personally never heard about YAML until JSON was already ubiquitous. But I don't have data to support that perception.
With inline tables, the "bad" TOML example becomes:
[services.web]
image = "nginx:latest"
environment = {
DB_HOST = "postgres",
DB_PORT = 5432,
}
resources.limits = {
memory = "512M",
cpu = "0.5",
}
Seems fine to me, fewer characters than the YAML example if we're golfing about it.
I believe that this is out of scope for the stated aims (to defend YAML), but I wish that this post discussed TOML 1.1 and its new inline tables feature. I find that it addresses my three main gripes with JSON:
Yes, it seems a bit odd to state the case against YAML is "an argument against a format that no longer exists in its problematic form" – which is fair enough – but then argues against an older TOML version.
Note that Python 3.15 supports TOML 1.1 and that uptake of TOML 1.1 seems much better than YAML 1.2. This is probably mainly because updating a TOML 1.0 parser to support 1.1 should be almost trivial, whereas updating YAML 1.1 to 1.2 probably not so much (I can't actually find a list of changes on yaml.org? Just two huge specs?)
I'll also add that things such as the "Norway Problem" are minor footnotes in my dislike of YAML; I just don't like how finicky it can be to edit, the significant whitespace, and how complex it is.
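For anyone who hasn't run into the "Norway Problem": a rough sketch of why an encoder has to be YAML-aware. The word lists below are my reading of the YAML 1.1 bool type and the YAML 1.2 core schema; `emit_scalar` is a hypothetical helper, not any library's API, and it deliberately ignores numbers, timestamps and everything else a real emitter must handle.

```python
import json


def _casings(words):
    # The specs list exactly three spellings per word: lower, Capitalized, UPPER.
    return {form for w in words for form in (w.lower(), w.capitalize(), w.upper())}


# Unquoted scalars that resolve to null (the empty string included).
NULLS = {"", "~", "null", "Null", "NULL"}
# YAML 1.2 core schema booleans.
BOOLS_1_2 = _casings(["true", "false"])
# YAML 1.1 additionally treats y/n/yes/no/on/off as booleans.
BOOLS_1_1 = BOOLS_1_2 | _casings(["y", "n", "yes", "no", "on", "off"])


def emit_scalar(text, version="1.1"):
    """Return `text` as a YAML scalar, quoting it if it would not stay a string.

    Hypothetical helper: only the bool/null cases are covered.
    """
    reserved = NULLS | (BOOLS_1_1 if version == "1.1" else BOOLS_1_2)
    # json.dumps yields a double-quoted string, which YAML also accepts.
    return json.dumps(text) if text in reserved else text


for country in ["SE", "NO", "DK"]:
    print("1.1:", emit_scalar(country), "| 1.2:", emit_scalar(country, version="1.2"))
```

Under the 1.1 rules `NO` has to be quoted to stay a string; under 1.2 it does not, but `Null`, `true`, `FALSE` and the empty string still do.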
I dislike YAML because of things like the Norway problem (which is still a problem for strings like "", "Null", "true" and "FALSE" because of unquoted strings—you need a YAML-aware encoder) and its general complexity, even though YAML 1.2 dialled it back a little (a little); but you know the reason why I despise YAML?
When indentation is load-bearing, it’s fine to disallow mixed or inconsistent tabs and spaces; Python 3’s approach seems pretty good:
Indentation is rejected as inconsistent if a source file mixes tabs and spaces in a way that makes the meaning dependent on the worth of a tab in spaces; a TabError is raised in that case.
But YAML is the only file type that I can think of which, permitting indentation (optional or mandatory), doesn’t support both tabs and spaces.
(Oh, you say Make doesn’t support spaces? You can set .RECIPEPREFIX to something other than tab. I could even argue that the tab isn’t indentation, as Make uses it.)
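To make the Python rule quoted above concrete, here is a small sketch using the built-in `compile()`. The `classify` helper and the sample sources are made up for illustration; the point is only that consistent tabs or consistent spaces are fine, while a block whose meaning depends on the tab width is rejected.

```python
def classify(source):
    """Compile `source` and report whether Python 3 accepts its indentation."""
    try:
        compile(source, "<example>", "exec")
    except TabError:
        return "TabError"
    return "ok"


# Same block, indented three ways.
only_tabs = "if True:\n\tx = 1\n\ty = 2\n"
only_spaces = "if True:\n        x = 1\n        y = 2\n"
# One line uses a tab, the next uses eight spaces: whether they are in the
# same block depends on how wide a tab is, so Python refuses to guess.
mixed = "if True:\n\tx = 1\n        y = 2\n"

for name, src in [("tabs", only_tabs), ("spaces", only_spaces), ("mixed", mixed)]:
    print(name, "->", classify(src))
```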
But YAML is the only file type that I can think of which, permitting indentation (optional or mandatory), doesn’t support both tabs and spaces.
I can't find it now, but I am sure I have read a quote from Guido van Rossum where he said that one thing he would definitely change about Python if he was starting over would be to disallow literal tabs for indentation.
PEP-8 strongly recommends against using tabs for indentation:
Tabs should be used solely to remain consistent with code that is already indented with tabs.
one thing he would definitely change about Python if he was starting over would be to disallow literal tabs for indentation
Which goes to show that even very smart people can make dumb choices that negatively affect other people.
Insisting on spaces for indentation is like insisting that the font your IDE uses must be the same as everyone else's. There is (or was) a GH issue on one of the more popular "formatter" tools with a bunch of evidence to show that tabs are a net win for a bunch of people who use assistive technologies - not to mention just straight out being better at the job of indenting by one logical level.
I’m not going to attempt to argue for or against tabs, but I assume that van Rossum's position arose from years of lived experience and feedback, and is not simply an arbitrary expression of preference.
Why would you assume that?
Literally the only argument for spaces is "I prefer indents to be exactly X characters wide FOR EVERYBODY".
There's no other logical argument to be made for their use, and just like YAML, people use what other people use "to fit in" not because it's actually the best option/solution.
Why would you assume that?
I assume that van Rossum has spent a lot more time thinking about Python than me.
There is nothing python specific about whether to use tabs or spaces.
But also keep in mind you're implicitly trusting someone who still believes that significant white space is not a terrible idea.
The fact they're significant in Python makes them substantially different from other languages where you can just tokenize them both as "whitespace" and collapse all consecutive whitespace into a single token.
Though in python's case I think it'd be a lot simpler if it had just mandated tabs for indentation. One tab equals one indent level. Simple for both the implementation and users. The only gotcha is copying from terminals when doing so doesn't preserve tabs.
A specific tab size setting inevitably becomes entrenched in any project as people start mixing tabs and spaces for indentation and alignment. The outcome here is that even if a project uses tabs, there's only one good tab size to view the code at, and any other tab size shows wrongly indented code.
Your entire argument is based on bad practice.
Indenting code should be about logical nesting. If you care about "alignment" you're doing something wrong.
You may as well specify the font colour the IDE will show keywords in.
You can like it or not, but it's a very popular style seen in many large, established, influential codebases and that is an empirically observable fact. That is the experience you are dismissing out of hand with your logical arguments.
I think you need to re-read my comments. I never once said these practices aren't popular. That isn't in debate here.
Lots of things are "very popular style seen in many large, established, influential codebases". The original topic, YAML, is a perfect example of this. Being popular in tech doesn't make something explicitly good, it makes it popular.
Lots of things in general are "very popular" without actually being good.
If you want to go along with the crowd because it's popular you're quite welcome to do that - that's the defining characteristic of "popular" things.
Personally I choose to do things that make sense rather than just "because those people are doing it".
The outcome here is that even if a project uses tabs, there's only one good tab size to view the code at, and any other tab size shows wrongly indented code.
I use different tab sizes on different devices and situations. I typically use tabs of size 8, although for HTML I occasionally switch it to 6. For very deeply nested HTML, I sometimes switch to 4 or 5.
How could "any other tab size shows wrongly indented code"?
Do you have specific examples?
How could "any other tab size shows wrongly indented code"?
Some examples at https://lobste.rs/c/4oekdc
Tools like e.g. gofmt will use tabs for indentation and spaces for alignment, where it decides that's needed. I know Python has e.g. black for formatting, which I think is fairly wide-spread now? I agree this is a real problem (often present in C projects), but also a solvable one.
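A toy illustration of the two styles being argued about in this subthread, using `str.expandtabs` to simulate different editor tab sizes. The sample lines and the `column_of` helper are invented for the example, and the tab size of 4 for the "tabs everywhere" author is an assumption.

```python
def column_of(line, token, tabsize):
    """Column at which `token` is displayed when tabs are `tabsize` wide."""
    return line.expandtabs(tabsize).index(token)


first_line = "\tresult = compute(alpha,"

# Style A: tabs for everything, written by someone whose editor shows tabs as 4.
# Five tabs plus one space happen to put `beta` under `alpha` at that size.
tabs_for_alignment = "\t\t\t\t\t beta)"

# Style B: one tab for the indent level, then spaces to align under `alpha`.
spaces_for_alignment = "\t" + " " * len("result = compute(") + "beta)"

for tabsize in (4, 8):
    print(
        f"tab={tabsize}:",
        "alpha at", column_of(first_line, "alpha", tabsize),
        "| style A beta at", column_of(tabs_for_alignment, "beta", tabsize),
        "| style B beta at", column_of(spaces_for_alignment, "beta", tabsize),
    )
```

Style A only lines up at the author's tab size; style B lines up at any tab size, because the tab count encodes only the nesting level.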
Similarly, have you looked at van Rossum's stated position? Are you sure there's no logical argument to be made? van Rossum is on the record as pro-tabs, so perhaps there's some not obvious subtlety to his opinion born from 35 years of experience?
No I haven't at all. I was vaguely aware there's a python standard of some kind that strongly recommends spaces. Beyond that I'm going by what the comment I replied to stated, and what you've now stated - which is that he apparently is in favour of tabs, but would also make them disallowed for indentation if he did it all again.
Based on what you've each stated, I've got $100 says his position about only allowing spaces is ultimately because it's "more popular" today to use spaces.
But YAML is the only file type that I can think of which, permitting indentation (optional or mandatory), doesn’t support both tabs and spaces.
FWIW, from the NestedText specification:
Only ASCII spaces are allowed in the indentation. Specifically, tabs and the various Unicode spaces are not allowed.
I feel like yaml, json, and toml all have their own niches. I’ve long felt a gap, though, that https://kdl.dev/ has filled for me.
json = arbitrarily nested data interchange
yaml = shallow container format where elements are expected to contain multiline strings
toml = almost entirely flat config files
kdl = ruby-style DSL as data
I've also settled on KDL for new projects needing configuration, mainly because it removes a whole lot of irrelevant delimiter syntax or indentation.
kdl is so nice! I keep looking for opportunities to use it. There are various circumstances where having both attributes and child nodes (as in XML) makes for much more readable markup and it's great to have an option with a lightweight syntax.
As soon as I saw the example I thought... hey this looks like SDLang... turns out that's not a coincidence! Thanks for referencing KDL, that's good to know about.
I don't think the argument "YAML isn't bad, it's just YAML 1.1 was bad, and also, there are mostly 1.1 parsers" works as well as the article wants.
I do enjoy YAML with some restraint, and I'm glad there's new performant parsers for YAML 1.2. But if the "bad" version of the thing is still used by the majority, I still would use something else. If I can't rely on my YAML being interpreted correctly (because I don't have control over the parser used), YAML (as a whole) is still "bad".
Everyone should probably migrate to 1.2, but until that's happened, I think it's fine to treat YAML with a bit of reservation.
The YAML-versus-TOML debate, as typically conducted, is an argument against a format that no longer exists in its problematic form. The complaints are real, but they are historical.
screams in Github Actions and Kubernetes
Strong defense. TOML is still nicer to read for really simple documents, and it's easier to get people to write toml than yaml.
Sadly there is a long tail of inertia to change public (developer) perception around a tool. People read a story and make up their mind, so they can get on with other tools that don't have public missteps.
and it's easier to get people to write toml than yaml.
Not once you go past 2 levels of organisation with arrays. Indentations make it way easier to understand the structure at that point.
not saying it's easier to write. Saying it's easier to get people to write it. Social not individual problem.
I wish config file formats allowed you to specify a standardised schema so my editor can take an arbitrary config file, point out typo'd keys, or mismatched types. It should be able to provide "hover" hints to document what a key is for. It should allow for easy completion of valid keys. Bonus points awarded if it also supports simple assertions and/or contracts to point out invalid values too. (eg the "color" key should match /#[a-fA-F0-9]{6}/).
Ideally, you'd use this to autogenerate the configuration file data structures too.
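To show the kind of feedback being wished for here, a minimal sketch in plain Python. The `SCHEMA` layout, the key names, and the `check` function are all hypothetical (this is not JSON Schema or any real tool); only the colour pattern comes from the comment above.

```python
import difflib
import re

# Hypothetical schema: expected type, a doc string for "hover" hints,
# and an optional regex contract on the value.
SCHEMA = {
    "name": {"type": str, "doc": "Display name of the service."},
    "port": {"type": int, "doc": "TCP port to listen on."},
    "color": {"type": str, "doc": "Accent colour.", "pattern": r"#[a-fA-F0-9]{6}"},
}


def check(config, schema=SCHEMA):
    """Return a list of human-readable problems found in `config`."""
    problems = []
    for key, value in config.items():
        rule = schema.get(key)
        if rule is None:
            # Typo'd key: suggest the closest known key, if any.
            close = difflib.get_close_matches(key, list(schema), n=1)
            hint = f" (did you mean {close[0]!r}?)" if close else ""
            problems.append(f"unknown key {key!r}{hint}")
        elif not isinstance(value, rule["type"]):
            problems.append(
                f"{key!r} should be {rule['type'].__name__}, got {type(value).__name__}"
            )
        elif "pattern" in rule and not re.fullmatch(rule["pattern"], value):
            problems.append(f"{key!r} does not match /{rule['pattern']}/")
    return problems


for problem in check({"colour": "#fff", "port": "8080", "color": "#ff00zz"}):
    print(problem)
```

An editor plugin would presumably surface the same information as squiggles, completions and hover text rather than a printed list.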
You are just straight up describing XML
There are multiple validation specification formats like XSD and Relax NG but I am most familiar with XML DTD so I can't speak about the other ones
It's pretty common for JSON files to have a $schema property as a top-level key, which references a JSON Schema file which defines the correct schema. It's essentially XSD for JSON. A good editor will usually be able to do completions and red squiggles with that as the basis.
The nice thing about it is that because TOML and YAML are (roughly) JSON with different syntaxes, they can usually also use JSON Schema files as well.
a $schema property as a top-level key, which references a JSON Schema file
Be careful: there's an if statement in the specification for $schema, which means it is only a coincidence when what is otherwise a namespace URI also works as a URL. XML Schema has the same problem: that's why xsi:schemaLocation exists to translate URIs into URLs, and there was a discussion about ways to solve it: https://github.com/orgs/json-schema-org/discussions/460
I wish the ruby yaml parser bundled with the interpreter supported yaml 1.2.2.
Alas, it's not clear to me how you can transition to a new version default without breaking the ecosystem.
In YAML, the indentation communicates the hierarchy at a glance
Maybe to more sensitive eyes, but without left-side indentation guides, it's hard for me to correctly read the entire hierarchy of the YAML structure with 2 space indentation.
Python demonstrated decades ago that indentation as structure is not a weakness but a strength
I agree, but Python tends to use 4 space indentation which makes the indentation much more apparent.
The terseness of YAML is great, but in my recent foray into setting up GitHub actions for a project, I found incorrect YAML very easy to write. I'd rather the extra verboseness of JSON or TOML just for fewer edit-run cycles from more explicitness. Even something a little bit more heavyweight like Lua would seem like a better option in some cases.
I'd rather the extra verboseness of JSON or TOML just for fewer edit-run cycles from more explicitness
Then lucky you, because YAML is a superset of JSON, and thus you are welcome to write your GHA in json and just save it to a file named .yaml. I don't envy the second person who has to edit a big block of shell inside a json string, but it should work just fine
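A sketch of what that looks like in practice, using only the standard `json` module. The workflow content is made up, and whether a particular YAML parser (or GitHub Actions itself) accepts every JSON document is the claim above, not something this snippet verifies.

```python
import json

# A multi-line shell step, as you would normally write with a YAML block scalar.
script = "set -e\nmake build\nmake test"

# Hypothetical minimal workflow, expressed as plain data.
workflow = {
    "on": ["push"],
    "jobs": {
        "build": {
            "runs-on": "ubuntu-latest",
            "steps": [{"run": script}],
        }
    },
}

# JSON text; per the comment above it could be saved under a .yaml name.
text = json.dumps(workflow, indent=2)
print(text)
```

The shell script ends up on one line with `\n` escapes inside a quoted string, which is exactly the part the next editor won't enjoy.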
I am coming to think that one of the biggest problems with YAML is that it encourages deep nesting. Deeply nested config is horrible to read, especially when it's 2-space indented. Kubernetes has a lot to answer for.
A lot of issues with configuration files are also fixed if you have a proper schema for the data. I usually create my schemas with Pydantic (or with schemars when using Rust), and I can thus write the schemas to disk easily. It makes authoring the files by hand much easier and makes it clearer how the data will be validated when it's loaded.