The Country That Broke Kotlin
53 points by snej
Big nit with the title: languages & nations are not the same thing. The main point: a language, Turkish, broke Kotlin (at least in Latin script form) due to camerality rules with a Turkish locale; the country, Turkey, did not. And later the article points out that [some] other Turkic languages like Azerbaijani & Kazakh (which, surprise, are not Turkey, not to mention that speakers can live in any nation) have the same issue with the dotted/dotless I.
I understand a clickbait title with nuance in the body, but the premise is just wrong—as wrong as using a nation’s symbol, the flag, for a language picker. I was expecting some sort of nation state involvement issue based on the title.
If you want to deal with weirdly shaped Venn diagrams, the super-category is "people who speak Turkish and people who live in Turkey" and there's surprisingly little overlap between those two classes.
At the fundamental level: so what? The outliers are human beings & generally I choose not to exclude some folks on the outliers when I could just be more precise with my words/presentation.
At the reality level: If we look at Turkish diaspora we see over 14 million folks are not in Turkey. If we go back & see that Azerbaijani & Kazakh languages are also being omitted despite the same Latin letter I issue, that’s several million more folks.
What can we learn from it?
I learned two things:
Let us repeatedly ask why until we reach the true root cause of those bugs.
Why did these bugs happen? Because programmers kept using case conversion functions which turned out to be locale-aware which means their output changed depending on global system state.
Why was their locale-awareness not considered? Because the complexity of locales was hidden away in some optional parameter to the locale-aware functions.
Why was the parameter hidden? Because people do not actually want to deal with locales in the first place. Locales are an afterthought. They're just sort of there, just in case you care a lot and want to do things right. You can choose not to pass a value and let the system cope.
It's a lesson in API design. Things should be as simple as possible, but not simpler. Providing these "easy" functions tricks programmers into thinking they can get away with not putting any thought into locales at all. The bug might not have occurred had the locale been a mandatory parameter to the function. That would have forced the programmer to make their assumptions explicit.
The optionality of the parameter lets the programmer avoid passing in a value. It's as though they had passed null instead. So in order to avoid nullability errors, the affected functions implicitly source the missing information from global system state instead, contributing to the confusion and to the difficulties in reproducing the bug.
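To make the "global system state" point concrete, here is a minimal Java sketch (the thread is about JVM behavior). The class and method names are made up for illustration; only String.toLowerCase(), String.toLowerCase(Locale), Locale.getDefault() and Locale.setDefault() are real JDK calls. It assumes the no-argument overload uses the JVM default locale, which is how the JDK documents it.

```java
import java.util.Locale;

// Hypothetical demo class, not taken from the article or from Kotlin.
final class DefaultLocaleDemo {
    // Lowercases using whatever the JVM default locale currently is (implicit global state).
    static String lowerWithDefault(String s) {
        return s.toLowerCase();
    }

    // Lowercases with an explicitly chosen, locale-neutral rule set.
    static String lowerWithRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Runs lowerWithDefault under a temporary default locale, then restores the previous one.
    static String lowerUnder(Locale temporary, String s) {
        Locale saved = Locale.getDefault();
        Locale.setDefault(temporary);
        try {
            return lowerWithDefault(s);
        } finally {
            Locale.setDefault(saved);
        }
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Same call, same input, different result depending on global state.
        System.out.println(lowerUnder(Locale.ENGLISH, "INFO").equals("info"));  // true
        System.out.println(lowerUnder(turkish, "INFO").equals("\u0131nfo"));    // true (dotless i, U+0131)
        // The explicit overload ignores the default locale.
        System.out.println(lowerWithRoot("INFO").equals("info"));               // true
    }
}
```

Nothing at the call site of lowerWithDefault hints that its result can change, which is the reproducibility problem described above.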
It's just something that keeps turning up again and again... The mpv project, for example, had to fix similar problems with locales and the result was this epic commit which discusses the issue at length (and a lot less politely).
This isn't a country breaking Kotlin, this is standard anglo- (or maybe Euro-) centric coders choosing to use case conversion because "it works for me" rather than case insensitive comparison APIs, which breaks other languages. That the "fix" was disabling locale sensitive functions (i.e. case conversion is functionally English) rather than just using correct comparison functions is kind of bizarre. You recognized the problem (anglo-centric code), and changed it to be ... more anglo centric? Wtf?
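For readers unsure what the distinction between "case conversion" and "case insensitive comparison APIs" looks like on the JVM, a small Java sketch follows. The helper names are invented for illustration, and this is not a claim about what the Kotlin compiler does or should do; it only contrasts lowercasing both sides under a locale with String.equalsIgnoreCase, which takes no locale.

```java
import java.util.Locale;

// Hypothetical helpers contrasting two ways to compare strings without regard to case.
final class CompareDemo {
    // Compares by lowercasing both sides with the given locale first.
    static boolean equalViaLowercase(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    // Compares with String.equalsIgnoreCase, which does not consult any locale.
    static boolean equalIgnoringCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(equalViaLowercase("INFO", "info", Locale.ROOT)); // true
        System.out.println(equalViaLowercase("INFO", "info", turkish));     // false: "INFO" lowercases to dotless-i "nfo"
        System.out.println(equalIgnoringCase("INFO", "info"));              // true
    }
}
```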
The comparison in question is happening between essentially internal compiler data structures. It is correct for this case conversion to be locale-insensitive.
You recognized the problem (anglo-centric code), and changed it to be ... more anglo centric? Wtf?
I believe this outrage is entirely unwarranted.
I think that you're right that English case conversion is the correct solution, since they are manipulating English text. But it's also reasonable to criticize the developers for taking an anglocentric perspective by not considering the ways that string manipulation can misfire if your API changes behavior by locale.
Actually I'm most inclined to blame whoever thought it was a good idea to make the behavior of stuff like "to_lowercase" dependent on global state. I think Rust has the best solution: to make very clear where a function on characters is defined for ASCII and where it is defined for everything, and to not let these functions be informed by global state. If you are manipulating text in a different language, you do stuff that makes sense on that alphabet or syllabary or what have you. There's very little you do in general across all possible writing systems.
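As a rough illustration of the "defined for ASCII only, no global state" idea, here is a Java sketch of a helper in the spirit of Rust's to_ascii_lowercase. The name toAsciiLowercase is made up here and is not a JDK or Kotlin API; the assumption is simply that only 'A' through 'Z' should change and everything else passes through untouched.

```java
// Hypothetical helper; the JDK has no method with this name.
final class AsciiCase {
    // Lowercases only the ASCII letters 'A'..'Z'; every other char is copied unchanged.
    static String toAsciiLowercase(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            out.append(c >= 'A' && c <= 'Z' ? (char) (c + ('a' - 'A')) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiLowercase("INFO"));                                       // info
        // Non-ASCII letters such as U+0130 are deliberately left alone.
        System.out.println(toAsciiLowercase("\u0130stanbul").equals("\u0130stanbul"));      // true
    }
}
```

The result depends only on the input, so it cannot vary between machines.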
Yeah, seems to me the issue in both cases was relying on default "works on my machine" behavior instead of rigorously establishing a rule that all case conversions have to specify locale. Being "more anglo centric" is fine here because the compiler actually is written in a particular locale. (Unless I'm missing something and there's a desire to support ıntArray in the compiler?)
Yeah I believe the Kotlin developers are mostly from eastern and central Europe since that’s where JetBrains is based.
There are three issues in the article:
The first two are a closed set and would probably be better written as a lookup table, but ASCII-only conversion is fine.
The third one I am less sure about: Kotlin allows unicode identifiers, but I don’t know if these array type names can be user-defined unicode names. The fully general version of the problem is that parts of Kotlin’s language, libraries, and/or tooling may want to faff with the capitalization of identifiers that may be written in any combination of natural languages, so Kotlin’s identifier (de-)capitalization algorithm needs to be unicode-aware but language- and locale-insensitive.
Unless I misunderstood the article, it says that the code responsible for the third problem is also only applicable to the primitive types:
Much like the boxInt() function we saw before, intArrayOf() is part of a wider family of functions: one array-builder function for each primitive type.
Tangentially, I of course agree that all three cases should have been lookup tables. The entire issue could have been sidestepped by simply not trying to be clever.
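A sketch of what "lookup table instead of being clever" could look like for the array-builder case, written in Java for illustration. This is not the Kotlin compiler's code: the class, the derived() helper and the table are assumptions, with the table filled in from Kotlin's eight primitive types and their standard array-builder function names.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; names and structure are not taken from the Kotlin sources.
final class ArrayBuilderNames {
    // Explicit mapping from primitive type name to array-builder function name.
    static final Map<String, String> BUILDERS = Map.of(
            "Boolean", "booleanArrayOf",
            "Byte", "byteArrayOf",
            "Short", "shortArrayOf",
            "Int", "intArrayOf",
            "Long", "longArrayOf",
            "Float", "floatArrayOf",
            "Double", "doubleArrayOf",
            "Char", "charArrayOf");

    // The "clever" approach: derive the name by lowercasing the first letter under a locale.
    static String derived(String typeName, Locale locale) {
        return typeName.substring(0, 1).toLowerCase(locale) + typeName.substring(1) + "ArrayOf";
    }

    // The table approach: no case conversion at all, unknown names fail loudly.
    static String lookup(String typeName) {
        String name = BUILDERS.get(typeName);
        if (name == null) {
            throw new IllegalArgumentException("not a primitive type: " + typeName);
        }
        return name;
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(derived("Int", Locale.ROOT).equals("intArrayOf"));     // true
        System.out.println(derived("Int", turkish).equals("\u0131ntArrayOf"));    // true: dotless i sneaks in
        System.out.println(lookup("Int"));                                        // intArrayOf
    }
}
```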
I think they didn’t actually realise their problem was anglocentrism, just that they "hit an edge case"
Kotlin is a well-funded project built on top of a fully Unicode-aware platform. I think any proper post-mortem on an issue like this needs to include analysis of why the project had nobody thinking about these sorts of issues or, seemingly, even aware that this class of issues could exist.
What makes you think they didn't?
Some random developer writes .to_lowercase(). They test it, and it works. Another random developer reviews the code. At what point in this process is Kotlin's Unicode Master supposed to be aware that this is happening? It's not feasible for them to review every PR.
Should the two developers responsible for writing "INFO".to_lowercase() have expected this sort of bug? Personally, I knew that Unicode was black magic and that .to_lowercase() was cursed, but I still would have expected pure ascii to behave deterministically. The set of strings was written by developers, not users. My only defense against this would have been my gut reaction that it felt like a code smell to me, something about "if you want them to have the same names, give them the same names" and "capitalization is vaguely cursed".
Some people will want to say that "everyone" should be aware of the fact that changing capitalization is locale sensitive even when restricted to pure ascii. And spreading knowledge like this is exactly what lobsters is for, at least in my mind! But damn, this one's pretty obscure. As far as things that "every programmer should be aware of", this definitely isn't in the top 100. (Consider: browsers, networking, security.) Maybe the top 1000, maybe not quite. Someone should write a book, 10,000 things that every programmer should know.
Your comment comes across as someone who believes Unicode is a topic to be disdained and ignored whenever possible (through your dismissive descriptions of "black magic", "cursed", etc.), and at best relegated to one "Unicode Master" on the team.
But that's precisely the problem that led to the bug here. And you don't quite seem to understand that, especially when you say things like this:
Some people will want to say that "everyone" should be aware of the fact that changing capitalization is locale sensitive even when restricted to pure ascii.
Everyone who works with text should know some basics of Unicode, and one of the basics of Unicode is being aware different scripts have different rules and behaviors and that therefore you should always be precise and explicit about the particular rules and behaviors you want.
That sort of guideline doesn't require a "Unicode Master" level of understanding. It's up there with my comment the other day about how regex character classes might match more things than you expect in a Unicode world, and how you should account for that, as one of the things that probably every developer should know and internalize about Unicode. But the continued disdain by programmers for Unicode itself leads to endless bugs like the one in this article, because what ought to have been basic principles every working programmer knows are instead treated as arcane expert-level knowledge to be hated and feared and avoided.
Your comment comes across as someone who believes Unicode is a topic to be disdained and ignored whenever possible
That's not at all what I think! I deeply respect Unicode, it tackles the incredibly complicated problem of digitally representing all major human languages and it does that very well! The cursedness is the complexity of human language, and the poor behavior of Kotlin's old .toLowerCase() function.
The term "black magic" implies that something is to be avoided if it's not needed, and carefully researched when it is. It implies that you need a deep understanding, not a shallow one, and that it's not as simple as you might first imagine.
But that's precisely the problem that led to the bug here.
I would put the blame for the bug on Kotlin's old .toLowerCase() function. It attempted to hide the complexity of Unicode, and that was the root of the problem. You can't change the capitalization of text without knowing the language that it's in. If the locale was required, and documented as "the language that the text you're converting is in", then the developer would have written .toLowerCase(English) and everything would have been fine!
Instead it defaulted to the user's locale, which is a shitty default because an English user's computer can have Turkish documents on it, and a Turkish user's computer can have English documents on it. You can't tell the language of some text from the computer's locale, that's like trying to guess someone's language from their IP address.
I would put the blame for the bug on Kotlin's old .toLowerCase() function
Some rather important context: Kotlin's modern deprecated toLowerCase and the new lowercase method are locale invariant. It's the Java functions that are locale dependent which is where the mistake comes in and Kotlin until 1.5 was consistent with this behavior. I would not fault Kotlin for this, because that's what people expected. I believe this is also why they renamed the method to make it clearer that the behavior departs from what you are used to in the JVM.
I would put the blame for the bug on Kotlin's old .toLowerCase() function. It attempted to hide the complexity of Unicode, and that was the root of the problem. You can't change the capitalization of text without knowing the language that it's in.
All you're doing is shifting the blame from the Kotlin team to... the Kotlin team. I'm not sure how that rebuts anything I originally said.
All you're doing is shifting the blame from the Kotlin team to... the Kotlin team. I'm not sure how that rebuts anything I originally said.
Not every conversation needs to be an argument with a winner and a loser.
Please, don't do that.
I interpreted your original comment as saying: "The developer on the Kotlin team who wrote qName.toLowerCase() must have been so ignorant of Unicode not to realize that .toLowerCase() would be locale dependent and know that case conversion of ascii depends on locale. Everyone should know this basic stuff." If you weren't saying that, we may not disagree.
PHP had exactly the same issue and there are plenty more examples out there of code that breaks on Turkish machines. It was a mistake of Java, C and other languages to make the capitalization functions locale aware. In very few situations (beyond writing a command line tool) is that the right solution. Cultures and locales really should be explicit arguments instead of some implicit default.
When I read that the person observing it was from Turkiye, I thought immediately of this old blog post: https://www.moserware.com/2008/02/does-your-code-pass-turkey-test.html
Everyone here is blaming locales, while I don't understand why the case conversions were needed in the first place.
In the first case, looks like the "canonical" way to write error levels is in all uppercase - so why have the dictionary for mapping names to enums in lowercase?
In the second case, capitalization is there purely for looking good - there is no good functional use for it - you could've easily gone by without it, using snake case.
In the third case, once again, removing capitalization is there purely for looking good, quickly. Having a map of the intrinsics would not have led to this problem.
All of these string operations were to automate making data mappings - mappings that could've been explicit.
While reading this post, this old story came to mind. Here, the difference between dotted and dotless i's resulted in deaths.
This makes me wonder, if programs (actual code, not merely comments or identifiers) were to be written in non-English (and later non-latin script using) languages, what can of worms would be opened for those programmers to face.
That was dark. That story was ludicrously dark.
Maybe the problem here wasn't the uppercasing or lowercasing or various accents on the characters, but rather the concept that murdering people who have insulted you (or someone in your family) is a reasonable response? I think perhaps we should invest a little more time on "don't murder people", before worrying quite too much about how phones deal with Unicode.
The problem here wasn't tech. The problem here wasn't caused by a bug. The problem here was that social norms included murder. Trying to address that problem with a more precise localization algorithm seems nonsensical.