Punycode: My New Favorite Algorithm
26 points by iand675
Am I too paranoid, or are parts of this AI generated? The base36 section, for example, has so much fluff just to say that it's all the characters allowed by the DNS spec. There's even a bullet list of examples of the different inefficient bases.
It's a shame - IMO all the fluff just detracts from what I think is the main point of the article (the adaptive encoding), which is genuinely interesting.
Base-36 extracts every bit of information density possible while playing by DNS’s rules. When you’re encoding hundreds of millions of domain names, these efficiency gains matter.
I think this doesn't have much to do with how many domain names you are encoding (if performance was the concern, I assume base32 could maybe even be the better choice?). Domain name lengths are limited, so you just don't want to waste space that could be used to store more data.
I don't understand the "aü北aü京" example. Won't the letters get sorted anyways? 35, 180, 170 doesn't really seem like a "wildly oscillating" bias, since it settled down around 170-180. The graph actually makes it look like it's reaching more extreme values with damping - but they don't match up with the numbers in the text. idk what that's about.
(Also, I'm not sure why, but bücher renders with a regular "e" for me.)
Fair feedback, thanks!
I used AI as an editor on this, but not for content. I stitched the content together from my running notes while implementing the IDN library. I've had some recent success in blog posting, so was trying to capitalize on getting one more post out before exiting holiday mode, but maybe rushed a bit here :)
When I initially made notes I didn't fully grok damping vs. skew, but I think it'd be more accurate to say that damp is simplistic temporal smoothing based on sample size, whereas skew shapes how aggressively bias responds to any given delta magnitude. I'll think about how to address it better, but you're right that oscillation isn't really a thing; it was more that what "damping" means in other contexts colored how I interpreted it here.
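For concreteness, the bias adaptation in RFC 3492 §6.1 looks roughly like this (a Python paraphrase of the RFC's pseudocode, not lifted from my library):

# RFC 3492 parameters
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP = 38, 700

def adapt(delta, numpoints, firsttime):
    # damp: one-off scaling of the very first delta (the "temporal smoothing" part)
    delta = delta // DAMP if firsttime else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    # skew: shapes how strongly the remaining delta pulls the bias (applies every time)
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)

damp only ever applies to the very first delta, while skew participates in every adaptation, which is why temporal smoothing vs. response shaping feels like the better framing.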
Regarding your quote:
Base-36 extracts every bit of information density possible while playing by DNS’s rules. When you’re encoding hundreds of millions of domain names, these efficiency gains matter.
I think I addressed it when I said:
Base-32: 5 bits/char, closer but still leaving efficiency on the table for no reason
My point about efficiency wasn't encoding speed or something, it was about minimizing bytes over the wire
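To put rough numbers on that (just log2 of the alphabet size, nothing punycode-specific):

import math

print(math.log2(36))  # ~5.17 bits carried per output character in base-36
print(math.log2(32))  # 5.00 bits per character in base-32

A few percent fewer characters for the same deltas, which matters when the whole label has to fit within DNS's 63-byte limit.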
I used AI as an editor on this
Honestly, I would've preferred to see the unedited version. It doesn't seem like the changes made by the AI were very good.
I thought this article was very nice. I think LLMs can be very good editors (also mobile friendly when working from a phone). Just because some parts of this are wordier than some people would write doesn’t make it bad. If anything, it’s good because it’s skewing your writing for comprehensibility.
good because it’s skewing your writing for comprehensibility
I think the other commenters are arguing exactly the opposite. The fluff gets in the way of the good information, risking obfuscation. I've done some technical writing in the past and a key component in that field is the ability to deliver information in a clear and concise way. Authors who rely on AI, which often confuses and confabulates, hinder themselves from learning how to do this well. You get the recipe-blog effect; too much backstory, too little instruction, leaving the reader annoyed which we obviously don't want.
I have to agree with them as a person who struggles to develop his writing into something that would attract a wider audience. Struggle as I may, I do so knowing that if I do not put in the work, I will never improve.
Maybe I’m strange but I find more explanation easier to understand than a very terse explanation of something I don’t already know
I don't think you're strange. I think it depends on the information being conveyed. It's also important to know your audience, but that is, of course, easier said than done.
One of the things I've written was operating and general maintenance instructions for a laser engraving machine, a machine that I understand very well since I was an operator for a few years. I had to catch myself when it came to doing deep-dives into the safety data or find ways to summarize the reasoning behind a specific step into a single cohesive sentence. The typical operator likely did not have the patience to read through a thick manual before operating, but needed to get the info they needed to be able to operate the machine as safely as possible. It was extremely challenging.
By contrast, my own blog is more of a meandering stream of consciousness about whatever I'm working on at the time of writing; proof that I will get my word on whenever someone gives me the opportunity. But for something like that, I think it's okay to fluff things up since, so far as I have been told, none of my readers are really expecting to learn anything new, just visiting to read my story (I colloquially refer to this as "bathroom reading" since it's the kind of thing you can read while on the toilet instead of doom scrolling).
Personally, I do like knowing the why of a specific step or set of instructions, because it helps me understand what I'm doing better, but by all accounts, I'm the strange one. Perhaps we both are? Most of the people I've talked to about this over the years express the same sentiment; give me what I need to get it done, that's all.
Oddly, I get that way with recipes (I like to bake), which is why I mentioned the example previously. Just give me the steps, not a history of grandma's farm or whatever. Maybe the difference is that I'm trying to get a job done rather than reading for pleasure? Something to think about, I suppose.
Unfortunately, I'm not expert enough about the way people tend to take in information to really offer anything more than anecdotes from my own experience.
DNS infrastructure only supports ASCII characters (a-z, 0-9, and hyphens).
The actual problem that punycode solves is more knotty than this.
The DNS supports arbitrary binary in DNS names, except that ASCII upper case and lower case letters are treated the same.
Around the DNS there are many things that restrict domain name syntax to (roughly) the pre-DNS ARPANET letter-digit-hyphen host name syntax. These things include protocol elements such as email addresses and URLs, and pervasive software that enforces the traditional syntax, such as stub resolver libraries, mail servers, web servers, and so on.
(By contrast, Multicast DNS uses DNS wire format but doesn’t try to be compatible with these upper layers, so it can and does use raw UTF-8 in DNS names.)
Internationalized domain names are called IDNA because the i18n happens in applications, basically as close to the user interface as possible. Punycode translation is designed to happen at the point when a domain name is being input or displayed, so that all the existing ossified infrastructure does not need to be upgraded for i18n. That includes lower-level libraries linked into the application.
Punycode is about compatibility with protocols and software surrounding the DNS that assume ASCII letter-digit-hyphen domain names.
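You can see that translate-at-the-edge model in, for example, Python's built-in "idna" codec (IDNA 2003 semantics; modern applications tend to use the third-party idna package for IDNA 2008, but the shape is the same):

>>> "bücher.example".encode("idna")
b'xn--bcher-kva.example'
>>> b"xn--bcher-kva.example".decode("idna")
'bücher.example'

Everything between the application and the DNS only ever sees the ASCII form.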
The article was largely interesting, though it spent far too long talking about base-36 (it seems pretty obvious that this is the optimal encoding given DNS constraints?) and also about damping (the idea that the very first delta is often much larger than the rest also seems fairly self-evident). And what it didn't talk about is what I'm more interested in, both how the delta is actually calculated (i.e. how it includes the position), and what the bias actually does. It explains that the bias affects the variable-length encoding but it doesn't actually say how, and so e.g. I have no idea how a single-digit encoding can represent deltas of 0–500.
I've started reading the RFC itself to try to answer these questions. RFC 3492 §3.2 "Insertion unsort coding" is pretty interesting: it explains that the delta is actually the number of non-inserting state transitions, where the state is a pair of integers <n,i>, with i the position in the string being built (ranging from 0 to the current length) and n a counter of the number of times we've wrapped around. Every state transition increments i, or sets it to zero and increments n if it would exceed the length of the string. This makes a lot more sense now as to how the delta encodes both the character and the position (and how that fits into the sorting of deltas by codepoint). I haven't yet gotten to the variable-length encoding and bias; it looks like that's the next section, but I'm about to read it and I expect it to be similarly informative.
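For anyone else wondering about that part: in §3.3/§6.2 each digit gets a threshold t (clamped between tmin=1 and tmax=26) derived from its position k and the current bias; a digit below its threshold terminates the number, and anything at or above it carries over with weight (base - t). A rough Python paraphrase of the RFC's decoding loop (error handling omitted, lowercase digits only):

BASE, TMIN, TMAX = 36, 1, 26

def digit_value(ch):
    # a-z -> 0..25, 0-9 -> 26..35
    return ord(ch) - ord('a') if ch.isalpha() else ord(ch) - ord('0') + 26

def decode_delta(digits, bias):
    value, weight, k = 0, 1, BASE
    for ch in digits:
        d = digit_value(ch)
        value += d * weight
        t = min(max(k - bias, TMIN), TMAX)  # threshold for this digit position
        if d < t:                           # a digit below its threshold ends the number
            return value
        weight *= BASE - t
        k += BASE
    raise ValueError("truncated delta")

decode_delta("kva", 72) gives 745, which matches the bücher numbers quoted elsewhere in the thread. And since t can never exceed 26, a lone digit can never encode a delta above 25, so the "single digit covers 0–500" claim in the article does look off to me.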
I tried to use a non-ASCII domain for a bit and found it to be essentially unusable. Too many systems still expect domains to be ASCII and need you to enter the punycode-encoded version of the domain. Too many systems don't know about punycode and display the punycode-encoded domain rather than the decoded version.
It's simply unacceptable to me that when I send e-mail to customers, most of them will see that it's coming from martin@xn--drumtech-54a.no (even if they're using clients from the big companies like Mail.app and Outlook). That looks sketchy as hell. So I moved everything over to an all-ASCII domain.
Totally fair. If pressed, I wouldn't use a domain that required non-ascii characters either. But I needed it to support a library, and the algorithm itself for how it's managed is quite elegant! Even if the rest of the ecosystem makes for a poor experience...
Is this incorrect though?
The example it gives ("bücher" becomes "bcher-fxa") isn't what Python gives:
>>> "bücher".encode('punycode')
b'bcher-kva'
The interactive encoding visualizer at the bottom of the article also says the same thing, that bücher encodes as xn--bcher-kva. When I try another tool I have for doing conversions it says that xn--bcher-fxa decodes to ćbcher.
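FWIW, round-tripping through Python's codec agrees on both counts:

>>> b"bcher-kva".decode("punycode")
'bücher'
>>> b"bcher-fxa".decode("punycode")
'ćbcher'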
It was somewhat interesting, though a bunch of it felt very AI slop-ish. Midway through I started finding it quite a slog due to the myriad AI-isms in its construction, which I found confusing since other parts of it felt like an actual person wrote it. I'm wondering if some kind of "AI assistant" was used on the original text? It just feels too unnatural: it has that kind of "averaged" text and heavy repetition which is something all the "AI" does ("AI" is glorified text prediction, so it will always tend toward the average).
It's a shame as the content itself was really interesting, and the author clearly found it interesting, which is really nice to read - a lot of posts these days seem to be humble brags or subtle (or not so much) ads for frameworks, tech, apps, services, ... - and this was really refreshing.
One comment I do have is that context-free decompression is the standard. All non-static compression algorithms have to start from a base statistical model: some archive formats might have a built-in initial dictionary, but that hasn't been particularly common for a long time; most these days start with either a completely blank slate or some relatively trivial set of initial states. Once you have a dynamic model of the content being compressed, you can only compress on the basis of the preceding symbols, because that is all that is available during decompression.
Punycode is quite interesting! The part I didn't get (maybe I missed it in the article) is how one encodes the position. I get that first we take all ASCII chars in order, and then we sort the non-ASCII ones and iterate. But where do we put the position of the character?
Yeah, I read through the entire article expecting this mystery to be finally solved, but no, it just keeps omitting this crucial detail. Like in the following section:
To decode, you reverse the process: split on the last hyphen to get bcher and fxa, decode fxa to recover the delta 124, add it to 128 to get 252 (which is ü), and insert it at position 1 to reconstruct bücher.
Where does position 1 come from?
I never found out, and now people are pointing out that the article is AI slop. It all makes sense now.
I should have checked Wikipedia. It says:
The non-ASCII characters are sorted by Unicode value, lowest first (if a character occurs more than once, they are sorted by position). Each is then encoded as a single number. This single number defines both the location to insert the character at and which character to insert.
The encoded number is insertionPoints × reducedCodepoint + index. By dividing by insertionPoints and also getting the remainder, a decoder can determine reducedCodepoint and index.
There are 6 possible insertion points for a character in the string "bcher" (including before the first character and after the last one). ü is Unicode code point 0xFC or 252 (see Latin-1 Supplement), and the reduced code point is 252 − 128, or 124. The ü is inserted at position 1, after the b. Thus the encoder will add the number 6 × 124 + 1 = 745, and the decoder can retrieve these by ⌊745 / 6⌋ = 124 and 745 mod 6 = 1.
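In code, that recovery step is just a divmod (names taken from the quoted paragraph; 6 is the number of insertion points in "bcher"):

encoded = 6 * 124 + 1                       # insertionPoints * reducedCodepoint + index = 745
reduced_codepoint, index = divmod(encoded, 6)
print(chr(reduced_codepoint + 128), index)  # ü 1 -> insert ü at position 1, right after the b

which is where the "position 1" in the article's decode walkthrough comes from.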