IPv6 zones in URLs are a mistake
41 points by gmem
TL;DR: computers were a mistake.
It's a bit of throwing the baby out with the bathwater isn't it.
Literal anime villain logic(e.g. Trigun): humans are capable of heinous crimes, so I'm going to eradicate all humans.
I know it's said tongue in cheek but it's interesting to me, this is going to be the topic of my next talk which I'm in the middle of working on.
Yes, yes, the title is very dramatic and clickbait.
But the point still resonates with me; this stuff is such a silly state of affairs, isn't it?
I didn’t learn anything from the article (percent encoding? so what?) but the discussion here revealed that the mess is much deeper and has far gnarlier tentacles (they fucked up percent encoding? wtf?) than I expected even with my starting assumption that scoped IPv6 addresses usually don’t work properly…
Excellent taste sir! Trigun was the second anime I watched on my own, and to this day is my favourite (although this year's Journal with Witch is hard to beat...). Nice to see it called out :)
The actual mistake was having zones at all, and then permitting them to be anything other than numeric indexes. On Macs the zone is even encoded into the IP address itself, which makes me wonder why that option couldn't have been used instead.
The whole thing is pretty bad. Hostnames are also case insensitive in URLs, yet on Linux EN0 and en0 are different devices. Then between the WHATWG and IETF standards there is currently no agreement on quite a few of these things.
Yeah, fe80::/10 is huge! With SLAAC we're barfing entire MAC addresses into global unicast addresses (or we were, before privacy extensions), why couldn't we encode a measly interface identifier into link-local addresses?
That doesn't work because it tells you what the interface ID is on the remote system. The kernel needs to know the interface ID on the local system so it can use the right interface. This information just doesn't belong in the address that's sent on the wire because it's specific to the local system.
This information just doesn't belong in the address that's sent on the wire because it's specific to the local system.
Sure, we eventually figured this out for MAC addresses too and came up with RFC 4941. But also, I think having this textual "scope" metadata in IPv6 addresses that disappears on the wire is the even worse option as it breaks too many rules. Now we have no 1:1 mapping between IPv6 addresses as they appear in config files, logs etc. and IPv6 addresses as they appear on the wire.
MAC addresses are different because the other end can just treat that part of the address as opaque. Interface IDs are meaningful on both ends. How is it supposed to work when the interface ID is different on each end?
For example, two hosts are on the same link-local network, but on host A the interface ID for that network is 1 and on host B the interface ID is 2. What's the link local address for host A? Does it embed 1 or 2? If it embeds 1, then B can't contact A because it will try to route the packet out interface 1 instead of 2. But it can't embed 2 either, because then A wouldn't know its own address since 2 is specific to host B.
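To make the "disappears on the wire" point concrete, here is a minimal sketch using Python's ipaddress module. It assumes Python 3.9 or later, where IPv6Address accepts a %zone suffix; the address and interface name are made up for illustration.

```python
import ipaddress

# Assumption: Python 3.9+, where IPv6Address understands "%zone".
addr = ipaddress.IPv6Address("fe80::1%eth0")

zone = addr.scope_id   # local-only metadata, here "eth0"
wire = addr.packed     # the 16 bytes that actually go into a packet

# Rebuilding the address from the wire bytes cannot recover the zone:
# the receiver only ever sees fe80::1.
from_wire = ipaddress.IPv6Address(wire)

print(zone, len(wire), from_wire, from_wire.scope_id)
```

The zone lives only in the textual form on the sending host, so a config file or log line that says fe80::1%eth0 has no exact counterpart in the packet.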
Failing to properly account for link-local scope IPv6 addresses in URLs is a mistake. Although, it's not uncommon for things to fail to handle link-local IPv6 addresses properly.
On the other hand, accounting for zones in IPv6 addresses in URLs adds a fair amount of complexity because % now means something different only in the host part of the URL and then only inside [...] when the host is an IPv6 address.
The grammar might still be unambiguous but that's not the issue - the issue is that the more edge cases like this exist, the more likely any given URL parser is to overlook that edge case, and parser differentials are where nasty bugs or even security issues sneak in.
It would definitely be my preference to account for IPv6 zones in URLs, but given that it was the guidance at one point that the % must be url encoded, changing that back at this point would create an ambiguity. Unfortunate.
A lot of URL/URI/IRI libraries do already split up the parsing of individual parts into their own functions and it shouldn't be difficult to cover in testing. (for example go net/url, servo/rust-url, python urllib, and yescallop/fluent-uri-rs)
It is honestly a shame just how divided all the implementations are in which RFCs they follow; it puts IPv6 zone identifiers in an undeservedly difficult position.
I don't think it will still be unambiguous. Say that you have an interface called dc1 (perhaps from the OpenBSD dc(4) driver for DEC/Intel 21140/21142/21143/21145 and clones). Then fe80::4%dc1 in a URL is ambiguous: is the thing after the 4 an encoded 0xdc, or is it a zone?
I covered this ambiguity in my third paragraph
Hmm? I understand and agree with your proposition that changing it from "need to encode the %" to "the % does not need to be encoded" creates ambiguity, yes.
Yet I'm claiming that using unencoded % is simply ambiguous, regardless of whether we are switching from %25.
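A quick way to see the dc1 ambiguity is to run both strings through a generic percent-decoder. This is only a sketch with Python's urllib.parse.unquote_to_bytes; the helper name decode_host_octets is made up, and real URL parsers may well reject these inputs earlier.

```python
from urllib.parse import unquote_to_bytes

def decode_host_octets(host: str) -> bytes:
    """Hypothetical helper: percent-decode a host string as a naive parser might."""
    return unquote_to_bytes(host)

# "%dc" is a valid escape, so the zone "dc1" turns into byte 0xDC followed by "1".
dc1 = decode_host_octets("fe80::4%dc1")

# "%en" is not valid hex, so this decoder leaves it alone and the zone survives.
en1 = decode_host_octets("fe80::4%en1")

print(dc1)
print(en1)
```

Whether an unencoded zone survives depends entirely on whether its first two characters happen to be hex digits.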
Per RFC 3986 percent encoded octets are not valid inside IPv6 addresses, so in theory % is up for grabs there, it has no defined meaning.
RFC 6874 proposed to extend the syntax defined in RFC 3986 to allow IPv6 zone identifiers with % percent encoded as %25, but it was found to be unimplementable.
As RFC 6874 mentions, some browsers had already started to accept zones without requiring the % to be percent encoded. I have no idea what happens if you do need to percent encode an octet which is part of the zone name in those browsers.
It seems like the current recommendation is just publish an mDNS name and avoid the whole issue. It seems like it's all too messy to resolve at this point.
How was it found to be unimplementable? My own URL parser implements the RFC 6874 extension, and I didn't find it unimplementable, and the link provided wasn't to a direct discussion of the issue.
Chasing down references points to this document, which basically says that the problem is that either you accept a bare %en1, which makes parsing complex and potentially ambiguous if your zone name starts with two hex digits, or you don't and usability suffers.
RFC 9844 says that user interfaces MAY use an alternate delimiter instead of %, producing e.g. fe80::1-eth0, which solves the % issue but not the origin one, and in particular that RFC specifically exempts web browsers.
I really don't like the tone in this article. It would've been nicer to present it as 'look at this unexpected quirk I just learned about!' rather than negatively label technological developments and the entire industry just because a thing didn't happen to work the way the author imagined.
The way the zones are encoded in the URL makes perfect sense even, it's consistent behaviour: you have a percent sign, you need to escape it. Yeah maybe it's a bit annoying because you need to remember this when you type a zoned IPv6 address. But don't blame entire industries because of your minor oversight.
Or, how the kids these days would say, "skill issue".
So in the meantime in order for Anubis to point to IPv6 zoned addresses, you need to encode the % with percent encoding. This is horrible, but it seems that this is an edge case that applies to other frameworks, programming languages, and libraries:
Any library/thing that's compatible with RFC 3986 will encounter this issue because the spec defines percent encoded sequences like this:
pct-encoded = "%" HEXDIG HEXDIG
HEXDIG in turn is defined as [0-9a-fA-F], and so it parses %et as an invalid sequence. The spec also states the following:
Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.
So yeah it sucks, but it's not really a bug.
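For illustration, the pct-encoded rule quoted above can be checked with a small regex. This is a sketch only: the function name invalid_escapes is hypothetical, and it looks at escapes in isolation rather than implementing the full RFC 3986 grammar.

```python
import re

# pct-encoded = "%" HEXDIG HEXDIG, with hex digits matched case-insensitively
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")

def invalid_escapes(s: str) -> list[str]:
    """Return each '%' (plus up to two following chars) that is not a valid escape."""
    bad = []
    for m in re.finditer(r"%", s):
        if not PCT_ENCODED.match(s, m.start()):
            bad.append(s[m.start():m.start() + 3])
    return bad

print(invalid_escapes("fe80::1%eth0"))    # the bare zone delimiter is flagged
print(invalid_escapes("fe80::1%25eth0"))  # the escaped delimiter passes
```

Note that a zone such as dc1 would pass this check unescaped, because %dc is a well-formed escape. A grammar check alone cannot tell you what the author meant.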
So in the meantime in order for Anubis to point to IPv6 zoned addresses, you need to encode the % with percent encoding. This is horrible
Why is this horrible? Encoded characters are allowed in the host part and using % directly for zoned addresses would create an ambiguity. % itself is not unreserved, and as such it should be percent-encoded.
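In practice the %25 form round-trips cleanly. Here's a minimal sketch with Python's urllib.parse; the helper zoned_url, the address, and the port are made up, and other languages' parsers may behave differently.

```python
from urllib.parse import quote, unquote, urlsplit

def zoned_url(addr_with_zone: str, port: int, path: str = "/") -> str:
    """Hypothetical helper: build an http URL for a zoned IPv6 literal."""
    # Keep ":" literal, but let quote() turn the zone delimiter "%" into "%25".
    host = quote(addr_with_zone, safe=":")
    return f"http://[{host}]:{port}{path}"

url = zoned_url("fe80::1%eth0", 8080)
parts = urlsplit(url)

# urlsplit hands back the host still encoded; decode it exactly once
# before passing it to anything that opens a socket.
zone_host = unquote(parts.hostname)

print(url)
print(parts.hostname, parts.port, zone_host)
```

The "decode exactly once" step is the part that tends to go wrong, which is the double-decoding hazard the RFC 3986 text warns about.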
Go also does not seem to follow this RFC in net/url.
It's not new for Go's net/url and net/http to be in conflict with the URL RFCs. In particular, I find the existence of net/url.URL.Path and its use in net/http to be rather annoying (breaks %2F). net/http.Redirect also uses path.Clean, which incorrectly collapses // to /.
There are many reasons I would like to fork the URL-touching parts of the Go standard library (or just propose having a net/url/v2, or such). But I believe Go's handling of IPv6 zoned addresses is sound and correct, from what I can see in this article.
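To show why a pre-decoded path field "breaks %2F", here is a rough Python analogy. To be clear, this is not Go's code: unquote stands in for decoding the path up front, posixpath.normpath stands in for a path-cleaning step, and the example path is made up.

```python
import posixpath
from urllib.parse import unquote

raw_path = "/files/a%2Fb/meta"

# Split first, decode each segment second: "a/b" stays a single segment.
segments_ok = [unquote(seg) for seg in raw_path.split("/")]

# Decode first, split second: the escaped slash becomes a real separator.
segments_lossy = unquote(raw_path).split("/")

# A cleaning step that merges an empty segment changes the path's meaning too.
collapsed = posixpath.normpath("/a//b")

print(segments_ok)
print(segments_lossy)
print(collapsed)
```

Once the decoded form is the only thing a handler sees, there is no way to tell the two-segment URL from the three-segment one.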
I agree with the first part of your comment – and most of the second part. But speaking (as someone with plenty of issues with golang!) from experience with web programming and trying to solve redirects at the client/backend/webserver level, I think the net/http.Redirect/path.Clean behavior is probably the desired result in 99.99% of the cases, and doing it any other way would cause more problems than it would solve.
Hmm, I understand your point. I think it's very valid to supply middleware that does so. I still do believe that the current shape of the stdlib makes it excessively hard to get strictly spec-compliant behavior for the times I need it.
I think non-loopback link-local IPs in http(s) URLs are nearly always a mistake. IME, DNS (or at least mDNS) is worth it the instant you need to hit an IP address from a client that can't reach the loopback. It's a trivially automated creature comfort.