ntoh*/hton* is a bad API
42 points by 0x2ba22e11
I am similarly vexed by this common misdesign. See also Rob Pike’s old “byte order fallacy” article.
At least part of it is because this basic integer deserialization code should typically compile to an unaligned load followed by a byte swap instruction, so the low-level translation kind of reinforces the bad tradition. Rob Pike’s suggested code ought to be wrapped up in a function, for convenience and safety; and since it will be in the hot path it might be better to write the function in a way that doesn’t rely on the compiler’s idiom detector, which pushes you towards using a byteswap intrinsic. (Tho tbh compilers have been perfectly capable of doing that automatically for decades, so the intrinsic seems unnecessary to me.)
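For concreteness, here’s the kind of wrapper I mean, as a rough sketch (the names are mine, and __builtin_bswap32 / __BYTE_ORDER__ are GCC/Clang-specific):

#include <stdint.h>
#include <string.h>

/* Pike-style idiom: portable little-endian load; compilers typically
   turn this into a plain load on LE hosts and a load+bswap on BE hosts. */
static inline uint32_t load_le32(const unsigned char *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Intrinsic-based variant for big-endian wire data: native-order load
   via memcpy, then swap if the host is little-endian. */
static inline uint32_t load_be32(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    v = __builtin_bswap32(v);
#endif
    return v;
}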
Both Rob Pike and this article just handwave the main reason this is done - performance. From Rob Pike's:
It may be a little faster on little-endian machines, but not much, and it's slower on big-endian machines.
in my experience, it's one of the easiest "easy wins" to go and get when your job is to make processing some network input faster: ensure that the receiving machine can know what endianness the data is in, and if it matches its own endianness, then reinterpret the data directly as the structure we want to use, without any expensive deserialization step. e.g. the absolute last thing we want to do is
i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24);
which may even require a copy and memory allocation and double the memory size if it's not just one "i" but a few thousand of those.
Fwiw this compiles to a noop in clang at -O on an LE machine, and the other one compiles to a simple bswap.
GCC requires -O3 for it, but also achieves it.
I think you misunderstood. It's not that i = (data[0]<<0) | (data[1]<<8) | (data[2]<<16) | (data[3]<<24); is slow in itself (though it actually surprises me to learn that GCC fails to optimize it at -O2, by far the most common setting for release builds). The problem is that in order to do any operation, even a no-op copy, for every integer, you need to structure your program inefficiently.
If you want to read 100 000 integers from a socket or file of some kind into memory and then operate on them, and the endianness matches, the absolute most effective way to do that is something akin to:
uint32_t *integers = new uint32_t[100000];
read(fd, integers, 100000*4);
And then you already have the data in the "deserialised" representation you need to work on it.
Using the "no-op" endianness conversion would force us to write the code like this:
uint32_t *integers = new uint32_t[100000];
unsigned char *buf = new unsigned char[100000*4];
read(fd, buf, 100000*4);
for (size_t i = 0; i < 100000; ++i) {
    unsigned char *val = &buf[i*4];
    integers[i] = (val[0]<<0) | (val[1]<<8) | (val[2]<<16) | (val[3]<<24);
}
A "sufficiently smart compiler" (maybe Clang at -O2, but apparently not GCC at -O2) might recognise that the loop is just a memcpy. Still, that's an unnecessary copy.
An obvious optimisation here would be to use the same buffer as the input and output, converting byte order in place. A "sufficiently smart compiler" could then even recognise that the endianness conversion is a no-op and elide the whole loop. But... this runs afoul of the complaints described in the article, and invites a ntoh*-style API.
There's a middle ground. You can read the data from fd into a byte buffer, and then, instead of using the array integers, provide a "getter" that takes an index i, multiplies it by 4, and parses the data at that offset. Provided you're working in a high-level language, this can often be abstracted away and be just as clean as a native array. The bonus here is that you can use more complicated types and can even specify the layout at runtime. That is the approach many zero-cost deserialization libraries use; see e.g. rkyv for Rust.
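As a rough sketch of what such a getter can look like in C++ (the names here are made up, and little-endian storage is assumed):

#include <cstddef>
#include <cstdint>

// View over a byte buffer holding little-endian u32s; each access
// parses the 4 bytes at offset i*4 on the fly.
struct U32View {
    const unsigned char *data;
    std::size_t count;
    std::uint32_t operator[](std::size_t i) const {
        const unsigned char *p = data + i * 4;
        return (std::uint32_t)p[0] | (std::uint32_t)p[1] << 8
             | (std::uint32_t)p[2] << 16 | (std::uint32_t)p[3] << 24;
    }
};

You read() into a byte buffer once, then wrap it as U32View{buf, n} and index it like an array.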
This is not really zero-cost even if the API can certainly look nice. And that's only the most basic of cases. Now imagine you have something like variant<int, double>
Well, it's as zero-cost as possible given that conversion needs to happen somewhere. On LE machines, the getter should compile to a mov, and on BE machines, it should compile to a mov and a byte swap. (Bonus points to ensure the mov can be aligned.) Arguably, deserializing data as you're parsing it is even cheaper than pre-deserializing it and fixing endianness at the beginning of the algorithm, since the former only introduces an operation on registers (byte swap), while the latter also requires reading and writing memory.
variant<int, double>
I think this might even support my point. How variant<int, double> is encoded in storage and at runtime might reasonably differ -- for example, you might want to use NaN boxing to store the variant in 8 bytes in storage, but decompress it to std::variant at runtime for compatibility and simplicity. With a getter, a good compiler can inline the parsing logic and produce more efficient code.
Modern ISAs usually have byte-reversed load/store instructions, so the load and swap would be a single instruction.
Yeah, that would work, with the benefit that you don't need to do a copy. The downside is that it can't work with an algorithm that just expects contiguous integers in memory. (And the efficient instructions to load data from memory into a vector register assume contiguous memory with native endianness.)
Hm... in this case, I think you can still write something like:
unsigned char *buf = new unsigned char[100000*4];
read(fd, buf, 100000*4);
uint32_t *integers = (uint32_t*)buf;
for (size_t i = 0; i < 100000; ++i) {
    unsigned char *val = &buf[i*4];
    integers[i] = (val[0]<<0) | (val[1]<<8) | (val[2]<<16) | (val[3]<<24);
}
It's ugly, sure, but it demonstrates that this is at least possible to implement. IIRC, Rust even has some tricks where mapping a vector with .into_iter().map(...).collect() does not allocate a new vector but reuses an existing one as long as the sizes of the input and output types match, so this could be made idiomatic (though don't quote me on that).
I'm pretty sure that's an aliasing violation.
What you can do is:
uint32_t *integers = new uint32_t[100000];
read(fd, (unsigned char *)integers, 100000*4);
for (size_t i = 0; i < 100000; ++i) {
    unsigned char *val = (unsigned char *)&integers[i];
    integers[i] = (val[0]<<0) | (val[1]<<8) | (val[2]<<16) | (val[3]<<24);
}
This should be legal because char and unsigned char can alias any type (it's just the inverse that's illegal).
But... don't we now run into the exact complaints of the article? We're conflating the data type used for manipulation and the serialization format, by reading the serialized data into an array of uint32_t. The integers in the array are nonsensical before the loop, yet we call them uint32_t.
Though I guess it could be an alright compromise. It seems like GCC on -O2 does optimize away the loop, so it's fine from a performance perspective: https://godbolt.org/z/n3oxaoKn5
I'm pretty sure that's an aliasing violation.
Ah, right, my bad. Didn't think this through, I'm a Rust programmer by trade.
But... don't we now run into the exact complaints of the article?
That's true. I suppose an (arguably) better implementation would be:
unsigned char *buf = new unsigned char[100000*4];
read(fd, buf, 100000*4);
for (size_t i = 0; i < 100000; ++i) {
    unsigned char *val = &buf[i*4];
    new(val) uint32_t((val[0]<<0) | (val[1]<<8) | (val[2]<<16) | (val[3]<<24));
}
uint32_t *integers = std::launder((uint32_t*)buf);
...but that is understandably niche.
I think the fact that this weirdness is required to stay type safe is a signal that C++'s type system doesn't want us to do that. This is odd for a language claiming zero-cost abstractions, but perhaps that's more of a note about typed memory being a mistake in general than anything else. I don't know.
Our field is very broad, stretching from human-timescale tasks such as interpreting keypresses down to machine-scale tasks like efficiently handling gigabytes of data per second.
Yes, there are cases in which proper deserialisation really does impose a significant overhead. Reading 248,832 32-bit integers from a socket and summing them together is an example. But the relative cost of deserialisation shrinks as the amount of work to be done increases. Even in the summation example there’s a good chance that it’s not just a single assembly ADD instruction: with hundreds of thousands of 32-bit integers, the odds are pretty decent that the result will overflow, and so there need to be instructions to handle that, for each addition.
But for many use cases performance is not as important, and correctness and resiliency are. As a general rule, it is preferable to deserialise rather than to assume that input data are well-formed. Programs written like that do perform worse, but the negative impact of running slowly is often not as bad as that of running incorrectly, or responding improperly to poorly-formed or malicious inputs.
The traditional expectation in C was that a struct maps obviously to memory, and that one should just dump bytes from memory into a file or socket or whatever. In that world it makes sense to in-place modify integers to some standard endianness and dump them as before. But that way of doing things is brittle. For most software, most of the time, it’s preferable to have an explicit deserialisation/parsing step in which one performs some input validation (for example, are all integers allowed, or only a subset?). Dumping structs full of null-terminated strings and binary integers is too low-level and error-prone for a lot of what we do.
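For example, a parsing step might look like this toy sketch (the format and field names are invented), where malformed input is rejected instead of trusted:

#include <cstdint>
#include <optional>

struct Header {
    std::uint32_t version;  // toy format: only versions 1 and 2 exist
    std::uint32_t length;
};

// Parse 8 bytes of little-endian input, validating as we go.
std::optional<Header> parse_header(const unsigned char *p) {
    auto u32 = [](const unsigned char *q) {
        return (std::uint32_t)q[0] | (std::uint32_t)q[1] << 8
             | (std::uint32_t)q[2] << 16 | (std::uint32_t)q[3] << 24;
    };
    Header h{u32(p), u32(p + 4)};
    if (h.version != 1 && h.version != 2) return std::nullopt;
    if (h.length > (1u << 20)) return std::nullopt;  // arbitrary sanity cap
    return h;
}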
It's funny how we've become accustomed to thinking of an arbitrary number of bits (8) as the canonical unit of addressing. I think bit-endianness is more natural than byte-endianness, and if we had bit-level addressing then endianness representation would probably generate less misunderstanding.
Yeah, endianness should be taught as a bit order thing. Byte order is a consequence of bit order and chunking into bytes. But all the introductory and reference material ignores bit order, perhaps because even when we get to deal with bit-serial data connections the programming interface is just bytes. This leads to things like Erlang’s binary syntax (which is mostly very good) screwing up little-endian bit-packing.
There are good engineering reasons that the IBM 360 chose the 8 bit byte; it isn’t completely arbitrary:
Bit addressing makes hardware much more expensive
Bytes need to be a power of two so that bit addressing is reasonably cheap to do in software
Word addressing is inconvenient and slow for character data
An 8 bit byte is the smallest that’s efficient for character data
Some of these considerations have become a little less compelling since the early 1960s. It might be more feasible to do bit addressing now, if it can be designed not to affect the memory system outside a cpu core. (But bit addressing was one of the things that contributed to the Intel 432’s demise.) And in the 1990s Unicode made an argument for 16 bit bytes, but that turned out to be really inconvenient after 3 decades of 8 bit bytes.
8 bit bytes remain the sweet spot.
I mostly agree, but want to add something. When we're talking about byte encoding and byte endianness, we're discussing serializing integers -- i.e. an abstraction -- into bytes -- which are also an abstraction! So the concept of byte endianness is independent of bit endianness. Sure, it might be easier to work with long data on the physical level if the bit and byte endianness agree, but this exclusively applies when you're sending, say, 32-bit values dumped from successive bytes in memory in a serial format, not when you're working with individual bytes in hardware or long numbers in software.
So the concept of byte endianness is independent of bit endianness.
This kind of thinking is how Erlang fucked up their binary syntax. If you treat endianness as something specific to a particular data width then you get other fuckups like DEC and ARM mixed-endian floating point, UUID erratic byte ordering, etc.
If you treat endianness by starting with bit ordering, treating data consistently as a bit string in a fixed order that is chunked into larger data types, then you get sensible endianness in all situations, including when fields are not a multiple of 8 bits in size.
This is the point that Danny Cohen made in 1980 in his essay about the endians.
Strictly, in C one can take (pointers to) arrays as parameters; it is arrays as return values that are not supported.
#include <stdint.h>
#include <stdio.h>

typedef uint8_t (Ser32a)[4];

void a_serialize(uint32_t in, Ser32a *out) {
    (*out)[0] = in & 0xff; (*out)[1] = (in >> 8) & 0xff;
    (*out)[2] = (in >> 16) & 0xff; (*out)[3] = (in >> 24) & 0xff;
}

void a_deserialize(Ser32a const *in, uint32_t *out) {
    uint32_t val = (*in)[0]; val |= (uint32_t)(*in)[1] << 8;
    val |= (uint32_t)(*in)[2] << 16; val |= (uint32_t)(*in)[3] << 24;
    *out = val;
}

int main() {
    Ser32a buffer;
    const uint32_t input = 0x31323334;
    a_serialize(input, &buffer);
    printf("A: input %#x => %.4s\n", input, buffer);
    uint32_t output = 0;
    a_deserialize(&buffer, &output);
    printf("A: output %.4s => %#x\n", buffer, output);
    return 0;
}
However, yeah - one would usually want to wrap the array in a struct; then it can be passed in or returned.
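Roughly like this (a sketch, with a made-up name):

#include <stdint.h>

typedef struct { uint8_t bytes[4]; } Ser32;

/* Because the array lives inside a struct, it can be returned by value. */
Ser32 ser32_from_u32(uint32_t in) {
    Ser32 out = { { (uint8_t)(in & 0xff), (uint8_t)((in >> 8) & 0xff),
                    (uint8_t)((in >> 16) & 0xff), (uint8_t)((in >> 24) & 0xff) } };
    return out;
}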
Very true. Forgetting to convert the int before setting sin_port on a sockaddr_in before bind(). Happened to me multiple times, and I was confused every time.
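The spot I mean, roughly (a minimal sketch of the usual setup before bind()):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

int bind_to_port(int sockfd, uint16_t port) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);  /* forget htons() and port 8080 quietly becomes 36895 on LE */
    return bind(sockfd, (struct sockaddr *)&addr, sizeof addr);
}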
What I want to know is, does my CPU write the bits in a byte from left to right or right to left?
Y'all need new words for little endian, big endian, and endianness. They seem like historical artifacts from a couple of early programmers just starting to talk about such things and making up words on the spot that never got improved.
There’s nothing wrong with endianness terminology: it’s straightforwardly descriptive and has a fun literary reference. It’s better than the alternatives.
it’s straightforwardly descriptive
...I actually didn't realize these names are descriptive - how "little"/"big" refer to the byte you "start" with - up until now. I only knew which one was which by dumb memorization (my mnemonic was that x86 is LE, network is BE, and I went from there).
I even knew the literary reference! The analogy just didn't quite connect all the way for me.
I think the terminology is fine, but you could call it "most significant byte first" and "most significant byte last" and it would be relatively clear. You could abbreviate that to sig-first and sig-last.