Size matters, even on very fast connections
43 points by maurycy
Curiously I couldn’t find discussion of the initial congestion window in RFC 9293… oh, the intro says congestion control is left to RFC 5681. RFC 6928 proposed increasing it to 10 segments and refers back to earlier RFCs that discuss what it should be.
Back in the 1990s Netcraft published monthly stats of UK ISP web server performance, part of which was a very basic speed test: they just fetched a suitably sized image from the ISP’s home page. At Demon we gamed the test by arranging to serve the image from RAM instead of disk, and (relevant to this post) changed the initial congestion window of the web server so that the whole image was sent in the first round trip. (Files were small in the days of dial-up.)
If a packet arrives out of order, the receiver sends an ACK for the highest contiguous data offset (for this example we're counting in packets instead of bytes). When the transmitter receives a duplicate ACK, it knows that one packet has left the network, so it is safe to put one more packet into the network without causing congestion, and it can transmit one more packet of new data. Once you have seen 3 duplicate ACKs, you assume the packet at that offset was lost, and retransmit that specific packet of data. (https://datatracker.ietf.org/doc/html/rfc5681#section-3.2)
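The duplicate-ACK bookkeeping above can be sketched as a toy model (function names are mine, counting in whole packets as in the comment, not a real TCP implementation):

```python
# Toy model of cumulative ACKs and the fast-retransmit trigger from
# RFC 5681 section 3.2. Counts in whole packets, not bytes.

def cumulative_acks(arrivals):
    """Yield the cumulative ACK sent after each arriving packet number."""
    received = set()
    highest = -1  # highest contiguous packet number seen so far
    for pkt in arrivals:
        received.add(pkt)
        while highest + 1 in received:
            highest += 1
        yield highest + 1  # ACK names the next packet the receiver expects

def fast_retransmit_target(acks, dupthresh=3):
    """Return the packet to retransmit once 3 duplicate ACKs are seen, else None."""
    dups = {}
    prev = None
    for ack in acks:
        if ack == prev:
            dups[ack] = dups.get(ack, 0) + 1
            if dups[ack] >= dupthresh:
                return ack  # retransmit just this one packet
        prev = ack
    return None

# Packet 2 is lost; packets 3, 4 and 5 each trigger a duplicate ACK for 2.
acks = list(cumulative_acks([0, 1, 3, 4, 5]))
print(acks)                          # [1, 2, 2, 2, 2]
print(fast_retransmit_target(acks))  # 2
```

Note the third, fourth and fifth arrivals all produce the same ACK: one original plus three duplicates, which is exactly the threshold.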
So as long as the lost packet was not the last packet, forward progress will still be made: once the three duplicate ACKs come back the lost packet is discovered and retransmitted, which is relatively quick. If the last packet is lost, however, you have to wait for a retransmission timeout, which is excruciatingly long (often a human-perceivable fraction of a second).
You must avoid timeouts at all costs. Timeouts are what makes networks feel slow.
The metric that really matters is the amount of time the user ends up waiting for operations to complete. Bandwidth is just a proxy for how long a large transfer takes. Fast but high-latency connections (e.g. via geostationary satellite) are often even more frustrating, as everything takes forever to get going.
(Yes, there is SACK, but SACK doesn't fundamentally change the above. It tells you which packets are missing, but you can already derive that from the duplicate ACKs anyway, and you still need to wait for the 3 duplicate ACKs before retransmitting, because you don't know whether the packets were merely reordered. It's that delay, waiting to see if the data was just reordered and is about to turn up, that hurts.)
I've never gone down this path of optimizing a site in this way. What tools do people use to check that their site works this way? Lighthouse scores (that's a thing, right...)?
Just looking at network tab?
> Just looking at network tab?
Often, yes. That's why it's there.
Outside of that, on Linux you can simulate a very bad connection using 'tc', which is more realistic than your browser's throttling function (which usually just adds a fixed delay and doesn't account for things like connection reuse or TCP congestion control).
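A minimal netem setup might look like this (interface name and numbers are placeholders; needs root):

```shell
# Emulate a flaky link on eth0: 200ms delay with 40ms jitter, 1% loss,
# and a modest rate cap. eth0 and the numbers are illustrative.
tc qdisc add dev eth0 root netem delay 200ms 40ms loss 1% rate 2mbit

# Inspect what's installed, then remove it when done.
tc qdisc show dev eth0
tc qdisc del dev eth0 root
```

Because netem sits below the TCP stack, loss and delay interact with real congestion control, which is exactly what browser throttling fails to reproduce.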
There's also Wireshark. You usually can't see the actual data, but it does show how much there is and exactly when it arrived. Although real networks are messy and can be confusing:
Even if you have a maximum packet size (MTU) of 1500, you can see 2802-byte "packets", because many network cards will merge TCP packets received within a short time interval:
2x 1434-byte packets -> 2736 bytes of payload -> one 2802-byte "packet"
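The arithmetic works out if you assume 66 bytes of per-packet header overhead (14 Ethernet + 20 IP + 32 TCP with the timestamp option, an assumption that matches the numbers quoted above):

```python
# Why a 1500-byte MTU can still show 2802-byte "packets" in a capture:
# the NIC merges two segments (GRO/LRO) before the capture point.
HEADERS = 66                       # assumed: 14 Ethernet + 20 IP + 32 TCP w/ timestamps
wire_packet = 1434                 # bytes on the wire per segment
payload = wire_packet - HEADERS    # TCP payload per segment
merged_payload = 2 * payload       # payload after merging two segments
merged_packet = merged_payload + HEADERS  # the merged "packet" the capture shows
print(payload, merged_payload, merged_packet)  # 1368 2736 2802
```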
... and that's just what's happening between your system and your wire. The internet is full of weird edge-cases and subtly broken systems.
I often used to think that sending your SWEs to a remote place was a great way to make the product more robust. Now I think just a long flight with invariably flaky wifi is a decent start.
If you are writing a website, that first 13 kB should include everything needed to render the first screen's worth of content. This includes inlining any critical CSS and/or JavaScript.
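As a rough budget check, assuming the 10-segment initial window from RFC 6928 and a typical ~1460-byte TCP payload per segment (function name and the overhead figure are illustrative, not from the article):

```python
# Rough first-round-trip budget: 10 segments x ~1460 bytes of TCP payload
# is 14,600 bytes; HTTP response headers and TLS records eat a share of
# that, hence the ~13 kB rule of thumb. Illustrative numbers, not a spec.
INITCWND_SEGMENTS = 10
MSS = 1460                                # typical Ethernet-path TCP payload size
first_flight = INITCWND_SEGMENTS * MSS    # 14600 bytes

def fits_first_flight(page_bytes, overhead=1600):
    """True if the critical HTML (with inlined CSS/JS) plus an assumed
    response-header/TLS overhead fits in the first congestion window."""
    return page_bytes + overhead <= first_flight

print(first_flight)               # 14600
print(fits_first_flight(13_000))  # True
print(fits_first_flight(14_000))  # False
```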
Worth noting this advice has aged well across the shift to QUIC -- the ~10 packet initial congestion window carried over.
But the real enemy is packet loss. TCP makes it catastrophic (global head-of-line blocking). QUIC mostly localizes it. Once you're on a bad connection, recovery behavior dominates pretty quickly.