Crawling a billion web pages in just over 24 hours

12 points by gerikson


marginalia

TLS handshakes being a bottleneck seems like the logical consequence of enormous 60s+ crawl delays and, likely, not pooling connections.

It's questionable how polite this actually is. Handshakes are quite resource-intensive on both ends.
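One way to cut the handshake cost is TLS session resumption: keep the session ticket from the first connection and pass it back on reconnect, so the server can skip the full key exchange. A minimal sketch with Python's standard `ssl` module (the hostname is illustrative, and whether resumption actually happens depends on the server):

```python
import socket
import ssl

HOST = "example.com"  # illustrative target, not from the original thread

ctx = ssl.create_default_context()

def fetch_head(session=None):
    """Open a TLS connection, optionally resuming a prior session.

    Returns the session object so it can be reused on the next call.
    """
    sock = socket.create_connection((HOST, 443))
    tls = ctx.wrap_socket(sock, server_hostname=HOST, session=session)
    tls.sendall(b"HEAD / HTTP/1.1\r\nHost: " + HOST.encode()
                + b"\r\nConnection: close\r\n\r\n")
    tls.recv(1024)
    resumable = tls.session  # capture the ticket for later resumption
    tls.close()
    return resumable

# Usage (commented out, requires network access):
# s1 = fetch_head()            # full handshake
# s2 = fetch_head(session=s1)  # abbreviated handshake if the server supports it
```

Connection pooling (HTTP keep-alive) sidesteps the problem even more directly by not reconnecting at all between requests to the same host.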

Realistically, if you're crawling at a rate of 1 billion pages per day, you aren't being polite at all. That crawl rate is far too aggressive and will land your user agent on a lot of shitlists in a real hurry, since rate limits typically need to be enforced at the registered-domain level: different subdomains generally can't be treated as separate websites.
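Enforcing the limit at the registered-domain level means blog.example.com and shop.example.com must share one crawl budget. A minimal sketch; the two-label heuristic for the registered domain is a simplification I'm assuming for brevity (real crawlers consult the Public Suffix List, e.g. via a package like `tldextract`, so that co.uk-style suffixes are handled correctly):

```python
import time
from collections import defaultdict
from urllib.parse import urlsplit

def registered_domain(url: str) -> str:
    """Naive registered-domain extraction: last two host labels.
    A real implementation would use the Public Suffix List."""
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

class DomainRateLimiter:
    """Blocks until at least min_delay seconds have passed since the
    last fetch against the same registered domain."""

    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay
        self.last_fetch = defaultdict(float)  # domain -> monotonic timestamp

    def wait(self, url: str) -> None:
        key = registered_domain(url)
        elapsed = time.monotonic() - self.last_fetch[key]
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_fetch[key] = time.monotonic()

# Both subdomains map to the same key, so they share one rate limit:
assert registered_domain("https://blog.example.com/a") == "example.com"
assert registered_domain("https://shop.example.com/b") == "example.com"
```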

tomhukins

> one part of fetching got harder: a LOT more websites use SSL now than a decade ago. This was crystal clear in profiles, with SSL handshake computation showing up as the most expensive function call, taking up a whopping 25% of all CPU time

I doubt I'm the first to have this idea, but this made me wonder whether servers could use TLS handshakes as a proof of work for clients. If a cipher suite existed where the handshake required significant effort from the client but little from the server, a server might choose to support only that cipher suite, or to treat connections using it preferentially, to better handle badly behaved user agents.
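No such cipher suite exists in TLS today, but the asymmetry described here is exactly what hashcash-style client puzzles provide: solving costs the client roughly 2^bits hash evaluations on average, while verification costs the server a single hash. A toy sketch of that mechanism (not a TLS extension, just an illustration of the cost asymmetry):

```python
import hashlib
import itertools
import os

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits in a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(challenge: bytes, bits: int) -> int:
    """Client side: find a nonce whose hash has `bits` leading zero bits.
    Expensive: ~2**bits attempts on average."""
    for nonce in itertools.count():
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= bits:
            return nonce

def verify(challenge: bytes, nonce: int, bits: int) -> bool:
    """Server side: one hash, regardless of difficulty."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= bits

challenge = os.urandom(16)   # server issues a fresh random challenge
nonce = solve(challenge, bits=12)
assert verify(challenge, nonce, bits=12)
```

The difficulty knob (`bits`) lets a server scale the client's cost up for suspect user agents while keeping its own verification cost constant.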