Crawling a billion web pages in just over 24 hours
12 points by gerikson
TLS handshakes being a bottleneck seems like the logical consequence of having enormous 60s+ crawl delays and likely not pooling the connections.
It's questionable how polite this actually is. Handshakes are quite resource intensive on both ends.
Realistically, if you're crawling at a rate of 1 billion pages per day, you aren't being polite at all. Such a crawl rate is far too aggressive and will land you on a lot of user-agent shitlists in a real hurry, since rate limits typically need to be enforced at the first-level-domain level: different subdomains generally can't be treated as different websites.
> one part of fetching got harder: a LOT more websites use SSL now than a decade ago. This was crystal clear in profiles, with SSL handshake computation showing up as the most expensive function call, taking up a whopping 25% of all CPU time
I doubt I'm the first to have this idea, but this made me wonder whether servers could use TLS handshakes as a proof of work for clients. If a cipher suite existed where the handshake required significant effort for clients, but little for servers, a server might choose to only support that cipher suite, or to treat connections using it preferentially to better handle badly behaved user agents.
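No such cipher suite exists in TLS today; the asymmetry being described is essentially a hash-based client puzzle, cheap to issue and verify but expensive to solve. A toy sketch in Python (the names and the protocol framing are illustrative, not a real TLS extension):

```python
import hashlib
import os

DIFFICULTY_BITS = 20  # default: client must find a hash with 20 leading zero bits

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def make_challenge() -> bytes:
    """Server side: one cheap random read per connection."""
    return os.urandom(16)

def solve(challenge: bytes, bits: int = DIFFICULTY_BITS) -> int:
    """Client side: brute-force a nonce; costs ~2**bits hashes on average."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= bits:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, bits: int = DIFFICULTY_BITS) -> bool:
    """Server side: a single hash to check, regardless of difficulty."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= bits
```

The server spends one hash per verification while the client spends ~2**bits on average, which is the asymmetry the comment is after. The replies below are about why the economics of this still don't work out.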
The economics of any kind of compute-based[0] proof of work for counter-abuse are completely unworkable.
Real users end up paying in user-visible latency, which is really expensive. Scrapers would pay in compute, which is fungible and dirt-cheap. To get an intuition for the gap, what is the value of one hour of your time? The value of one hour of a server grade CPU core is about 1 cent.
[0] Proof of work by wasting bandwidth might have a tiny intersection of applications and threat models where it is viable, but even then it is not deployable just from a PR angle.
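A back-of-envelope check of that gap, using assumed numbers (the ~1 cent/hour vCPU figure is the claim above; the 2-second puzzle cost is an illustrative assumption):

```python
# Rough economics of compute-based proof of work, with assumed prices.
VCPU_COST_PER_HOUR = 0.01   # assumed: cheap cloud/spot vCPU, ~1 cent/hour
PUZZLE_SECONDS = 2.0        # assumed: PoW tuned to cost a visitor 2s of latency
PAGES = 1_000_000_000       # the crawl size from the article

cpu_hours = PAGES * PUZZLE_SECONDS / 3600
scraper_cost = cpu_hours * VCPU_COST_PER_HOUR
print(f"{cpu_hours:,.0f} CPU-hours, about ${scraper_cost:,.0f} for the whole crawl")
```

Under those assumptions the entire billion-page crawl costs the scraper on the order of a few thousand dollars of compute, while every human visitor pays the same two seconds of latency per page.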
The other problem is that, even if the numbers you just mentioned change, there's a huge incentive for anyone scraping at scale (because they, e.g., have AI VC money) to just deploy a hardware accelerator. Surely at scale it would be cheaper to develop one than to burn CPU on an expensive cipher.
(That being said, I'd be interested where you got the 1 cent number from.)
So the idea is if it's computationally expensive to crawl pages then bots won't do it, or will do it too slowly to be practical? Would that also impact user experience on low-power devices like phones? Would that affect search engine crawlers?
> So the idea is if it's computationally expensive to crawl pages then bots won't do it, or will do it too slowly to be practical? Would that also impact user experience on low-power devices like phones?
Some sites have started using JavaScript-based proof of work tools to restrict access to Web content after becoming overwhelmed by badly behaved bots. I imagine the approach I describe would perform the same function but at the TLS layer.
I've seen people complain about existing proof of work tools on various devices, not only low power devices, so I imagine those problems would remain.
> Some sites have started using JavaScript-based proof of work tools to restrict access to Web content after becoming overwhelmed by badly behaved bots. I imagine the approach I describe would perform the same function but at the TLS layer.
I don't think that'd have the same function, really. IMO, what is stopping the badly written bots is just that they haven't built the logic to complete the proof-of-work, not that it'd be too expensive.
If it was at the TLS layer, the logic would be transparent to the bot, and the bot operators would pay the compute cost. So the scraping would happen anyway, and you'd have to ratchet the compute cost way up to truly rate-limit it. I believe it'd be much more computationally expensive for the client than Anubis and similar gadgets are now, and the result would be laggier pages for users, because they'd redo the proof-of-work on every handshake (whereas the Anubis determination can be stored in a long-lived cookie, so users rarely repeat it for the same site).
> If it was at the TLS layer, the logic would be transparent to the bot, and the bot operators would pay the compute cost.
Can't unethical scrapers just avoid that by using compromised hosts to scrape the content?
Oh, definitely! But I'm saying that I don't believe the compute cost is stopping them, whether it's because their money printers are still going brrrr or because they're using an army of compromised hosts. What's stopping them is the difference in workflow and the lack of shared state among their various scraping processes. Doing it in the TLS handshake would make it more expensive for human users and not easier to bypass for unethical scrapers.
That too, but I think hoistbypetard's point is that that wouldn't even be needed; compute is just that cheap.
> ...the result would be laggier pages for users (since the Anubis determination can be stored in a long-lived cookie, so users rarely do it repeatedly for the same site) because they'd redo the proof-of-work on every handshake.
Not that I disagree with the overall thrust of your argument, but this point in particular isn't true if the site implements TLS session resumption. Session tickets are a widely-available and easy-to-scale way to do this AFAIK.
Kind of sad that the only way to spin up a new web index, even if you want to crawl politely, is to turn to botnets, residential proxies, and the like.