The future of Python web services looks GIL-free

41 points by gi0baro

All these years past and it still puzzles me why people have this weird fetiche of fiddling around with asynchronous Io manually with confusing semantics.

What exactly is the problem por writing synchronous applications with wsgi and deploy them to web server than handles concurrency asynchronously? Nginx, eventlet, decent, etc.

lpil

The wsgi approach has orders of magnitude higher resource use per request, due to needing an entire VM instance per request, and a much higher context switching cost, due to using the operating system's scheduler.

This results in higher server costs, worse latency, an inability to prevent multi slow requests from impacting overall site performance, and much worse ability to handle changes in amounts of traffic.
- andyc
  
  This seems a bit misleading … The only way to use all your cores is with your OS scheduler, whether with processes or threads. (Threads in python come with the downside of the GIL of course, so in that case you need processes)
  
  Machines have 16, 32, .. 128 cores now
  
  The OS scheduler doesn’t just context switch; it also decides what cores processes and thread run on. You can have true parallelism with zero context switching.
- Forty-Bot
  
  an inability to prevent multi slow requests from impacting overall site performance
  
  OK, but multiple slow requests are going to impact site performance any way you slice it. If they're slow due to CPU, then there's no advantage to async because CPU-bound tasks run in the foreground either way and have no scheduling overhead. If they're slow due to I/O then async will be faster only if the I/O subsystem can handle more concurrency than you have CPU cores. And the case for async only becomes important when there is a big imbalance.
  
  IMO the only place where you really need asyncio is if you have long-lived connections with the client, like websockets. For normal web applications, it's perfectly fine to push the async down to the HTTP proxy and use a process/thread pool.
  - rtpg
    
    If you have "most" requests take ~100ms, but then some take 10s, then those 10s requests are taking 100x the time, even without necessarily taking 100x the I/O etc.
    
    Things like caching, having different databases for different tenants, and the general heterogeneity of workloads etc mean that your pipeline for processing requests can branch in many different parts of the stack. But if we're paying 400 megs+ per "frontend worker" (WSGI instance) suddenly that can be a massive bottleneck.
    
    by having cheap per-worker counts you can end up doing a lot of interesting Quality of Service stuff lower down the stack. Things like spinning up a bunch but then distributing them in ways to make each kind of worker deal with only a subset of some kind of work. If workers are expensive... harder to do that, and you have to overprovision most of them to deal with this.
    
    Obviously not a panacea by any stretch, but wehen you have super heterogenous workloads then being able to spin up many nodes gives you a lot of operational flexibility, even if your downstream I/O is still limited
    
    Forty-Bot
    
    But if we're paying 400 megs+ per "frontend worker" (WSGI instance) suddenly that can be a massive bottleneck.
    
    What? At least on the server I run each wsgi "instance" (uwsgi worker) is 65M rss.
    
    but then some take 10s, then those 10s requests are taking 100x the time, even without necessarily taking 100x the I/O etc.
    
    OK, so if it doesn't take 100x the I/O, that's better for synchronous workers, since CPU is inherently synchronous. But even assuming that you're doing a lot of I/O, you can just provision enough workers to absorb a few long requests. async only wins when every request is I/O intensive to the point that you want to have major oversubscription. Maybe that happens if you go all in on microservices.
    
    rtpg
    
    enterprise SaaS monolith WSGI instances can be chunky. 200 megs is something I see normally, and I've seen... uhhh... much more from time to time.
    
    But even assuming that you're doing a lot of I/O, you can just provision enough workers to absorb a few long requests. async only wins when every request is I/O intensive to the point that you want to have major oversubscription
    
    even at 200 megs it's like... 2 gigs for 11 workers, right? But it could just be 200.
    
    At the Enterprise SaaS "lots of kinda wasteful I/O" level you can really easily overrun this stuff at awkward moments. When we could... just not!
    
    Another aside: frontend workers are one thing, but if you also have background workers using a shared codebase then there the heterogenous stuff gets even worse! Some sync process that is dependent on some (slow) vendor API call in bulk operations.... do you really want workers to lock up?
    
    There's obviously costs here but the benefits are so obvious to anyone who has had very annoying perf issues almost entirely downstream of all their stuff getting locked up by I/O in mixed workload configurations
    
    Forty-Bot
    
    OK, but these are process-based WSGI instances. If you're I/O bound you can perfectly-well use thread-based instances, which should have a much smaller memory footprint. I am less familiar with those because my workload is CPU bound.
    
    pm
    
    You completely ignored everything I said.
    
    The wsgi approach has orders of magnitude higher resource use per request, due to needing an entire VM instance per request, and a much higher context switching cost, due to using the operating system's scheduler.
    
    This is absolutely not true and virtually every single wsgi application out there runs on some other concurrency model other than fork. I have worked with hundreds of web services and it had been almost 20 years since I last saw a deployment on a fork based service pr even on operative system threads for that matter. I'm talking about several years before ASGI even existed.
    
    I guess the answer is people are really confused.
    
    lpil
    
    It is true though. They all have much higher overhead.
    
    pm
    
    Of course they don't have higher overhead. Why would they?
    
    WSGI and ASGI are just two standardized function signatures. Two interfaces if you will.
    
    You can even write your own basic wsgi server using async python. Many wsgi did just that.
    
    andyc
    
    people have this weird fetiche
    
    Have you considered that people need to solve problems other than the ones you've worked on?
    
    asynchronous Io manually with confusing semantics
    
    I would not call async I/O "manual" ... I would put it on a spectrum:
    
    nginx-style evented C code / state machines - the most manual
    
    node.js style callbacks - slightly less manual, you can inline functions
    
    JS async/await and Python asyncio - now you get your stack back
    
    ...
    
    What exactly is the problem por writing synchronous applications with wsgi and deploy them to web server than handles concurrency asynchronously? Nginx, eventlet, decent, etc.
    
    Let's ignore resource usage, since I think that's not the most important reason
    
    asyncio is useful for:
    
    waiting for responses from multiple backends in parallel, with timeouts and cancellation
    
    supervising child processes robustly, with timeouts and cancellation
    
    threads and processes don't mix
    
    backpressure / admission control
    
    Many people work behind a proxy like nginx, with request-level timeouts. That is a good default, because it's simple and you don't really have to think about it. Your code is kinda bad, and that's OK because it's handled at a higher level.
    
    But to write robust distributed applications, you generally need more control than that.
    
    atk
    
    I wouldn't put more efficient use of resources as less important than the reasons you listed.
    
    This might not matter for some use-cases, but I don't think there are that many of those use-cases where it doesn't matter.
    
    pm
    
    Have you considered that people need to solve problems other than the ones you've worked on?
    
    Of course I have. I just can't find anyone that enlightens me of what exactly is the need to write custom async code at the business logic level of an HTTP server, so to speak
    
    nginx-style evented C code / state machines - the most manual
    
    Well yes. But nginx os provided to me as ready to use product. Its whole value offer is that they do the heavy lifting concurrency work for me so I don't need to. I think it's great that they did it and that it works. So I don't need to do it myself. That's my point.
    
    JS async/await and Python asyncio - now you get your stack back
    
    The browsers JS engine has always been predominantly async by default. It was created to handle UI events on a webpage, it wouldn't make sense to create a bunch of APIs that block the UI. And of course XMLHttprequest() although older, was obscure till the mid 2000s. And other than that there was no IO. So I don't think it makes sense to talk about concurrency models the same way when it comes to JS. They just went with async handling of any event.
    
    Then along came nodejs which uses a JS engine extracted form a browser.
    
    Let's ignore resource usage, since I think that's not the most important reason
    
    I am not sure you understood my previous post. It's not a less important lesson. But rather it doesn't exist if you deploy your wsgi application in uwsgi (for example). It won't copy stacks around nor fork the python VM. It used green threads.
    
    andyc
    
    Of course I have. I just can't find anyone that enlightens me of what exactly is the need to write custom async code at the business logic level of an HTTP server, so to speak
    
    I answered that in the part of the reply you didn't quote
    
    untitaker
    
    it's not clear to me what approach you're advocating for, but I'll try to reply.
    
    people do not like gevent/eventlet in combination with sync code because the amount of monkey patching one has to do makes it fairly brittle. people do not like multiprocessing due to very high RAM usage and bad startup times. Both of those downsides are actually briefly mentioned in the last few paragraphs.
    
    pm
    
    The second approach is virtually non existent. I haven't seen anyone running web services like that since the early 2000s nor do I know any production ready wsgi server that does it.
    
    I am advocating for the first one. Which to my knowledge is how must applications are still deployed. What exactly is brittle about it? I have never had problem with them. And I've used tornado, gevent, eventlet, nginx, cherry.py, etc since they were first released.
    
    You are saying that ASGI is better than something that doesn't exist practically speaking. This is what confuses me.
    
    untitaker
    
    uwsgi works this way, very bold claim to say everybody is using something like gevent. we (sentry) use it extensively in production.
    
    What exactly is brittle about it?
    
    what is brittle about monkeypatching the stdlib? well, for example when contextvars came out they plainly didn't work correctly on gevent or eventlet, and in case of the latter it took such a long time to also fix that (by adding more monkeypatching) that I thought the project was dead. until then and until people upgraded, contextvars were simply not usable in web workers, and any library that relied on it would exhibit concurrency bugs.
    
    pm
    
    WSGI is just the interface. Of course everyone deploys their applications on a setup with async Io based concurrency. Which wsgi servers have you seen used in production and with which configuration?
    
    untitaker
    
    read again, I said uwsgi (pinky promise I didn't edit that in)
    
    pm
    
    I red wrong yes. But my claim is not that people use gevent only. In fact I would say nginx or gunicorn are more popular. I am claiming that virtually everyone deploys their wsgi applications in servers whith concurrency power by async primitives. Such as uwsgi which AFAIK is green threads base.
    
    If the webserver does this for you and had done for almost two decades, why are people so eager in writing async code?
    
    untitaker
    
    uwsgi does not use green threads as a concurrency model for the application and I don't know what you mean when you compare nginx and gunicorn.
    
    I think you are conflating "how does the server manage sockets" with "how does the server manage concurrency of the application". the former is not relevant to this conversation.
    
    gi0baro
    
    I am claiming that virtually everyone deploys their wsgi applications in servers whith concurrency power by async primitives. Such as uwsgi which AFAIK is green threads base.
    
    It's not. It's real OS thread based. And due to that, you can't escape the GIL. So unless you have something like gevent, you have no async based primitives.
    
    If the webserver does this for you and had done for almost two decades
    
    They don't. They never did. That's the thing. Having async code is just the modern approach to having gevent/eventlets running on an event loop.
    
    pm
    
    It's not. It's real OS thread based
    
    It is. You can pick you concurrency engine.
    
    It very clearly states that it supports a multitude of async concurrency engines as its selling point on its landing page.
    
    Which even further proves my point: you can develop your application normally using sync functions and deploy it to a wsgi compatible server, which even has pluggable concurrency engines.
    
    This was also Guido's motivation to introduce wsgi. A common interface that servers can implement that will be compatible with any application implementing that simple web application interface.
    
    Personally, I never seen anyone using uwsgi (or other wsgi server for that matter) using fork or threads based concurrency.
    
    gi0baro
    
    It very clearly states that it supports a multitude of async concurrency engines as its selling point on its landing page.
    
    You can't use --async and the async backends in uWSGI unless you either:
    
    have yield statements in your app every time you do something blocking (and good luck using a framework with that, given none of them is designed around this)
    
    manually call uwsgi.suspend() in your code
    
    use gevent, which again requires you to mokeypatch your code
    
    Stating you can get async concurrency over WSGI code just using a server is wrong and you're just spreading misinformation. You can have whatever non-blocking server on top of your code, but if that code is blocking, it won't change a thing compared to have everything blocking. 'cause the non-blocking part will just wait for the blocking one to complete, increasing latency.
    
    I also don't get what you're trying to convey here, all these discussions are quite off-topic from the OP; seems to me you're trying to steer the discussion somewhere else, and I don't get why.
    
    pm
    
    Not sure about uwsgi, I would have to read its code, but eventlet and gevent do exactly that by means of monkey patching low level IO primitives. I believe you could also achieve similar results by wrapping all calls to your wsgi application with a set of async executor. I am sure there are other ways too which do not occur to me at the moment.
    
    scraps
    
    a handful of things but database connections can’t easily be shared across workers when the workers are processes. If your model is each worker handles one request at a time, you wind up with M*N connections where M and N are the number of nodes and number of workers. If you can share the connections using an async event loop model you can lower the connection count because each process is handling more things concurrently, and sharing within the process is easier. And it makes it much easier to, you know, do concurrent things, like make two API calls to a third party concurrently and then join them.
    
    calvin
    
    PHP did try to work around this with persistent connections shared between worker processes, but they can be pretty error prone in my experience.
    
    scraps
    
    ha, I’m probably one of few people who would agree with praising the PHP runtime, I was on the core platform team at Etsy years ago. PHP had basically all this figured out years ago and you might see a lot of its ideas come back around again in Python. 3.14’s InterpreterPoolExecutor has a lot of what’s required to mimic mod_php’s shared-nothing concept, although mod_php has the opcode cache, I’m not sure if InterpreterPoolExecutor has that.
    
    https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.InterpreterPoolExecutor
    
    calvin
    
    I've got an @php.net, I actually like the PHP runtime implementation. PHP's structured request lifecycle makes a lot of sense for what it's good at, and I'm surprised no one else imitated it.
    
    Corbin
    
    There was no problem. Rather, for speed, the fastest tools in Python happened to be async tools. The fastest Web framework ever measured was Cyclone, which married Tornado's API to Twisted's netcode. Twisted's Web dispatch is slow because it's hierarchical, whereas Tornado uses regexes to match routes; the combination could run on PyPy and was faster than everybody else's offering. The only nuance to the semantics is that events can be partially ordered instead of totally ordered; previously, on Lobsters we recently dug into this. Python's async semantics are pretty simple compared to e.g. E's async semantics!
    
    That said, nobody really valued dispatch speed as the primary metric, so Cyclone's long-since rotted and I can't find a copy of cyclone.io/images/benchmark.png or other evidence. The speed fetish is still there; many of us still use PyPy. But I never adopted Cyclone because Twisted Web may be slower but it is also more modular and compositional, allowing much better code reuse.
    
    pm
    
    That said, nobody really valued dispatch speed as the primary metric
    
    Because it effectively isn't. Except for esoteric rare cases, dispatch speed is orders of magnitude faster than IO.
    
    talideon
    
    I know they bring up stuff based on asyncio in the post, but free threading has nothing to do with asynchronous I/O. Your comment is something of a non sequitor. It's perfectly legitimate to benchmarks to approaches to I/O (blocking and non-blocking) against a change like this as it may uncover either bugs or hidden locking, or the very opposite.
    
    pm
    
    The post is comparing the performance of WSGI applications (with or without gil) and ASGI applications. It is everything to do with async IO, it is essentially comparing how threading compares with async io.
    
    They put a wait() call to simulate IO, this ont be picked up by the event loop so I don't think it is a good benchmark.
    
    But to my point, what I Don't get is this endless comparisons when you can just deploy a wsgi on an async web server. Which virtually everyone does.
    
    gi0baro
    
    The post is comparing the performance of WSGI applications (with or without gil) and ASGI applications
    
    You're deliberately misreading me. The post compares CPython 3.14 and 3.14t on a WSGI application and an ASGI application. At no point I compared WSGI and ASGI.
    
    And I honestly don't understand why you're trying to steer the discussion towards that direction.
    
    vegai
    
    No problem at all, but I believe threading is useful for entirely different things than web service handlers. Although python is almost certainly not suited for those things...
    
    ngoldbaum
    
    Would you mind if we link to this post from here? I’d also really appreciate your input - or anyone reading this - for the docs as a whole from a Python web developer perspective.
    
    gi0baro
    
    Sure thing, feel free to link it wherever you prefer.
    
    As for the docs, I'm not sure what exactly you're asking for.. WSGI has its own PEP, ASGI has its own documentation. But mostly you would read the documentation of the specific web framework you're using.
    
    ngoldbaum
    
    Sorry, I meant the docs in the free-threaded Python guide. Particularly the parts about pure-Python thread safety and multithreaded testing.
    
    gi0baro
    
    Now I see.
    
    Gonna read in the next few days and get back to you with feedbacks (if any).
    
    alper
    
    I don't think I can figure out what the difference between 3.14 and 3.14t is. It's not in the article.
    
    0x2ba22e11
    
    Same. The rows labelled "3.14" are measured with the GIL, not free-threaded. The rows labelled "3.14t" are measured with free-threading, without the GIL.
    
    (Also, in case it wasn't obvious: python3.14 allows switching the GIL on or off optionally because there's a small single thread performance penalty for turning it off.)
    
    martinkirch
    
    My interpretation of this benchmark is the opposite : numbers suggest it's not worth leaving classic workers. Especially if it's slower. The whole point of "real" threads is the ability to share memory between active cores: that might allow architectural changes. A come-back of stateful apps! I'm surprised it's not mentioned. Still, after 2 decades of stateless microservices I dont expect much people to play such dark magic either.
    
    scraps
    
    The whole point of "real" threads is the ability to share memory between active cores
    
    I think the more common need in HTTP servers is to share a connection pool to a database, since what state is shared is likely going to be held in a database and not in process memory.
    
    A come-back of stateful apps!
    
    Web apps, when you consider the entire stack, aren’t really stateless, the state is just in a database. The HTTP server (the app layer) is a stateless layer of the system but it’s not the whole system.
    
    As for stateful app layers where engineers are regularly working on a stateful layer, they already came back, that’s what websocket is. Most websocket applications have some variety of statefulness in them. But making your app layer stateful introduces new operational challenges that are generally best avoided if there’s not a meaningful reason why a stateful app layer is useful and you can use a stateless app layer with all of your state contained in some database or other appliance.
    
    gi0baro
    
    numbers suggest it's not worth leaving classic workers. Especially if it's slower.
    
    Not all the numbers are worse. Especially given 99.9999% of Python web services out there have an average response time >> 10ms. I'd also say it vastly depends on what you value the most. As I wrote in the last section, less memory usage and spending less time playing with threads in production might actually be more valuable than a 20% reduction in raw throughput.
    
    The whole point of "real" threads is the ability to share memory between active cores: that might allow architectural changes
    
    I don't consider that as the whole point. As a matter of fact, I can't foresee any major architectural change given it won't be usable the moment you need to scale above 1 machine.