How uv got so fast
111 points by xvello
Two tiny nitpicks:
Zero-copy deserialization. uv uses rkyv to deserialize cached data without copying it. The data format is the in-memory format. This is a Rust-specific technique.
This isn't Rust-specific at all, and there are libraries like FlatBuffers that achieve this in multiple languages!
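For a concrete picture, here's a minimal sketch of the technique using rkyv's 0.7-style derive API (the struct and field names are made up for illustration, not uv's actual cache schema):

```rust
use rkyv::{Archive, Deserialize, Serialize};

// Hypothetical cache entry, not uv's real schema.
#[derive(Archive, Serialize, Deserialize)]
struct CachedMetadata {
    name: String,
    version: String,
}

fn main() {
    let entry = CachedMetadata {
        name: "some-package".into(),
        version: "1.2.3".into(),
    };

    // Serialize once (e.g., when writing the cache file).
    let bytes = rkyv::to_bytes::<_, 256>(&entry).unwrap();

    // "Deserialize" by reinterpreting the bytes in place: no allocation,
    // no copying. Unsafe because we must trust the bytes are valid
    // (rkyv also offers validated access for untrusted input).
    let archived = unsafe { rkyv::archived_root::<CachedMetadata>(&bytes[..]) };
    assert_eq!(archived.name.as_str(), "some-package");
}
```

The archived form is the on-disk form, which is exactly the point the article is making.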
Lock-free concurrent data structures. Rust’s ownership model makes concurrent access safe without locks. Python’s GIL makes this difficult.
Just to note: Rust's ownership model means that concurrent access needs locks. This is why intrinsics like compare-and-swap are unsafe in Rust, and why lock-free libraries come with some sort of garbage collector. You can't drop things if someone else can still read them.
In particular, this is one of the reasons that the free-threaded build of Python doesn't immediately free objects upon hitting a refcount of 0, and uses an epoch-based reclamation system to actually know when an object is unreachable.
This is why intrinsics like compare and swap are unsafe in rust
Like the std::sync::atomic::Atomic*::compare_exchange method (I linked AtomicPtr here)?
the word 'intrinsic' here is load bearing
Ok, so what? The intrinsics are not even stable, and we do have the safe types available. Why would it matter that the intrinsics are marked unsafe?
i know you're gonna try and "aha but" this reply here but what exactly is safe about a *mut? how are you going to deref that?
the point i am making is that compare and swap can't be expressed safely under single ownership, not without some form of lock free reclamation (i.e. gc)
you can, at least, express a safe api over that lot (ala crossbeam), and that's the argument you should have tried first
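To make the distinction concrete, a toy sketch against std's stable API: the compare_exchange call itself is safe, but anything you do with the raw pointer afterwards is not (single-threaded here, so the manual free at the end happens to be sound):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

fn main() {
    let boxed = Box::into_raw(Box::new(42u32));
    let slot = AtomicPtr::new(ptr::null_mut::<u32>());

    // The CAS itself is a safe call...
    let _ = slot.compare_exchange(
        ptr::null_mut(),
        boxed,
        Ordering::AcqRel,
        Ordering::Acquire,
    );

    // ...but using the raw pointer it manages is not: the caller must
    // prove no other thread has freed (or is about to free) it.
    let p = slot.load(Ordering::Acquire);
    let value = unsafe { *p };
    println!("{value}");

    // Reclaim to avoid a leak; sound here only because we're single-threaded.
    unsafe { drop(Box::from_raw(p)) };
}
```

Which is exactly where the reclamation problem enters: deciding when that `Box::from_raw` is safe in the presence of other readers.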
I'm confused about the lock free concurrent data structures--how much time is actually saved by avoiding locking? Presumably locking and unlocking are very cheap unless it's being done in a tight loop? And if Rust's lock free data structures come with some sort of GC, isn't that GC likely to be more expensive than unlocking a mutex? What am I misunderstanding?
In this instance the actual problem is this: parallelism is hard in Python, as there's a global lock, and the free-threaded interpreter has yet to reach maturity. You're forced to use process-level parallelism, which involves copying things to and fro in order to make things work. That overhead can wipe out any speculative benefit from parallelism.
In Rust? You can parallelise things easily. Rust's single-ownership model means there's no global lock, and anything that's globally visible is safe to access from multiple threads at once.
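A minimal sketch of what that buys you in practice, using std's scoped threads (made-up workload, nothing uv-specific):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

fn main() {
    let total = AtomicU64::new(0);
    let total_ref = &total;

    // Scoped threads share `total` by reference: no pickling, no copying
    // between processes. The compiler proves the threads end before
    // `total` is dropped, and AtomicU64 is Sync, so this is all safe code.
    thread::scope(|s| {
        for chunk in [0u64..500, 500..1000] {
            s.spawn(move || {
                let sum: u64 = chunk.sum();
                total_ref.fetch_add(sum, Ordering::Relaxed);
            });
        }
    });

    assert_eq!(total.load(Ordering::Relaxed), (0u64..1000).sum::<u64>());
}
```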
(and my point is that this isn't the full story, and that rust does actually use locking for concurrency, or some form of gc with a global synchronization point or two, and you can't really do concurrency without some form of gc, as free threaded python shows)
As for locking and unlocking: what you're touching on is the difference between pessimistic (locked) and optimistic (lock-free-ish) concurrency. Pessimistic locking is always slower, except in situations of high contention. Lock-free access is much faster, but with far fewer guarantees of progress.
Pessimistic locking is always slower, except in situations of high contention.
This is not generally true, since atomic operations tend to be more expensive than their non-atomic variants. For mostly uncontended data structures, it is often cheaper to lock & unlock a mutex (2 cheap atomic ops when uncontended) and then operate without atomics than to pay the overhead of atomic ops for most internal accesses inside the data structure. Good examples in this talk by Fedor Pikus: https://www.youtube.com/watch?v=9hJkWwHDDxs
If you don't actually utilize the concurrency allowed by lock-free data structures (i.e., you don't have significant contention), it often doesn't make sense to go lock-free. Even when you need concurrency, you can often get away with more granular locking and still be faster than a fully lock-free solution.
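A toy illustration of the uncontended fast path (a rough micro-measurement, not a rigorous benchmark):

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::Instant;

fn main() {
    // Uncontended: taking the lock is one CAS, releasing it one atomic
    // store. Everything the HashMap does in between uses plain,
    // non-atomic memory operations.
    let m = Mutex::new(HashMap::<u32, u32>::new());

    let start = Instant::now();
    for i in 0..1_000_000u32 {
        m.lock().unwrap().insert(i, i);
    }
    println!("uncontended mutex: {:?}", start.elapsed());
}
```

A lock-free map would instead pay for atomics (and reclamation bookkeeping) on the internal accesses themselves, which is the trade-off the talk above digs into.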
this is also not generally true! this is the problem with generalizations, there are outliers
in the database world, sure enough you still need latches, but optimistic concurrency tends to have higher throughput because there's less shared metadata
check out something like a "transactional mutex lock" for a nice example of optimistic concurrency
it is often cheaper to lock & unlock a mutex (2 cheap atomic ops when uncontended)
but with a 0.00001% chance that it will lock, and yield, and be blocked by another thread on MS Windows waiting for some file to close for a couple seconds
I don't think lock-free data structures come with a GC. Does anyone know what that is about?
To support deletion in a granular lock-free data structure, you need some way to ensure that any element you're freeing is not currently "seen" by another thread – e.g., when you unlink a node in a lock-free linked list, there might be other pre-empted threads traversing the list that currently hold a reference to that node, and freeing it would result in a use-after-free when the thread wakes up.
All common solutions for manually-managed memory that I'm aware of involve re-implementing something similar to a local GC to detect when all other threads leave the removed node, so that it's safe to free it, two common algorithms are RCU and hazard pointers.
There's some gray area around what exactly you count as a GC, but the "incinerator" described here is at least GC-adjacent: https://docs.rs/lockfree/latest/lockfree/
the problem is that you cannot do lock free reclamation with just plain refcounting. check out python's free threaded pep for a longer worked example of why. you can do things with a tracing gc though, and it doesn't need to be stop the world.
for less classical-gc-looking things: you can do it with some form of hazard pointers (a live list of pointers you trace... huh), or epoch-based collection (a reference count of active threads)
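to make "retire now, free later" concrete, a minimal sketch of epoch-based reclamation with the crossbeam-epoch crate:

```rust
use crossbeam_epoch::{self as epoch, Atomic, Owned};
use std::sync::atomic::Ordering;

fn main() {
    let slot: Atomic<u64> = Atomic::new(1);

    // Pin this thread into the current epoch; while pinned, nothing
    // retired in this epoch can be physically freed out from under us.
    let guard = epoch::pin();

    // Swap in a new value. The old node can't be freed immediately:
    // other pinned threads may still be reading it.
    let old = slot.swap(Owned::new(2), Ordering::AcqRel, &guard);

    // Retire it instead; destruction runs once every thread pinned in
    // this epoch has unpinned. This deferral is the "GC-ish" part.
    unsafe { guard.defer_destroy(old) };

    // (toy example: the value left in `slot` is simply leaked here)
}
```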
PubGrub resolver. uv uses the PubGrub algorithm, originally from Dart’s pub package manager. pip uses a backtracking resolver. PubGrub is faster at finding solutions and better at explaining failures. It’s an algorithm choice, not a language choice.
I was curious so I followed the link to PubGrub; it's also a backtracking resolver. (Which is not particularly surprising, given that it is usually not possible to guess which versions to pick without exploring the search space.) On first skim, it seems to be an adaptation of CDCL solving to package version constraints, with some built-in heuristics on which versions to prefer when several solutions exist.
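To show the shape of that search (minus PubGrub's conflict-driven learning and failure explanations), here's a toy backtracking resolver over made-up packages and version numbers:

```rust
use std::collections::HashMap;

// Toy model: candidate versions per package, plus the dependencies of
// each (package, version) pair as (package, allowed versions). All
// names and numbers here are invented for illustration.
type Deps = HashMap<(&'static str, u32), Vec<(&'static str, Vec<u32>)>>;

fn solve(
    candidates: &HashMap<&'static str, Vec<u32>>,
    deps: &Deps,
    goals: &[(&'static str, Vec<u32>)],
    chosen: &mut HashMap<&'static str, u32>,
) -> bool {
    let Some((goal, rest)) = goals.split_first() else {
        return true; // every goal satisfied: we have a solution
    };
    let (pkg, allowed) = (goal.0, &goal.1);
    if let Some(&v) = chosen.get(pkg) {
        // already pinned: either consistent with this goal, or a conflict
        return allowed.contains(&v) && solve(candidates, deps, rest, chosen);
    }
    for &v in candidates[pkg].iter().rev() {
        // try newest-first, like most resolvers
        if !allowed.contains(&v) {
            continue;
        }
        chosen.insert(pkg, v);
        let mut new_goals = deps.get(&(pkg, v)).cloned().unwrap_or_default();
        new_goals.extend_from_slice(rest);
        if solve(candidates, deps, &new_goals, chosen) {
            return true;
        }
        chosen.remove(pkg); // conflict downstream: backtrack, try older
    }
    false
}

fn main() {
    let candidates = HashMap::from([("a", vec![1, 2]), ("b", vec![1])]);
    // b 1 requires a 1, so the resolver must back off its first pick of a 2.
    let deps: Deps = HashMap::from([(("b", 1), vec![("a", vec![1])])]);
    let goals = [("a", vec![1, 2]), ("b", vec![1])];
    let mut chosen = HashMap::new();
    assert!(solve(&candidates, &deps, &goals, &mut chosen));
    println!("{chosen:?}"); // {"a": 1, "b": 1}
}
```

What CDCL-style solving adds on top is remembering *why* a combination failed as a learned incompatibility, so the search never revisits that part of the space and the final error can cite the chain of conflicts.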
What uv drops
I find much of this section suspicious. Surely, on the order of magnitude uv is operating at, parsing a couple of config files on startup, or checking for an old format that almost certainly won't be there, isn't going to break the bank?
I don’t doubt that omitting those features is a sensible engineering decision, but I don’t think there’ll be a meaningful performance benefit to doing so.
Could you write uv in Python and get almost the same speed? I don't think so. Unless you have a kernel of low-level implemented functions you call from Python, you're always going to get interpreter overhead. And if you write the parts that need to be fast in Rust or (shudder) C, why add the complexity of FFI? At that point, it's just as easy to write it in Rust wholesale.