We don’t need no virtualization
22 points by Mordo
some thoughts from someone who is passionate about capability systems and software isolation:
virtualization and containerization are mainly orthogonal concepts, and their use cases may or may not overlap depending on what you’re doing. containers are generally used today as a mechanism not for sandboxing/isolation, but for software distribution. to hammer this point in, you have to do extra work to secure your containers! (see things like Google’s distroless images, which very few projects use.) they are not a secure sandboxing environment and have escape mechanisms (if you mount e.g. a docker socket into a container that runs as root, your host can be completely compromised).
virtualization, on the other hand, is primarily about software isolation and resource partitioning, allowing you to securely run applications which need root/admin access without worrying about them compromising the host machine (minus CVEs and the like, ofc) or running the whole system out of RAM/disk space. but you can use virtualization as a mechanism for distributing software as well; it’s just more annoying. (for example, GitHub Enterprise is distributed as an OVF and expected to be run inside of a virtual machine).
more often than not, both of these concepts are used together. if you have a VPS (with emphasis on the P! containers don’t let you do things like AMD’s SEV), you’ll likely use a container runtime like docker or podman to run software on the VPS; your provider is in no way going to let you run an unknown container on their host machines, because that opens up way too many security risks!
so yes, we do need virtualization. operating system isolation is an extremely important component of modern computing, and containers, while useful, do not provide anywhere near the same guarantees as virtualization does.
now onto the topic of capability systems and language-native capability systems:
I think both of these are really where the vast majority of new system software should be heading, but due to historical reasons, they struggle to reach mainstream computing in an effective way. the primary issue with this is made of two independent parts: first, mainstream operating systems don’t enforce capability-based security, so a capability-safe language has nothing underneath it to lean on; second, language-level capabilities can be escaped, e.g. via an arbitrary code execution bug or FFI into a non-capability-based language.
combine both of these together, and proper capability-based security doesn’t really get you much in practice. if your application is written in a capability-based language but has e.g. an arbitrary code execution vulnerability or allows for FFI to a non-capability-based language AND if the operating system doesn’t enforce capabilities, it’s more or less a convenience for the software author and not an actual security mechanism.
now, if the operating system enforces capability-based security, this is where things get promising! it becomes much more difficult (and ideally impossible) to have privilege escalation vulnerabilities that allow for access to arbitrary parts of the system. you can then combine this with a capability-based security language to have actual guarantees about what kind of resources the application has access to, which is amazing!
does this negate the need for virtualization though? nope :) still necessary to run multiple guest OSs on a single host (which may not have capability-based security!). it also doesn’t entirely negate containers either, since as I mentioned earlier, they’re more useful for distributing software with complex dependency relationships, but capabilities allow you to provide more security guarantees than the current systems we have in place IMO.
bit of a wall of text, but hopefully it provides some interesting perspective :)
they are not a secure sandboxing environment and have escape mechanisms
You can configure docker and friends to properly sandbox OCI containers, but that’s still not the default for some reason. Worth also noting that LXC containers default to rootless, and the LXC project considers them a security boundary when running as such.
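As a rough sketch of what “properly sandbox” can mean in practice (using the docker Python SDK purely for illustration; the image name and limits here are made up, not from this thread):

```python
# hypothetical illustration with the docker Python SDK (docker-py):
# running an untrusted image with most of the default footguns disabled
import docker

client = docker.from_env()

output = client.containers.run(
    "alpine:3.20",                       # hypothetical image
    "id",
    user="65534:65534",                  # don't run as root inside the container
    cap_drop=["ALL"],                    # drop all Linux capabilities
    security_opt=["no-new-privileges"],  # block setuid-style privilege gain
    read_only=True,                      # read-only root filesystem
    network_mode="none",                 # no network access
    mem_limit="128m",
    pids_limit=64,
    remove=True,
)
print(output.decode())
```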
(if you mount e.g. a docker socket into a container that runs as root, your host can be completely compromised).
Just mounting the socket in a way that allows it to be used is enough - software in the container can just spin up a root container that mounts the root of the host file system in that case.
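A minimal sketch of that escape, assuming the host’s /var/run/docker.sock has been mounted into the container, and using the docker Python SDK only for illustration (any client that can talk to the socket works the same way):

```python
# rough sketch of the escape described above; runs *inside* the container
import docker

# talk to the host's daemon through the mounted socket
client = docker.DockerClient(base_url="unix://var/run/docker.sock")

# ask the host to start a fresh, privileged container that bind-mounts
# the host's root filesystem at /host
output = client.containers.run(
    "alpine:3.20",                       # any image the host can provide
    "cat /host/etc/shadow",              # read a host-only file as proof
    volumes={"/": {"bind": "/host", "mode": "rw"}},
    privileged=True,
    remove=True,
)
print(output.decode())                   # at this point the host is compromised
```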
I think for software distribution they are even worse, and things like distroless containers and the many similar projects are a huge sign of that.
What people want for distribution is evidently something they can easily pull and something they can start with environment variables. There is such a thing: it’s called a static binary. But it doesn’t even have to be that. Look at the JVM world. There’s that thing called a fat jar, which is basically the same idea. Since Deno was mentioned: you even get an executable out of that.
In my experience, which is admittedly limited, the problem case here is Python, which somehow always turns out to be a mess. And I certainly don’t know enough about it to know if it’s Python per se, or some package-management-related thing, or just bad packages. Anyway, it taught me not to come close to it. Still, whenever people give up after trying, it’s because of Python. Meanwhile in JVM land (which I certainly wouldn’t call myself an expert on) you just replace the docker container target with a JVM, and you can even still run it on Nomad.
Just instead of running a Docker registry, which is a mess in and of itself, you can distribute it in whatever way you want.
And an archive for binaries that is largely indistinguishable from using generic containers wouldn’t be hard. With WASI it’s even easier.
By “not that hard” I mean there are no huge unsolvable or even unsolved problems there. Repositories for binaries aren’t exactly new.
But then you end up realizing it is easier to just run your binary yourself and put it there by whatever means you are already using: some CI, some deployment tool, some image you use to populate servers, etc.
I think the main benefits of containers for deployment were side effects. It forced people to make things straightforward to configure, and it made people be clear about how and where state is stored. It established environment variables as a basic tool for configuration (be that good or bad, but it’s the standardization that matters).
None of these are really tied to containers, but it gave sysadmins a way of saying “no, you can’t do that” instead of creating a bizarre, unmaintainable hack because the dev said “works on my machine” and the manager, seeing that, asked the sysadmin/DevOps engineer/SRE to just replicate it.
now, if the operating system enforces capability-based security, this is where things get promising! it becomes much more difficult (and ideally impossible) to have privilege escalation vulnerabilities that allow for access to arbitrary parts of the system. you can then combine this with a capability-based security language to have actual guarantees about what kind of resources the application has access to, which is amazing!
What do you think of the new Landlock capabilities in Linux? A step in the right direction?
Landlock is very definitely not in the capability-based security aisle. It’s classic policy-based security. You still access filesystem objects via their usual, “global” paths. Landlock just permits or blocks these accesses in a fine-grained way.
A bunch of stuff Linux already has is much more useful for enforcing capability-based security: Using namespaces, you can remove the shared global view of resources and with it “ambient authority”, which is an important prerequisite. And then instead of accessing resources using globally shared and policy-riddled identifiers like “/var/lib/foo/server.sock”, you pass around file descriptors that act as capabilities.
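A minimal sketch of that file-descriptors-as-capabilities pattern (plain Python, just for illustration; a capability OS would actually enforce this, here it’s only openat()-style discipline):

```python
import os

def worker(dir_cap: int, name: str) -> bytes:
    # resolves `name` relative to the handed-in directory fd (openat-style),
    # so the worker never reaches for a global path like "/var/lib/foo/...".
    # (on plain Linux a hostile worker could still pass an absolute path;
    # openat2 with RESOLVE_BENEATH or a mount namespace closes that hole)
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_cap)
    try:
        return os.read(fd, 4096)
    finally:
        os.close(fd)

# the launcher decides exactly which directory to delegate
data_dir = os.open("/var/lib/foo", os.O_RDONLY | os.O_DIRECTORY)
print(worker(data_dir, "server.conf"))   # hypothetical file name
```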
If you can read memory from a sandboxed JavaScript runtime, like in the SLAP demo, I doubt you can do much to prevent interaction between software running in the same address space. Can true isolation even be achieved with virtualization when the VMs are scheduled on the same NUMA node? I’d still not offer to run a random workload next to anything I care about.
SLAP is a vulnerability in a specific CPU. If that kind of thing disqualifies entire families of security concepts, we might as well look at all the QEMU CVEs and discard any ideas based on VMs whatsoever.
There are many others, on all architectures, plus DDR: https://en.m.wikipedia.org/wiki/Transient_execution_CPU_vulnerability.
Personally, I was also discarding VMs as a sufficient safeguard for things like key material or untrustworthy code.
I don’t disagree with defense in depth, but I also wouldn’t say that better defense at the language layer means we don’t need virtualization, as the article suggests.
I’ve been thinking about this lately, but started from a different place: WASM on the server is an abomination, at least a 2x performance hit, a specification that keeps growing (SIMD, WASI, GC, etc). I agree with the primary point though: how can we ensure that a given piece of code can be somehow isolated from the system, even when considering bugs or even hostile code?
My idea would be to ship native code (x86/aarch/..) in an ELF, perform static analysis (syscalls are not allowed), and provide the “runtime” via a shared library, mimicking “libc”.
I’d do static analysis with something like LFI: https://github.com/zyedidia/lfi
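For a feel of what the simplest possible version of that check looks like, here is a deliberately naive sketch using pyelftools and x86-64 only (a real verifier like LFI also has to handle overlapping instruction decodings, indirect jumps, and runtime code generation):

```python
# naive "reject raw syscalls" check for x86-64 ELFs, using pyelftools
from elftools.elf.elffile import ELFFile

SYSCALL = b"\x0f\x05"   # x86-64 `syscall` instruction bytes
INT80   = b"\xcd\x80"   # legacy `int 0x80`

def looks_syscall_free(path: str) -> bool:
    with open(path, "rb") as f:
        elf = ELFFile(f)
        text = elf.get_section_by_name(".text")
        if text is None:
            return False
        code = text.data()
        # byte-scan only; a serious tool would disassemble properly
        return SYSCALL not in code and INT80 not in code

if __name__ == "__main__":
    import sys
    print(looks_syscall_free(sys.argv[1]))
```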
If you’re bothered by the WASM overhead, you need to go back and resurrect CloudABI :)
It was basically WASI for native ELF.
What I don’t get is all the breathless hype for WASM. It’s just Java Bytecode redux, which itself was UCSD Pascal redux, which was itself IBM’s 360 redux. Seriously, this stuff has been around since the 60s! Why all the hype now? Just because Javascript is everywhere and no one bothers with history?
System/360 was not a software virtual machine, it’s an instruction set architecture with many different hardware implementations. VM/370 was not a software virtual machine, it was hardware-assisted OS paravirtualization.
P-code was not designed to support secure sandboxing. It was designed to simplify compiler bootstrapping, not to be a low-overhead JIT.
The JVM was not designed to work well as a target for low-level code. It was not designed to run modules written in different languages within but isolated from a host program written in C++.
TenDRA TDF / OSF ANDF were designed for architecture-independent software distribution and install-time LTO. No sandboxing.
Taos was designed to support processes that could migrate across heterogeneous hardware running a message-passing operating system. No sandboxing.
The things that make virtual machines interesting or not are mostly to do with how they interface with the surrounding system: what is the FFI between the VM and the host? where are the process boundaries? where are the security boundaries? how small can the overheads be? how easy is it to enforce security? how easy is it to JIT? how well do real implementations live up to the promises?
What I don’t get is all the breathless hype for WASM. It’s just […]
As someone who is both hyped about WASM and was around for most of the things it recapitulates, I sympathize.
I think humans have both a deep bias against “failed technologies” and a bad case of NIH. When combined, reviving a past approach is essentially impossible.
On the bright side, those of us familiar with the past can better understand the present and predict the future.
Have you looked at PNACL? https://www.chromium.org/nativeclient/pnacl/introduction-to-portable-native-client/
Sounds like you’d be interested in its approach.
I’d forgotten about it, thanks for the link! I’ll check it out
It’s really cool and I kinda wish it had taken off over wasm. Wasm seems to have completely stalled, it’s being pulled in so many different directions, it isn’t meeting the potential people had hoped for re: performance, etc. But… PNACL was doing all of this years ago.
According to the linked page in grandparent, PNaCl uses the Pepper API. My recollection (which could totally be incorrect, it’s been so long) is that the Pepper API was essentially exposing a bunch of Chrome internals and so practically speaking was completely unimplementable in other browsers. I believe this is also why the PPAPI plugin interface, which was also based on Pepper, was also unportable to other browsers.
That’s true, iirc. I believe the Pepper API was used early on for something like the Flash sandbox. My recollection was that PNaCl itself didn’t strictly rely on it though, it was just how it achieved the additional sandboxing layering.
It was always dead in the water. LLVM Bitcode was never intended to be a stable interface.
According to the Chrome design doc, they used a stable subset of the IR and had a translation layer from unstable actual LLVM IR to their stable subset.
Yes, it’s the only sane way to do it. But you still run into the issue that LLVM IR isn’t designed for this use-case. It’s too low level in the wrong ways. WebAssembly is just straight-up better.
That said, I don’t disagree with @insanitybit that post-1.0, Webassembly has been kind of disappointing in various ways. But PNaCL would have been worse.
WASM on the server is an abomination, at least a 2x performance hit
Is it? Honest question, not a performance engineer, and I have no idea one way or the other. But I’ve always found myself skeptical of the claims that WASM is near-native in performance when I consider all the complication of a WASM stack versus how many times I’ve heard about seemingly-tiny things making a large difference in performance.
It’s near native if you compare to something like python.
Going through an interpreter and having no SIMD will destroy performance on compute-heavy workloads. Then, any communication with the host is also quite slow, so having many small calls to/from WASM is also slow.
WASM has many implementations that JIT, including all major web browsers, Wasmtime, and other native implementations, and they all support the WASM SIMD spec. I’m not saying that WASM isn’t slower than native, but the way they get to ~2x is by JITting and having SIMD; the lack of those isn’t the reason it’s not on par.
I like this approach a lot. I’ve been doing something similar to deterministically accelerate RISC-V ELFs on aarch64. The static analysis check could also be made stricter if you assume that users are going to be compiling through gcc or llvm backends, instead of allowing arbitrary hand-written asm.
OP seems to be conflating containerization and virtualization. Containerization is the tech that is normally used for process isolation, and virtualization is for running programs that would not normally run on that specific platform. Of course, a lot of virtualization these days is for process separation, or more efficient use of limited hardware resources.
But containerization itself has a long history of process isolation methods that work variably well. A lot of what the OP wants can be done with a simple chroot, with almost no overhead, or FreeBSD Jails. At the end of the article he mentions Docker, which is another, more advanced form of containerization, which is vastly different from virtualization.
I think your idea of what “virtualization” means is too narrow. Virtual machines are not the only type of virtualization. Containers are commonly referred to as OS-level virtualization. That is, VMs virtualize the hardware interface whereas containers virtualize the OS interface.
A lot of what the OP wants can be done with a simple chroot,
No. I mean, I guess everything can be the same if you just squint hard enough, but chroots solve some aspects of file system isolation, but they do nothing for the isolation of memory, network, and other resources.
The article ends up hinting towards solutions with less (computational, but primarily operational) overhead than either docker, which is mentioned as a negative example, or jails.
A lot of what the OP wants can be done with a simple chroot
I think they want to replace multi-process isolation with in-process isolation. It can be done with scripting languages that support safe sandboxes (Tcl, Lua, Javascript) and/or with capability-secure languages. Modularity in object-capability languages is built out of dependency injection, as mentioned in the article.
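A toy sketch of that dependency-injection shape (ordinary Python, so nothing here is actually enforced the way a capability-secure language would enforce it; it only shows how a module gets exactly the authority it was handed):

```python
class ReadOnlyDir:
    """capability object: read files under one directory, nothing else"""
    def __init__(self, root: str):
        self._root = root

    def read(self, name: str) -> str:
        if "/" in name or ".." in name:   # keep lookups inside root
            raise PermissionError(name)
        with open(f"{self._root}/{name}") as f:
            return f.read()

def plugin(config: ReadOnlyDir) -> str:
    # the plugin can only do what its arguments let it do;
    # no ambient open(), no network, no global filesystem view
    return config.read("plugin.conf")     # hypothetical file

print(plugin(ReadOnlyDir("/etc/myapp")))  # hypothetical path
```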
What a bizarre article. It sort of talks about managed runtimes without talking about them at all.
OP seems to be conflating containerization and virtualization.
Yes: well, the sole mention of containers at all is a single reference to docker, which certainly implies they’ve conflated the two concepts. If that wasn’t there this whole article could have been written in 2013. That reference made me chuckle for different reasons:
Imagine not having to build a Docker image every time you wanted to push a code update.
People actually wanted that! It was one of the big gasp moments when docker was unveiled. Do you realise what people did before that?
Addressing isolation at the language level rather than enforcing it from the OS means that you give full control and responsibility to the developers of an application.
From an enterprise standpoint this is already borderline, and, as part of the ops team, I would never trust a 3rd-party app enough to run it alongside other applications on a bare-metal server.
Virtualization and containerization allow one to force applications to comply with their requirements, rather than blindly trusting them to do what they say. Even with mechanisms like pledge(2) and unveil(2), you cannot trust the application to use them properly. And even when they do, you still have no control over the policies they will set up, which could simply be too wide for your own tastes.
You can run pledge and unveil in a process that you control, then exec the untrusted process into it. You don’t need to trust the application to do it right.
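A rough OpenBSD-only sketch of that supervisor pattern, calling pledge(2)/unveil(2) from Python via ctypes before exec’ing the untrusted binary (the paths, promise strings, and binary name are invented for the example):

```python
import ctypes, os

libc = ctypes.CDLL(None, use_errno=True)

def unveil(path, perms):
    if libc.unveil(path, perms) != 0:
        raise OSError(ctypes.get_errno(), "unveil failed")

def pledge(promises, execpromises):
    if libc.pledge(promises, execpromises) != 0:
        raise OSError(ctypes.get_errno(), "pledge failed")

# expose only the paths the untrusted program should see, then lock the view
unveil(b"/var/lib/foo", b"r")
unveil(b"/tmp/foo", b"rwc")
unveil(None, None)

# promises for *this* process, plus execpromises that apply after execve
pledge(b"stdio rpath exec", b"stdio rpath wpath cpath")

os.execv("/usr/local/bin/untrusted-app", ["untrusted-app"])  # hypothetical binary
```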
Of course, but that would mean running a process supervisor atop your application, which is a sort of containerization mechanism. It would defeat the idea of the author here, which is that applications could isolate themselves at the language level.
The author is arguing for language-level isolation, not application-level. If the language provides isolation guarantees then pledge and unveil could be some of the techniques used to enforce those at runtime.
Haskell (GHC at least) has the SafeHaskell extension, which loosely speaking ensures the types of your program are “true”. This makes the “static analysis” option trivial! But memory and compute aren’t usually/always an effect, so you still need to find a way to limit that.