Your Container Is Not a Sandbox
27 points by jryans
It's silly black-and-white thinking to say containers are not a security boundary. They are. Here is a potential VM escape vulnerability: does that mean VMs are not a security boundary either?
Container security is weaker and the attack surface larger, so they will get broken more often. Much more often. Whether it's a strong enough boundary depends on what you are trying to secure. If you are, e.g., trying to wrap your coding clanker, you are probably worried about it randomly deciding to modify or send data it was not supposed to touch, and the chance it will use a container escape is near zero. A container is going to be good enough for that. If you are trying to secure public infrastructure where anyone can upload arbitrary code at any time, containers are not strong enough and are too risky.
In some sense a Linux process is a security boundary, because the kernel prevents processes from writing to other processes' memory space or performing unauthorized syscalls. A benign (non-malicious) process can be assumed to not bypass these protections.
In another, more important, sense a Linux process is not a security boundary, because the Linux kernel is assumed to have undiscovered LPEs (local privilege escalations) that allow arbitrary processes to become root and/or execute code in kernel space. A security boundary is only needed when you've got malicious code on one side of it, so anything that doesn't protect against malicious code isn't a security boundary.
When someone says "containers are not a security boundary" they're usually speaking from the second position, with the understanding that sandboxing with gVisor / Firecracker / seccomp to read+write+exit / WebAssembly are security boundaries and chroot isn't.
because the Linux kernel is assumed to have undiscovered LPEs
sandboxing with gVisor / Firecracker / seccomp to read+write+exit / WebAssembly are security boundaries
Genuine question: couldn't any of those tools in the second quote have undiscovered vulnerabilities as well? I'm certainly not saying the odds are equal, just that it still doesn't seem to me like you can draw a black-and-white distinction.
In theory yes, but in practice (1) they are far smaller than the Linux kernel, (2) most of their code is memory-safe (Go and Rust), and (3) they're written with the explicit goal of being security boundaries.
That's not to say that it's impossible -- there are historical cases where (for example) a WebAssembly implementation that is particularly aggressive with JIT had a sandbox-escape bug -- but the odds are overwhelmingly in favor of the hardened sandbox vs the written-in-unhardened-C 30-year-old 40-million-line rapidly-evolving Linux kernel.
At some point the author mentions that Claude Code was able to work around bubblewrap and escape. First, it was asked to do so; second, Claude was not running inside bubblewrap, but was using it to implement some sandboxing, so it simply decided to stop using it. The sandboxing model of Claude Code and similar agents is just clunky.
The article is a bit self-aggrandizing. MicroVMs are not new, and they come with their own set of problems (sharing the filesystem or the network).
I agree that it depends on your use case. Before you can say something is a "security boundary", you need to know your threat model. Are you running VMs for untrusted users on shared hardware? Then you need to lock things down hard.
But the threat model of agentic coding is usually something like "trying to keep Opus or a local Qwen model in a box." Which something like Linux user namespaces are actually pretty good at. Under normal circumstances, Opus and Qwen aren't trying to escape. And if they did try to escape, they'd mostly fail. Mostly what you're protecting against is an agent that has gotten a bad idea in its head and started to poke where it shouldn't. Realistically, the worst case scenario is probably an agent that read something online and got prompt-injected with very specific instructions.
(This might be different with Claude Mythos, which allegedly sounds like it might be able to research its own zero-day and punch right out.)
These days, I'm more concerned by the threat model of npm install and pip install, thanks to all the supply chain compromises. Multiple prominent packages have been compromised by sophisticated malware. This threat can be mitigated in a number of ways, but the tooling is mostly horribly immature.
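For what it's worth, a couple of stopgaps can be sketched like this (the flags below are real npm/pip options, but this is only a partial mitigation, and the guards make it a no-op where the tools or files are absent):

```shell
#!/bin/sh
# Sketch of two partial supply-chain mitigations.

if command -v npm >/dev/null 2>&1 && [ -f package.json ]; then
  # Refuse to run packages' install/postinstall scripts,
  # where most npm supply-chain payloads execute.
  npm install --ignore-scripts
fi

if command -v pip >/dev/null 2>&1 && [ -f requirements.txt ]; then
  # Only accept artifacts whose hashes match the pinned lockfile.
  pip install --require-hashes -r requirements.txt
fi
echo "supply-chain mitigation sketch done"
```

Neither protects you from a compromised package whose *library code* is malicious, which is why the tooling still feels immature.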
But for agentic coding, there are also good reasons to use tools like user namespaces. Often, the agent is working directly with a human, and needs to share a working environment with the human's editor. This means shared filesystems, shared toolchains, etc. And user namespaces allow you to construct that shared view very precisely. Yes, mostly I think the sandboxes and permission systems built into the agents themselves are terrible. But that's because they make bad tradeoffs.
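As a rough illustration of that precision, here's a minimal bubblewrap sketch; `my-agent` is a placeholder for whatever agent binary you run, and the bind list would need tailoring per toolchain:

```shell
#!/bin/sh
# Sketch: confine an agent with user namespaces via bubblewrap,
# sharing only the current project directory read-write.
sandbox() {
  bwrap \
    --unshare-all --share-net \
    --ro-bind /usr /usr \
    --symlink usr/bin /bin \
    --symlink usr/lib /lib \
    --proc /proc --dev /dev --tmpfs /tmp \
    --bind "$PWD" /work --chdir /work \
    "$@"
}
# sandbox my-agent   # uncomment on a machine with bwrap installed
```

The point is that every path the agent can see is spelled out explicitly, so the human's editor and the agent can share exactly the working tree and nothing else.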
And browser sandboxes also run on top of non-VM primitives, and that's the biggest exposure most users will ever have. So again, this comes down to threat models, and what tools you use to build your sandbox.
(Finally, the list of container escapes in the original article feels weirdly weak. There are build-time escapes, NVIDIA-specific escapes, and a lot of other niche stuff that looks more like configuration issues than serious bugs in the container machinery itself. Which is weird; there are actually plenty of kernel CVEs that allowed you to punch out of Linux containerization mechanisms. So the list feels more like sloppy LLM research than anything else.)
So what is the easiest way to sandbox AIs locally? Apparently libkrun is not yet packaged in Debian, so it isn't as easy as installing that and telling podman to use it.
I’ve found incus to be the easiest. It’s pretty straightforward to mount your current directory inside the container and give it its own docker to run separately from your host docker.
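Something like this, sketched as a shell function (the image and container names are placeholders, taken from the incus docs rather than tested here):

```shell
#!/bin/sh
# Sketch of the incus workflow described above. Wrapped in a function
# so nothing runs unless you call it on a machine with incus installed.
start_agentbox() {
  incus launch images:debian/12 agentbox
  # Mount the current directory into the container at /work.
  incus config device add agentbox work disk source="$PWD" path=/work
  # security.nesting lets the container run its own docker daemon,
  # separate from the host's.
  incus config set agentbox security.nesting=true
  incus exec agentbox -- ls /work
}
# start_agentbox   # uncomment to actually create the container
```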
I developed my own solution for this: virtdev
virtdev key
virtdev iso
virtdev install
virtdev seal
virtdev create myproject
virtdev start myproject
virtdev ssh myproject
virtdev stop myproject
I've been using it every day. It has made it incredibly easy for me to manage and use my local development virtual machines.
Funnily enough, the libkrun project page seems to imply that libkrun VMs are meant to run inside containers, more or less, doesn't it?
I mostly want to prevent the AI accidentally operating outside of its intended directories.
I run agents as a dedicated unprivileged user that I SSH into. I then wrap the agent with https://github.com/Zouuup/landrun (to limit FS, per session), and systemd-run so that it cleans up properly, and to limit machine resources usage.
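Roughly like this (the systemd-run flags are real; the landrun flags are my best reading of its README and may differ by version; `my-agent` is a placeholder):

```shell
#!/bin/sh
# Sketch: wrap an agent in a transient systemd scope for cleanup and
# resource limits, with landrun (Landlock) restricting the filesystem.
run_agent() {
  systemd-run --user --scope --collect \
    -p MemoryMax=4G -p CPUQuota=200% \
    landrun --rox /usr --ro /etc --rw "$PWD" "$@"
}
# run_agent my-agent   # uncomment on a machine with landrun installed
```

`--collect` makes systemd garbage-collect the scope even if the agent fails, which is what handles the cleanup.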
I recently converted my entire homelab over to a single, colocated machine in a datacenter. Having a giant zfs pool with sensitive content on the root host meant running public facing web services in containers as a security boundary was an absolute no-go. Proxmox being both IaC and microvm unfriendly meant I had to rethink my setup.
I settled on running NixOS on the host and every service in its own cloud-hypervisor microVM. Even docker containers get their own microVM, as the article discusses. NixOS + microvm.nix make this trivial to set up.
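A minimal sketch of what one such VM looks like with microvm.nix (option names per its docs; the dataset path and service module are placeholders):

```nix
# Hedged sketch of a microvm.nix host definition.
{
  microvm.vms.webservice = {
    config = {
      microvm = {
        hypervisor = "cloud-hypervisor";
        mem = 1024;
        vcpu = 2;
        # Share one dedicated dataset into the VM over virtiofs.
        shares = [{
          proto = "virtiofs";
          tag = "data";
          source = "/tank/webservice";  # placeholder ZFS dataset mountpoint
          mountPoint = "/data";
        }];
      };
      # ...the service's own NixOS config goes here...
    };
  };
}
```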
It has some downsides, e.g. deploys are crazy slow and use a lot of memory. But I would 100% recommend this setup. It’s GREAT.
Having a giant zfs pool with sensitive content on the root host meant running public facing web services in containers as a security boundary was an absolute no-go.
I imagine that you don’t trust e.g. filesystem ACLs (isolating the sensitive content to service by having distinct users), or that it’s not possible with the combination of services you want?
(I ask because NixOS heavily sandboxes a lot of systemd services using its (systemd’s) native sandboxing primitives, but it's more akin to a container in that regard due to just using namespaces… but still, this sandboxing with the right path bindings + e.g. FS ACLs to boot is probably a fine choice too.)
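For reference, the NixOS module options there render down to ordinary systemd directives, something like this (all real directives; the service name and paths are placeholders):

```ini
[Service]
DynamicUser=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
PrivateDevices=yes
NoNewPrivileges=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# Only this path is writable; pair with filesystem ACLs for sensitive data.
ReadWritePaths=/srv/mysvc
```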
Also: there are ways to deploy each microvm individually as opposed to doing it all as part of your host config [0]; is there a reason you didn’t go that route?
[0]: https://microvm-nix.github.io/microvm.nix/ssh-deploy.html is one way, though there are others
(Sorry, this comment turned into a bit of a ramble… but hopefully you can make some sense out of what I mean. 😅)
I imagine that you don’t trust e.g. filesystem ACLs (isolating the sensitive content to service by having distinct users), or that it’s not possible with the combination of services you want?
It's just a question of how hard it is to get right. With microVMs I can virtiofs-share a ZFS dataset dedicated to that VM and be confident I didn't get some setting wrong and open up more of my system to the service. I want to be a little cavalier about what services I run, not vet every single service and its track record. Meanwhile, as an example, with Docker it's trivial to get tripped up by its firewall rules and accidentally open something up to the world.
So it's not a question of "a microvm can be locked down much harder than a container" but rather "in my casual and novice use of these technologies I'm only willing to use the one where the security boundary is extremely straightforward and easy to get right, out of the box"
Also: there are ways to deploy each microvm individually as opposed to doing it all as part of your host config [0], is there a reason you didn’t go that route?
I have a slight preference for the full-declarative setup. I don't think it's slower, if that's why you mention it. I already use deploy-rs (and use it to deploy the microvm host) so deploying multiple systems is trivial for me. Let me know if you know something I don't!
I have a slight preference for the full-declarative setup. I don't think it's slower, if that's why you mention it.
You mean the setup where you define all microVMs alongside the host, right? That naturally has to be slower and use a lot of memory (as you already said in your original comment) because of all the distinct NixOS system evals every time you make an update to even one of the VMs.
This feels more like an indictment of Linux’s security posture/architecture than containers.
Agreed. I always wondered if Zones or Jails-based containers were more secure (or theoretically easier to make secure), simply because there are fewer moving pieces than the hodgepodge of Linux technologies that "containers" on Linux are.