Podman rootless containers and the Copy Fail exploit

31 points by ggpsv

dgl

This focuses too much on the released exploit rather than the primitive the vulnerability enables. Because the vulnerability allows writing to the page cache, regardless of whether it is read-only, it is possible for a malicious container to tamper with pages belonging to a file in a base image in overlayfs, which, depending on how the containers are deployed, could cross to other containers. (In the rootless setup here, it would be other containers running as the same user on the host system.)

An alternative exploit would be to run (or find) a container based on a base image that is known to be in use, tamper with the page cache in that container and then make another container (which shares the runtime and therefore overlayfs data) run that code.

While I think rootless and user namespaces are important, they really don't help here. The copy.fail site mentions that in a container it is possible to use seccomp to block the system call socket(AF_ALG, ...); that's the thing to consider in containers here.
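
As a rough illustration of that seccomp suggestion, here is a minimal sketch in Python that emits a deny rule for socket(AF_ALG, ...) in the JSON profile format that Podman and Docker accept via --security-opt seccomp=<file>. The helper names are made up for the example, and 38 is the Linux value of AF_ALG. The default-allow wrapper is only there to keep the sketch short: a real deployment would more likely merge the rule into the runtime's default profile, and how it interacts with the socket rules already present there is not covered here.

```python
import json

AF_ALG = 38  # Linux address family number for the kernel crypto API


def deny_af_alg_rule():
    """Seccomp rule: fail socket() when its first argument equals AF_ALG."""
    return {
        "names": ["socket"],
        "action": "SCMP_ACT_ERRNO",
        "args": [
            # index 0 is the first syscall argument, i.e. the address family
            {"index": 0, "value": AF_ALG, "op": "SCMP_CMP_EQ"}
        ],
    }


def minimal_profile():
    """Default-allow profile carrying the single deny rule (illustration only)."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [deny_af_alg_rule()],
    }


if __name__ == "__main__":
    # Print the profile so it can be redirected into a file.
    print(json.dumps(minimal_profile(), indent=2))
```

Saved under a hypothetical name such as deny-af-alg.json, the output could then be passed as podman run --security-opt seccomp=deny-af-alg.json. Note that a default-allow profile is weaker than the runtime's stock profile in every other respect, so this is a demonstration of the rule shape rather than a recommended policy.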

ggpsv

Thanks for the insight, I had not given further thought to the underlying primitive.

I was certainly more concerned about summarizing my understanding of namespaces and capabilities as afforded by rootless containers in order to evaluate the exposure of a compromised container.

it is possible for a malicious container to tamper with pages belonging to a file in a base image in overlayfs, which, depending on how the containers are deployed, could cross to other containers.

Can you expand on what you mean by "depending on how the containers are deployed"?

(In the rootless setup here, it would be other containers running as the same user on the host system).

The nice thing about rootless Podman is that, depending on your workloads, you don't need to run containers as the same user on the host. I guess you're alluding more to the scenario where I am running multiple rootless containers as my main workstation user? In that case yes, I agree!

If, however, you're running on a server, you can keep each container under a separate user. In fact, you could even run the same container image using different unprivileged users. This is quite unlike Docker, which defaults to running most things as root. I do mention at the end that this is not the ultimate security boundary, though; whether using rootless containers spread across multiple unprivileged users is appropriate will depend on your use case. In my case I do use VMs to separate certain workloads.

While I think rootless and user namespaces are important, they really don't help here.

Can you expand on how they're not helping here? Do you mean in preventing the exploit?

I didn't get into seccomp specifically as I have yet to use an explicit seccomp policy with my containers. This is a good nudge however to explore that further!
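
One low-effort way to see whether a given container can reach that interface at all is to try opening an AF_ALG socket from inside it. A minimal sketch, assuming Python is available in the container; the function name and the result strings are invented for the example:

```python
import errno
import socket


def probe_af_alg():
    """Try to open an AF_ALG socket and report what happened."""
    family = getattr(socket, "AF_ALG", None)
    if family is None:
        return "unsupported"  # this Python build/platform has no AF_ALG
    try:
        s = socket.socket(family, socket.SOCK_SEQPACKET, 0)
    except PermissionError:
        return "blocked"  # EPERM/EACCES, e.g. from a seccomp rule
    except OSError as exc:
        if exc.errno == errno.EAFNOSUPPORT:
            return "unavailable"  # kernel does not offer this address family
        return "error"
    s.close()
    return "allowed"


if __name__ == "__main__":
    print(probe_af_alg())
```

An "allowed" result only means the socket call is reachable from inside the container, not that the kernel underneath is vulnerable. A "blocked" result after applying a seccomp policy is a quick sanity check that the policy is actually in effect.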
- hoistbypetard
  
  Leaving aside GP's comments about the specific primitive copy.fail enables, which match my understanding of this particular vulnerability, I really liked your article as someone who has been experimenting with podman as a background task. I have explored quadlets, but have not yet moved on to building containers that specifically take advantage of podman features that docker doesn't really expose. (Or that I don't think it exposes...)
  
  Your piece was well-written and gave me some good ideas as I try to go that direction.
  - ggpsv
    
    Thank you. Having been in your position some time ago, I sought to distill what I've learned so as to "demystify" what's going on beneath the hood.
- projectgus
  
  I'm a fan of Podman and rootless containers, but CopyFail led me to the same conclusion as the sibling comment. In fact it underlined the classic advice that containers aren't a security boundary, even with all the extra access control benefits of podman+rootless[*], it just takes one kernel exploit to cut through it all. :/
  
  I'm only a hobby sysadmin, but noticed a New New Thing in this space is using the libkrun backend for crun with podman. The promise is that you can treat most containerised workloads exactly the same, but behind the scenes they are now running in a MicroVM (separate guest kernel) without additional changes needed. Although I've no idea how mature, battle tested, or security audited this stuff is, parts of it seem pretty bleeding edge (and MicroVMs are being enthusiastically embraced for LLM coding tools, so it might stay that way indefinitely...)
  
  I always thought podman machine looked like another promising approach for this, but unfortunately it was only ever intended for developer workstations (max one container-running VM per host system).
  
  [*] FWIW I think this quote is still too simplistic: containers are clearly a security boundary, but perhaps not as strong as we'd like to believe.
  - david_chisnall
    
    Most cloud container deployments use VMs for the same reason: the VM is a defensible boundary. For local deployments, this line is a bit more blurry. There's nothing about VMs, from a hardware perspective, that makes them more secure than processes, but the boundary is more defensible for three reasons:
    
    VM exits are less common than system calls, so you can afford to do more side-channel mitigation without it hurting performance.
    
    VM interfaces to the host are much simpler. Block devices have a simple read-write interface for blocks, network devices send and receive network frames. The setsockopt calls that Linux or *BSD support on sockets are a much larger attack surface than most emulated or paravirtualised drivers, and are themselves a tiny fraction of the kernel's attack surface.
    
    VM interfaces tend to have a lot less state. There are in-flight transactions on the rings in a request-response model, but very little else. Things like credentials, UIDs, GIDs, file descriptor tables, and so on all introduce stateful complexity into kernels that may be exploitable by processes if they contain bugs.
    
    The difficulty with the workstation variants is that you end up reintroducing some of this complexity. For example, the container base layers may be exposed as block devices containing immutable filesystems, but volumes and shared folders are probably mounted using 9pfs or VirtIO-FS (9p or FUSE over VirtIO). And that's now a bigger attack surface.
    
    If you're lucky, you need a chain of exploits. I'm more familiar with the FreeBSD versions of this, but they typically use Capsicum to sandbox the things that provide the paravirtualised / emulated devices, so first you need to compromise the host process, then compromise the kernel to get access to things the VM didn't have access to. But if you don't do that additional sandboxing, then you're back in a world where a container escape can do everything you can do, which isn't much better than a root compromise on a desktop.
  - 7tehdt3cnw6kir6o
    
    Although I've no idea how mature, battle tested, or security audited this stuff is, parts of it seem pretty bleeding edge
    
    Kata containers/Firecracker have been a thing for a while now, and have had researchers looking at them, I would consider them reasonably mature.
    
    I'm personally partial to gvisor; it's not a vmm runtime but it's also been around for a few years, is in use by companies such as Tencent, and works nicely for me since I already run all my containers in Proxmox VMs.
    
    Another thing I've been testing is syd-oci, which seems to have flown under the radar a bit compared to the default recommendations of microvms/gvisor.
    
    projectgus
    
    Kata containers/Firecracker have been a thing for a while now, and have had researchers looking at them, I would consider them reasonably mature.
    
    Yeah, for sure. I guess I meant specifically the libkrun+crun integration into podman rather than MicroVMs as a concept. I have a lot of time invested in setting up quadlets now, so I'm looking for options I can apply there. :)
    
    Thanks for the link to syd-oci, I hadn't heard of that one.
    
    7tehdt3cnw6kir6o
    
    Ah I missed that Kata doesn't support podman for some reason, in that case libkrun is nice. Gvisor works decently with quadlets ime, you just need to disable SELinux in the .container file and add --runtime runsc to PodmanArgs. If you're rootless you'll also have to sacrifice network sandboxing. Quadlets in general are really nice, they lend themselves well to being managed by ansible.
    
    syd-oci is a bit immature in terms of tooling and docs (the main docs for syd are superb though), but I like how many options such as exec control and egress allowlists it exposes to sandbox containers. Though depending on config it could be vulnerable to copyfail. Also, as vanishingly unlikely as it is, if there's one company that would shoot their golden goose and shut down gvisor, it's Google. Already happened to kaniko which let you build containers inside unprivileged containers.
    
    ggpsv
    
    In fact it underlined the classic advice that containers aren't a security boundary, even with all the extra access control benefits of podman+rootless[*], it just takes one kernel exploit to cut through it all. :/
    
    Well stated, that matches my experience too, and writing this article was an exercise in "coming to terms" with that fact.
    
    Thank you for the libkrun reference, that seems like a promising possibility.
    
    rau
    
    Although I've no idea how mature, battle tested, or security audited this stuff is, parts of it seem pretty bleeding edge (and MicroVMs are being enthusiastically embraced for LLM coding tools, so it might stay that way indefinitely...)
    
    I think that enthusiastic uptake is likely to mature, battle-test and harden micro VMs, and likely to lead to security audits.