the expectation of being able to run docker at will in CI jobs is probably the single worst outcome of free GitHub Actions minutes, because reproducing it in a bring-your-own-compute environment is borderline impossible unless you make every machine single-tenant
@whitequark Don’t all modern CI systems run each job in an ephemeral VM? It’s about the only security boundary that I’d think you could defend against someone able to run arbitrary code these days, unless you lock down the environment so much that CI can’t do things it needs to do.
@david_chisnall Forgejo Actions runners offer you a choice of "Docker", "Podman", "LXC", and "lol rawdog it on the host"
@david_chisnall right now I'm using rootless Podman and I think it's defensible enough that I'm okay offering it to friends (who may still click Approve & Run on a sketchy source, mind), but it won't let cibuildwheel or other Docker-expecting applications run, which is a problem

@whitequark

Yup, if your threat model is ‘friends who aren’t actively attacking your infrastructure and will do at least a bit of checking before they hit approve on PRs from other people’ that’s probably fine. I don’t think I’d trust any of those mechanisms for a high-profile project though, given how often privilege-elevation bugs in the Linux kernel are found.

I wouldn’t have thought any of these were easier to manage than simply booting a VM with a bunch of preinstalled tools and a CoW base image, with the CI job settings exposed via something like QEMU’s fw_cfg or a tiny FS on another virtual device.

@david_chisnall @whitequark the rabbit hole of self-hosted CI is a nightmare; the reality is that there is no fully secure method, because the sandboxing options are all flawed.

If I were to make a professional suggestion, it would be to spin up a new temporary VM for each job on someone else's infrastructure and hope for the best.

Good luck
@Baa @david_chisnall I am very well aware of this and if there was a reasonable way to do this with Forgejo Actions I would've already been doing it

@david_chisnall

  • I'm not using VMs because nested virtualization is awfully slow and I designed the runner system I'm using to run on top of commodity cloud compute
  • even if I were to use VMs anyway (or if I set up the bare metal I'm looking at right now, etc.) this still leaves the problem that forgejo-runner can spawn only a container per build, not a VM per build, meaning malware can persist itself across builds
  @whitequark my experience with podman so far has been that docker's default is to run code inside the container as the container's own root user, while podman's default is to have a UID 1000 inside the container run everything; so 90% of the fixes for Containerfiles and containers I've pulled off the net were just giving it the flag to run as "root" on the inside. the "podman-docker" package gives you a compatible socket at the right path, so the docker CLI and tools that speak to the docker socket directly can be happy, which may take you almost all the way to workflows requiring "docker-in-docker"? I hope at least one of these is new & helpful information to you
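(As a sketch of the socket-compatibility approach described above, assuming a systemd user session; the socket path shown is the rootless default, and whether this is sufficient depends on the workload:)

```shell
# Enable the rootless Podman API socket for the current user
systemctl --user enable --now podman.socket

# Point Docker-speaking tools at the Podman socket
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/podman/podman.sock

# The docker CLI (or anything using the Docker API) now talks to Podman
docker info
```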
    @timotimo this seems pretty much completely irrelevant
    @whitequark I was thinking cibuildwheel should be able to run with that, but a second look led me to the "container-engine" option (or CIBW_CONTAINER_ENGINE) that you can set to "podman". is it not running in practice even though it should work in theory?
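(For reference, a sketch of that setting; `container-engine` under `[tool.cibuildwheel]` is cibuildwheel's documented option, equivalent to setting CIBW_CONTAINER_ENGINE=podman in the job's environment:)

```toml
# pyproject.toml
[tool.cibuildwheel]
container-engine = "podman"
```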
    @timotimo that fails with an inability to find /dev/net/tun and /dev/fuse (the latter might be fixed by fuse-overlayfs; the former I'm not sure about)

    @whitequark I've been able to do this just now:

    podman run --rm -it --security-opt label=disable --user podman quay.io/podman/stable \
      podman run -it --rm --user root registry.fedoraproject.org/fedora-toolbox

    that's an image specifically built to run podman inside podman (or docker, I guess?). I'm running it as a regular user and without --privileged; inside it is a fedora toolbox, and from inside the toolbox itself I was able to curl codeberg.org

    This might be a good place to start from. Not sure what exactly keeps the tun/tap error from happening with this image, however

    @timotimo try running cibuildwheel (this is the actual workload that's been failing; it's not my workflow but a friend's, so I only have limited insight into what it's doing)
    pyreading - Python packages used by Arcalibre (Codeberg.org)
    @whitequark I think I'll have time to look more closely this evening!
    @whitequark @david_chisnall are you using systemd? If so, are you tailoring your security options for the service to be highly restrictive? If not, I might have some good starting places for those options at the office.
    @c0dec0dec0de @david_chisnall I am using systemd but I don't see how this would help, considering the attack surface I'm concerned about is "the kernel" and maybe "Podman", not "the Forgejo Actions runner" (which is the service I'd be configuring)
    @whitequark @david_chisnall minimize blast radius for the process tree running Podman. We’re doing it with the Jenkins agent config at work, though admittedly there’s only so much you can do.
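(A sketch of the kind of service-level restriction being described, assuming a unit named forgejo-runner.service; the directives are standard systemd sandboxing options, but rootless Podman constrains how far you can tighten this:)

```ini
# /etc/systemd/system/forgejo-runner.service.d/hardening.conf (hypothetical drop-in)
[Service]
ProtectSystem=full
PrivateTmp=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
# Note: NoNewPrivileges= and RestrictSUIDSGID= would break rootless Podman,
# which relies on the setuid newuidmap/newgidmap helpers for UID mapping.
```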
    @whitequark @david_chisnall and, yes, we take a rather adversarial mindset with our coworkers. It’s contractually required.
    @c0dec0dec0de @david_chisnall considering that one of the top risks is "one project's workflow compromising another project's releases", this seems like the wrong surface to defend
    @whitequark @david_chisnall it doesn’t slap the whole process tree down unceremoniously on violation (or does it? That would be bad; the Availability leg of the CIA triad is pretty much primary in CI, as you say).

    @c0dec0dec0de @david_chisnall here is how I think this works (ignore my earlier posts, I had some invalid assumptions I've since corrected)

    forgejo-action-runner
    ├─ podman
    │  └─ stuff inside job 1...
    │     └─ malware?
    └─ podman
       └─ stuff inside job 2...

    so let's say the malware breaks out of podman. now it runs with the forgejo-action-runner's permissions, which means that touching job 2's stuff is not a violation of any kind from the kernel's and systemd's perspective

    @c0dec0dec0de @david_chisnall this is why I think the only actual solution to this is VMs of some sort, either commodity cloud runners spawned on demand, or firecracker or something
    Kata Containers - Open Source Container Runtime Software: an open source container runtime building lightweight virtual machines that seamlessly plug into the containers ecosystem.

    @whitequark @c0dec0dec0de

    It's worth noting that most cloud container systems are also isolated VMs. This is partly for software compatibility (the guest can be a shiny new kernel, the host can be a LTS or CIP release), but mostly because cloud providers don't regard anything other than a VM as a defensible boundary.

    Azure did a bunch of things with nested virtualisation, but they've now, I believe, upstreamed something to Linux that exposes a device compatible with KVM that lets one VM delegate pages to another and gives the abstraction of nested virtualisation where the 'child' is a child in the 'administration is delegated to the parent' sense and not in the 'recursive nested paging' sense.

    @david_chisnall @c0dec0dec0de huh, that's really interesting to hear re: Azure.
    @whitequark @david_chisnall that's kinda questionable right from the start

    I don't agree with a bunch of its design choices, but builds.sr.ht just gives you VMs, and that's a lot better both operationally and in terms of what you can do on it.
    @ignaloidas @david_chisnall I have 200 repositories on GitHub, many with GHA jobs. I'm not about to redesign the workflows all of them use
    @whitequark @david_chisnall oh, I agree that for migration from GHA it's not viable

    It's just that I find it surprising that the current easiest migration path has given zero thought to multi-tenancy concerns; I guess because they assumed the costs are too high for anyone to do multi-tenancy
    @ignaloidas @david_chisnall I think it's more because the resources dedicated to implementing it are incredibly scarce and GHA's internals are hell
    @whitequark @david_chisnall is it really that hard to achieve, though? FWIW, last time I looked at it, builds.sr.ht just spun up a VM with qemu and SSH'd into it. it doesn't seem much of a stretch to have a VM image containing the whole actions runner, launch a VM from it for every job, and drive it over SSH or whatever from the host
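(A minimal sketch of that per-job VM pattern, assuming a prebuilt `runner-base.qcow2` image with SSH set up; the job-runner invocation is a placeholder, and qemu's `-snapshot` discards all disk writes when the VM exits:)

```shell
# Boot a throwaway VM from a read-only base image; -snapshot discards writes on exit
qemu-system-x86_64 -enable-kvm -m 4G -smp 4 -snapshot \
  -drive file=runner-base.qcow2,if=virtio \
  -nic user,hostfwd=tcp:127.0.0.1:2222-:22 \
  -display none -daemonize

# Drive one job over SSH, then kill the VM; nothing persists between jobs
ssh -p 2222 runner@127.0.0.1 'run-one-job'   # 'run-one-job' is a hypothetical command
```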
    @ignaloidas @david_chisnall if you wish to backseat configure this: stop. if you wish to actually do something useful here: go and design a service that does this
    @whitequark @david_chisnall tbh I was more wondering on whether there is some technical complexity which made it hard to do or whether it was just lack of time/attention that was the reason. Seems like it's the latter from further replies.

    @ignaloidas @whitequark

    I’d have thought a system using VMs would be less effort to implement because you start with a full-featured environment inside and a default-deny policy outside. But I haven’t looked at their code.

    We set up a cluster of Morello machines that booted from an NFS image (with the local disk available, encrypted with a random key so the next user didn’t see anything you put on it) and ran the GitHub action runner in single-job-accept mode with the next job in the queue to run. Almost all of the complexity came from:

    • Booting the machines, which were dev boards, and which involved talking to the UEFI console over a serial link and providing it with the right sequence of commands.
    • Actually getting the callback from the actions thing on GitHub in such a way that didn’t require opening up inbound ports on the local network (and with GitHub’s weird ad-hoc crypto). We ended up with an Azure Function that put the data in an Azure Message Queue that the cluster head node could poll.
    • Making sure that we could forcibly reboot things if they infinite looped.

    Most of this complexity goes away if you own the dispatcher and you’re not using weird prototype hardware.

    @david_chisnall @ignaloidas this would work fine if I was willing to spend a month or two of my life redesigning the entire Forgejo Actions runner. I think it will probably come to that but I resent the thought so I'm trying to put things together that are almost as good without this unpleasant effort

    @whitequark @ignaloidas

    I’m mostly curious why they seem to have done the thing I’d expect to be harder before the thing I’d expect to be easier. My worry is that it’s because designing for security was not part of their objective (if you don’t care about properly restricting rights and having a defensible security boundary, containers are probably slightly easier to implement), which makes me nervous about the rest of their stack.

    @david_chisnall @ignaloidas oh, you're absolutely correct. Forgejo Actions runner grew out of nektos/act, which was just a way to run a GHA workflow locally (using your local dockerd); then it was integrated into Forgejo essentially without redesign, with "security" being left wholly and entirely to the sysadmin. read this page if you want to see something funny and/or despair-inducing.
    Securing Forgejo Actions Deployments | Forgejo – Beyond coding. We forge.

    @david_chisnall @ignaloidas the fact that Forgejo Actions runner even has the option to "run every job on the host" (where it can trivially steal every credential) is a major failure of process
    @whitequark @david_chisnall @ignaloidas that sounds pretty unproblematic if you're on a small team with a self-hosted forgejo server and just don't bother setting up a Jenkins
    @ratsnakegames @david_chisnall @ignaloidas there are definitely circumstances where it's safe, but I would bet a large amount of money that the majority of Forgejo Actions runner instances are not set up for these circumstances

    @ratsnakegames @david_chisnall @ignaloidas I think complex software like this should offer, as the default, a secure configuration that is sufficient for almost everyone, rather than an environment that is extremely difficult to configure correctly paired with a bunch of opt-outs

    (though now that I think more about it, the host mode is kind of necessary on Windows and macOS that don't have useful container replacements)

    @whitequark @david_chisnall @ignaloidas i agree on useful defaults but i also don't like removing options just because they could do harm if enabled in the wrong circumstances.
    @ratsnakegames @david_chisnall @ignaloidas I do like doing that, provided there's an adequate replacement

    @whitequark @ignaloidas

    Well, that’s horrifying. I really hope that this is scoped to their actions reimplementation and the rest of their system isn’t such an absolute train wreck.

    @david_chisnall @ignaloidas from what I've seen, the rest of Forgejo is ok. not great, arguably not good in places, but definitely ok, enough so that I don't mind running it
    @david_chisnall @ignaloidas git-pages has probably had an order of magnitude more thought put into security than the entire Forgejo Actions implementation top to bottom, which is sad