System Roles support for image mode (bootc) builds

Goal

Image mode, also known as “bootable containers” or “bootc”, is an exciting new way to build and deploy operating systems. A bootable container image can be used to install or upgrade a real or virtual machine, similar to container images for applications. This is currently supported for Red Hat Enterprise Linux 9/10 and Fedora/CentOS, and is also used by other projects like universal-blue.

With system roles being the supported high-level API to set up Fedora/RHEL/CentOS systems, we want to make them compatible with image mode builds. In particular, we need to make them detect the “non-booted” environment and adjust their behaviour to not e.g. try to start systemd units or talk to network services, and defer all of that to the first boot. We also need to add full bootc end-to-end integration tests to ensure this keeps working in the future on all supported platforms.

Build process

This can work in two ways. Both ought to work, and which one you choose depends on your available infrastructure and preferences.

Treat a container build as an Ansible host

Start a container build with e.g.

buildah from --name buildc quay.io/centos-bootc/centos-bootc:stream10

Create an inventory for the buildah connector:

buildc ansible_host=buildc ansible_connection=buildah ansible_become=false ansible_remote_tmp=/tmp

Then run the system-roles playbooks on the “outside” against that inventory.

That matches the spirit of Ansible and is cleaner, as Ansible itself and system-roles do not need to be installed into the container. This is the approach outlined in “Building Container Images with Buildah and Ansible” and “Ansible and Podman Can Play Together Now”, and implemented in the ansible-bender proof of concept (⚠️ Warning: currently unmaintained).
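As a rough sketch, the steps above could be scripted like this. The container name, base image and inventory line are taken from the text; the playbook name `setup.yml`, the inventory file name, the output image name `localhost/my-bootc-image`, and the final `buildah commit` / `buildah rm` steps are assumptions for illustration. It also assumes that the buildah connection plugin is available to the Ansible installation on the build machine.

```shell
#!/bin/bash
# Sketch: configure a bootc container build "from the outside" with Ansible.
# Assumed names (not from the text): setup.yml, inventory.ini,
# localhost/my-bootc-image.

# Write a one-host inventory that talks to the working container
# through the buildah connection plugin.
write_buildah_inventory() {
    local container="$1"
    local inventory="$2"
    printf '%s ansible_host=%s ansible_connection=buildah ansible_become=false ansible_remote_tmp=/tmp\n' \
        "$container" "$container" > "$inventory"
}

# Full flow: start the build container, run the playbook against it,
# then commit the result as an image and drop the working container.
build_image() {
    local container="buildc"
    local inventory="inventory.ini"
    buildah from --name "$container" quay.io/centos-bootc/centos-bootc:stream10
    write_buildah_inventory "$container" "$inventory"
    ansible-playbook -i "$inventory" setup.yml
    buildah commit "$container" localhost/my-bootc-image
    buildah rm "$container"
}
```

Nothing runs until `build_image` is called; the playbook in `setup.yml` would need to target the `buildc` host (or `all`).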

Install Ansible and the system roles into the container

The Containerfile looks roughly like this:

FROM quay.io/centos-bootc/centos-bootc:stream10
RUN dnf -y install ansible-core rhel-system-roles
COPY ./setup.yml .
RUN ansible-playbook setup.yml

Everything happens inside of the image build, and the playbooks run against localhost. This could use a multi-stage build to avoid having Ansible and the roles in the final image. This is entirely self-contained and thus works well in automatic container build pipelines.

⚠️ Warning: Unfortunately this is currently broken for many/most roles because of an Ansible bug (“service: fails in a container build environment”). Once that is fixed, this approach will work well and might often be the preferred choice.
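For orientation, here is a minimal sketch of what the `setup.yml` copied into the image might contain, written out by a small shell helper. The role name `rhel-system-roles.timesync` is only a placeholder default; this does not claim that this particular role already supports image mode builds, and the warning above still applies.

```shell
#!/bin/bash
# Sketch: generate a minimal playbook for the in-container approach.
# The default role name is a placeholder; pick a role that carries the
# containerbuild tag.
write_setup_playbook() {
    local playbook="$1"
    local role="${2:-rhel-system-roles.timesync}"
    cat > "$playbook" <<EOF
---
- name: Configure the image with a system role
  hosts: localhost
  connection: local
  roles:
    - ${role}
EOF
}
```

Running `write_setup_playbook setup.yml` next to the Containerfile produces a playbook that runs a single role against localhost, matching the `COPY ./setup.yml .` and `RUN ansible-playbook setup.yml` lines.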

Status

This effort is tracked in the RHEL-78157 epic. At the time of writing, 15 roles are already supported; the other 22 still need to be updated.

Roles which support image mode builds have the containerbuild tag, which you can see in the Ansible Galaxy view (expand the tag list at the top), or in the source code in meta/main.yml.

Note that some roles also have a container tag, which means that they are tested and supported in a running system container (i.e. a docker/podman container with the /sbin/init entry point, or LXC/nspawn etc.), but not during a non-booted container build.
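To check the tags of a local role checkout, a small helper along these lines may be handy. It is only a sketch: it assumes the tags are written one YAML list item per line in meta/main.yml, and the function name is made up for this example.

```shell
#!/bin/bash
# Sketch: report whether a role checkout lists a given tag in meta/main.yml.
# Assumption: each tag is its own list item line, e.g. "    - containerbuild".
has_galaxy_tag() {
    local role_dir="$1"
    local tag="$2"
    # Match the whole list item, so that looking for "container"
    # does not accidentally match "containerbuild".
    grep -Eq "^[[:space:]]*-[[:space:]]*${tag}[[:space:]]*\$" "$role_dir/meta/main.yml"
}
```

For example, `has_galaxy_tag . containerbuild && echo "already converted"` run from the top of a role's git checkout.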

Steps for converting a role

Helping out with that effort is very much appreciated! If you are interested in making a particular role compatible with image mode builds, please follow these steps:

  1. Clone the role’s upstream git repository. Make sure that its meta/main.yml file does not yet have a containerbuild tag – if it does, the role was already converted. In that case, please update the status in the epic.

  2. Familiarize yourself with the purpose of the role, have a look at README.md, and think about whether running the role in a container generally makes sense. That should be the case for most of them, but e.g. storage is hardware specific and for the most part does not make sense in a container build environment.

  3. Make sure your developer machine can run tests in general. Do the integration test setup and also read the following sections about running QEMU and container tests. E.g. running a QEMU test should work:
    tox -e qemu-ansible-core-2.16 -- --image-name centos-9 --log-level=debug -- tests/tests_default.yml
    
  4. Do an initial run of the default or other test during a bootc container build, to get a first impression:
    LSR_CONTAINER_PROFILE=false LSR_CONTAINER_PRETTY=false tox -e container-ansible-core-2.16 -- --image-name centos-9-bootc tests/tests_default.yml
    
  5. The most common causes of failures are service_facts:, which simply doesn’t work in a container build, and trying to set the state: of a unit in service:. The existing PRs linked from RHEL-78157 have plenty of examples of what to do with these.

    The logging role PR is a good example of the standard approach: add a __rolename_is_booted flag to the role variables, and use it to conditionalize operations and tests which can’t work in a container. E.g. the above service: state: can be fixed with

    state: "{{ 'started' if __rolename_is_booted else omit }}"
    

    service_facts: can be replaced with systemctl is-enabled or similar; see e.g. the corresponding mssql fix or firewall fix.

    Do these “standard recipe” fixes to clear away the easy noise.

  6. Create a branch on your fork, and add a temporary commit to run tests on branch pushes, and another commit to enable tests on container builds and in system containers. With that you can iterate on your branch and get testing feedback without creating a lot of PR noise for other developers on the project. Push to your fork, go to the Actions page, and wait for the first test result.

  7. As described above, the container tag means that the role is supported and works in (booted) system containers. In most cases this is fairly easy to fix, and nice to have, as running tests and iterating is faster, and debugging is also a bit easier. In some cases running in system containers is hard (like in the selinux or podman roles); in that case don’t bother and remove that tag again.

  8. Go through the other failures. You can download the log archive and/or run the individual tests locally. The following command helps for easier debugging – it keeps the container running for inspection after a failure, and removes containers and temp files from the previous run:

    buildah rm --all; rm -rf /tmp/runcontainer.*; LSR_DEBUG=1 LSR_CONTAINER_PROFILE=false LSR_CONTAINER_PRETTY=false tox -e container-ansible-core-2.16 -- --image-name centos-9-bootc tests/tests_default.yml
    

    You can enter the container and debug with buildah run tests_default bash. The container name corresponds to the test name; check buildah ps.

  9. Fix the role and tests until you get a green result. Finally, clean up and sort your commits into “fix: Skip runtime operations in non-systemd environments” and “feat: Support this role in container builds”. Any role specific or more intrusive and self-contained change should be in separate commits before these.

  10. Add an end-to-end integration test which ensures that running the role during a container build actually works as intended in a QEMU deployment. If there is an existing integration test which has representative complexity and calls the role just once (i.e. tests one scenario), you can convert it like sudo’s bootc e2e test. If there is no existing test, you can also add a specific bootc e2e test like in this demo PR or the postgresql role.

  11. To run the bootc e2e test locally, see the tox-lsr docs on image mode testing.

  12. Push the e2e test to your branch, iterate until green.

  13. Send a PR, link it from the Jira epic, get it landed, update the list in the Jira epic again.

  14. Celebrate 🎉 and brag about your contribution!
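To illustrate the idea behind the “is booted” flag from step 5 outside of Ansible, here is a small shell sketch. It is not how the roles implement the flag; it relies on the check documented for sd_booted(3), namely that the directory /run/systemd/system exists when systemd is the running system manager. The function names and the SYSTEMCTL override are made up for this example.

```shell
#!/bin/bash
# Sketch: skip runtime operations when systemd is not running.
# The directory argument exists only to make the check easy to try out;
# it defaults to the path documented for sd_booted(3).
is_booted() {
    [ -d "${1:-/run/systemd/system}" ]
}

# Enable a unit unconditionally, but start it only on a booted system.
# Enabling only creates symlinks, so it can generally happen at build time;
# starting needs a running systemd and is left to the first boot.
# SYSTEMCTL is a hypothetical override, useful for dry runs.
enable_and_maybe_start() {
    local unit="$1"
    local runtime_dir="${2:-/run/systemd/system}"
    ${SYSTEMCTL:-systemctl} enable "$unit"
    if is_booted "$runtime_dir"; then
        ${SYSTEMCTL:-systemctl} start "$unit"
    fi
}
```

With `SYSTEMCTL=echo` the function only prints the systemctl arguments it would use, which makes it easy to compare the behaviour inside a container build and on a booted machine.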