# System Roles support for image mode (bootc) builds
## Goal
Image mode, a.k.a. “bootable containers”, a.k.a. “bootc”, is an exciting new way to build and deploy operating systems. A bootable container image can be used to install or upgrade a real or virtual machine, similar to how container images are used for applications. This is currently supported for Red Hat Enterprise Linux 9/10 and Fedora/CentOS, and is also used in other projects such as universal-blue.
With system roles being the supported high-level API to set up Fedora/RHEL/CentOS systems, we want to make them compatible with image mode builds. In particular, we need to make them detect the “non-booted” environment and adjust their behaviour so that they do not, for example, try to start systemd units or talk to network services, and instead defer all of that to the first boot. We also need to add full bootc end-to-end integration tests to ensure this keeps working in the future on all supported platforms.
## Build process
There are two ways to do this. Both ought to work; which one you choose depends on your available infrastructure and preferences.
### Treat a container build as an Ansible host
Start a container build with e.g.:

```sh
buildah from --name buildc quay.io/centos-bootc/centos-bootc:stream10
```
Create an inventory for the buildah connector:

```ini
buildc ansible_host=buildc ansible_connection=buildah ansible_become=false ansible_remote_tmp=/tmp
```
Then run the system-roles playbooks on the “outside” against that inventory.
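For example, such a run could look like the following rough sketch – `inventory` is assumed to be the file containing the line above, `playbook.yml` and the resulting image name are placeholders, and the buildah connection plugin (e.g. from the containers.podman collection) needs to be available on the controller:

```sh
# Run a system roles playbook against the build container, then commit the
# container to a bootable container image. All names here are placeholders.
ansible-playbook -i inventory playbook.yml
buildah commit buildc my-bootc-image:latest
```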
That matches the spirit of Ansible and is cleaner, as neither Ansible itself nor the system roles need to be installed into the container. This is the approach outlined in “Building Container Images with Buildah and Ansible” and “Ansible and Podman Can Play Together Now”, and implemented in the ansible-bender proof of concept (⚠️ Warning: currently unmaintained).
### Install Ansible and the system roles into the container
The `Containerfile` looks roughly like this:
```dockerfile
FROM quay.io/centos-bootc/centos-bootc:stream10

RUN dnf -y install ansible-core rhel-system-roles
COPY ./setup.yml .
RUN ansible-playbook setup.yml
```
Everything happens inside of the image build, and the playbooks run against `localhost`. This could use a multi-stage build to avoid having Ansible and the roles in the final image. This is entirely self-contained and thus works well in automatic container build pipelines.
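For illustration, a minimal `setup.yml` could look like the sketch below; the chosen role (timesync) and its variables are just an example, not part of the recipe above:

```yaml
# setup.yml -- illustrative sketch; pick whichever system roles and variables you need
- name: Configure the image with system roles
  hosts: localhost
  connection: local
  vars:
    timesync_ntp_servers:
      - hostname: pool.ntp.org
        iburst: true
  roles:
    - redhat.rhel_system_roles.timesync
```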
⚠️ Warning: Unfortunately this is currently broken for many/most roles because of an Ansible bug: `service:` fails in a container build environment. Once that is fixed, this approach will work well and might often be the preferred choice.
## Status
This effort is tracked in the RHEL-78157 epic. At the time of writing, 15 roles are already supported; the other 22 still need to be updated.
Roles which support image mode builds have the `containerbuild` tag, which you can see in the Ansible Galaxy view (expand the tag list at the top), or in the source code in `meta/main.yml`.

Note that some roles also have a `container` tag, which means that they are tested and supported in a running system container (i.e. a docker/podman container with the `/sbin/init` entry point, or LXC/nspawn etc.), but not during a non-booted container build.
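For reference, these tags live under `galaxy_info` in the role’s `meta/main.yml`; the sketch below is illustrative and omits the other metadata fields:

```yaml
# meta/main.yml (illustrative excerpt -- other galaxy_info fields omitted)
galaxy_info:
  galaxy_tags:
    - fedora
    - rhel
    - containerbuild  # supported during non-booted container image builds
    - container       # supported in booted system containers
```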
## Steps for converting a role
Helping out with that effort is very much appreciated! If you are interested in making a particular role compatible with image mode builds, please follow these steps:
- Clone the role’s upstream git repository. Make sure that its `meta/main.yml` file does not yet have a `containerbuild` tag – if it does, the role was already converted. In that case, please update the status in the epic.
- Familiarize yourself with the purpose of the role, have a look at README.md, and think about whether running the role in a container generally makes sense. That should be the case for most of them, but e.g. `storage` is hardware specific and for the most part does not make sense in a container build environment.

- Make sure your developer machine can run tests in general. Do the integration test setup and also read the following sections about running QEMU and container tests. E.g. running a QEMU test should work:

  ```sh
  tox -e qemu-ansible-core-2.16 -- --image-name centos-9 --log-level=debug -- tests/tests_default.yml
  ```
- Do an initial run of the default or another test during a bootc container build, to get a first impression:

  ```sh
  LSR_CONTAINER_PROFILE=false LSR_CONTAINER_PRETTY=false tox -e container-ansible-core-2.16 -- --image-name centos-9-bootc tests/tests_default.yml
  ```
- The most common causes of failures are `service_facts:`, which simply does not work in a container, and trying to set the `state:` of a unit in `service:`. The existing PRs linked from RHEL-78157 have plenty of examples of what to do with these.

  The logging role PR is a good example of the standard approach of adding a `__rolename_is_booted` flag to the role variables, and using that to conditionalize operations and tests which can’t work in a container. E.g. the above `service:` `state: "started"` can be made conditional on that flag, and `service_facts:` can be replaced with `systemctl is-enabled` or similar; see e.g. the corresponding mssql fix or firewall fix.

  Do these “standard recipe” fixes to clear away the easy noise; a rough sketch of the pattern is shown below.
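  The following is only an illustrative sketch of that pattern, not taken from any of the PRs above: the `__myrole_is_booted` detection, the `myservice` unit, and the task layout are made up, and converted roles may detect the non-booted environment differently.

  ```yaml
  # tasks file excerpt -- illustrative sketch only; __myrole_is_booted and
  # myservice are made-up names, not from an actual system role.
  - name: Determine whether the target system is booted with systemd
    set_fact:
      __myrole_is_booted: "{{ ansible_facts['service_mgr'] == 'systemd' }}"

  - name: Enable myservice, and start it only on a booted system
    service:
      name: myservice
      enabled: true
      # During a container build the unit cannot be started; it will start
      # on first boot of the deployed image instead.
      state: "{{ 'started' if __myrole_is_booted else omit }}"

  # Instead of service_facts (which does not work in a container build),
  # query a single unit directly.
  - name: Check whether myservice is enabled
    command: systemctl is-enabled myservice
    register: __myrole_service_enabled
    changed_when: false
    failed_when: false
  ```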
- Create a branch on your fork, and add a temporary commit to run tests on branch pushes, and another commit to enable tests on container builds and in system containers. With that you can iterate on your branch and get testing feedback without creating a lot of PR noise for other developers on the project. Push to your fork, go to the Actions page, and wait for the first test result.
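  As an illustration of the first temporary commit, a GitHub Actions trigger for branch pushes generally looks like the sketch below; the actual workflow files and their names differ per role repository, and the branch name is made up:

  ```yaml
  # Illustrative excerpt of a test workflow's trigger section
  on:
    pull_request:
    push:
      branches:
        - bootc-conversion   # made-up name of your working branch
  ```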
- As described above, the `container` tag means that the role is supported and works in (booted) system containers. In most cases this is fairly easy to fix, and nice to have, as running tests and iterating is faster, and debugging is also a bit easier. In some cases running in system containers is hard (like in the selinux or podman roles); in that case don’t bother and remove that tag again.
- Go through the other failures. You can download the log archive and/or run the individual tests locally. The following command makes debugging easier – it keeps the container running for inspection after a failure, and removes containers and temp files from the previous run:

  ```sh
  buildah rm --all; rm -rf /tmp/runcontainer.*; LSR_DEBUG=1 LSR_CONTAINER_PROFILE=false LSR_CONTAINER_PRETTY=false tox -e container-ansible-core-2.16 -- --image-name centos-9-bootc tests/tests_default.yml
  ```

  You can enter the container and debug with `buildah run tests_default bash`. The container name corresponds to the test name; check `buildah ps`.
- Fix the role and tests until you get a green result. Finally, clean up and sort your commits into `fix: Skip runtime operations in non-systemd environments` and `feat: Support this role in container builds`. Any role-specific or more intrusive, self-contained change should go into separate commits before these.
- Add an end-to-end integration test which ensures that running the role during a container build actually works as intended in a QEMU deployment. If there is an existing integration test which has representative complexity and calls the role just once (i.e. tests one scenario), you can convert it like sudo’s bootc e2e test. If there is no existing test, you can also add a specific bootc e2e test like in this demo PR or the postgresql role.
- To run the bootc e2e test locally, see the “Image mode testing” tox-lsr docs.
- Push the e2e test to your branch, iterate until green.
- Send a PR, link it from the Jira epic, get it landed, and update the list in the Jira epic again.
- Celebrate 🎉 and brag about your contribution!