Bug ID 1186256
Summary qemu-linux-user: hardcoded binfmt handler doesn't play well with containers
Classification openSUSE
Product openSUSE Tumbleweed
Version Current
Hardware Other
OS Other
Status NEW
Severity Normal
Priority P5 - None
Component KVM
Assignee kvm-bugs@suse.de
Reporter martin.wilck@suse.com
QA Contact qa-bugs@suse.de
Found By ---
Blocker ---

Created attachment 849481
Proposed patch for qemu-binfmt-conf.sh

Since abbc0ce ("qemu-binfmt-conf: use qemu-ARCH-binfmt"),
qemu-binfmt-conf.sh under openSUSE automatically replaces the default qemu
binfmt wrapper "qemu-$ARCH" with "qemu-$ARCH-binfmt" in order to ensure that
argv[0] is preserved; qemu-$ARCH-binfmt is a link to qemu-binfmt, which is just
a simple wrapper that mangles argv to achieve the desired result.
This is a SUSE-specific modification which isn't used upstream.
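
The argv handling described above can be sketched roughly as follows. This is an illustrative shell sketch, not the actual wrapper code. It assumes the handler is registered with the binfmt_misc "P" flag, so that the kernel passes the full path of the binary followed by the original argv[0], and it relies on the "-0" option of the qemu user-mode emulators to set argv[0] of the emulated program. The function name mangle_argv is made up for this example.

```shell
#!/bin/sh
# Hypothetical sketch of the argv rewriting a "-binfmt" wrapper performs.
# Assumption: with the "P" flag the kernel invokes the interpreter as
#   <wrapper> <full path of binary> <original argv[0]> [args...]

mangle_argv() {
    wrapper=$1    # e.g. /usr/bin/qemu-ppc64le-binfmt
    binary=$2     # full path of the foreign-arch binary
    argv0=$3      # argv[0] as the caller originally specified it
    shift 3
    # The real emulator is the wrapper name without the "-binfmt" suffix.
    emulator=${wrapper%-binfmt}
    # Print the command line instead of exec()ing it, so it can be inspected.
    echo "$emulator" -0 "$argv0" "$binary" "$@"
}

mangle_argv /usr/bin/qemu-ppc64le-binfmt /bin/busybox uname -m
```

The final step of a real wrapper is an exec() of the native emulator binary, which is the part that matters for the container case discussed below.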

This approach is inconvenient in some situations. In particular for running
foreign-arch containers, it's useful to use the binfmt_misc "F" ("fix
binary") flag to pre-load the qemu wrapper in the kernel. That way,
foreign-arch containers can be run just like native containers, without
having to bind-mount interpreters into the container. But that's impossible
with the SUSE binfmt wrapper that needs to exec() a different (native)
executable.
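
For illustration, a binfmt_misc registration line has the form ":name:type:offset:magic:mask:interpreter:flags", and the "F" flag goes into the last field. The sketch below only builds and prints such a line; it does not write to /proc/sys/fs/binfmt_misc/register, which would require root. The helper name binfmt_rule is hypothetical, and MAGIC and MASK are placeholders for the ELF header bytes that qemu-binfmt-conf.sh defines for the target architecture.

```shell
#!/bin/sh
# Sketch: compose a binfmt_misc registration line for a magic-based ("M") rule.
# usage: binfmt_rule NAME MAGIC MASK INTERPRETER FLAGS
binfmt_rule() {
    # The offset field is left empty, which means offset 0.
    printf ':%s:M::%s:%s:%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# MAGIC and MASK are placeholders, not real values.
# With "F", the kernel opens the interpreter at registration time, so the
# path does not have to exist inside the container's mount namespace.
binfmt_rule qemu-ppc64le MAGIC MASK /usr/bin/qemu-ppc64le F
```

With "F", only the registered interpreter itself is pre-loaded. Anything that interpreter later exec()s is still looked up by path at run time, which is why a wrapper that execs a second binary defeats the purpose.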

In the openSUSE default mode of qemu-binfmt-conf.sh, the user needs to
bind-mount both the -binfmt executable and the actual emulator into the
container:

> $ podman run -it --rm \
>       -v /usr/bin/qemu-ppc64le-binfmt:/usr/bin/qemu-ppc64le-binfmt \
>       -v /usr/bin/qemu-ppc64le:/usr/bin/qemu-ppc64le \
>       ppc64le/busybox uname -m
> ppc64le

Otherwise, the container fails to start:

> $ podman run -t --rm ppc64le/busybox uname -m
> standard_init_linux.go:219: exec user process caused: no such file or directory

If qemu-binfmt-conf.sh is used with the --persistent flag, qemu-ppc64le-binfmt
is loaded into the kernel, but qemu-ppc64le must still be bind-mounted.
If qemu-ppc64le were used directly as the persistent binfmt_misc helper, it
would be sufficient to run the container as if it were a native one:

> $ podman run -it --rm ppc64le/busybox uname -m
> ppc64le
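
To check which interpreter and flags are currently registered, one can inspect the handler's status entry under /proc/sys/fs/binfmt_misc. The sketch below assumes the handler is registered under the name qemu-ppc64le and that the entry contains a "flags:" line. The function name has_fix_binary and the sample input are made up for illustration.

```shell
#!/bin/sh
# Sketch: succeed if a binfmt_misc status entry (read from stdin) lists "F".
has_fix_binary() {
    while IFS= read -r line; do
        case $line in
            flags:*F*) return 0 ;;
        esac
    done
    return 1
}

# Illustrative sample input. On a real system one would instead run:
#   has_fix_binary < /proc/sys/fs/binfmt_misc/qemu-ppc64le
if has_fix_binary <<'EOF'
enabled
interpreter /usr/bin/qemu-ppc64le
flags: F
offset 0
EOF
then
    echo "interpreter is pre-loaded (F flag set)"
else
    echo "interpreter is looked up by path at exec time"
fi
```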

I can see why it makes sense to try to preserve argv[0], but for me at least,
the "foreign container" use case is more important. Therefore I'd like to be
able to switch the behavior of the qemu binfmt_misc helper back to the upstream
default.

So far I've worked around the issue by simply using the upstream container
"docker.io/multiarch/qemu-user-static", but I'd like to be able to do this
easily with openSUSE on-board tools.

The attached patch allows the user to override the default "-binfmt" suffix by
running "qemu-binfmt-conf.sh --qemu-suffix ''".

(Note: "qemu-binfmt-conf.sh -F ''" doesn't work, but that's a different issue.)

