[Bug 1202821] New: Containers lose permission to /dev/pts after some time
https://bugzilla.suse.com/show_bug.cgi?id=1202821

Bug ID: 1202821
Summary: Containers lose permission to /dev/pts after some time
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.4
Hardware: Other
OS: Other
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Containers
Assignee: containers-bugowner@suse.de
Reporter: fvogt@suse.com
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

On openqaworker1, there are four (pretty much identical) containers set up as services which run openQA tests. After some uptime (~1 day?), xterm fails to start due to permission issues. When it happens, those issues can be triggered manually by executing e.g. su -P. After restarting a container, the permission issues disappear for a while. Sometimes a container has a different issue regarding cgroups, where it looks like the assigned cgroup somehow disappeared.

Here's some example output showing the symptoms. Containers 101 and 104 were recently restarted and work fine:

openqaworker1:~ # podman exec -i openqaworker1_container_101 su -P
openqaworker1_container:/ # exit
openqaworker1:~ # podman exec -i openqaworker1_container_104 su -P
openqaworker1_container:/ # exit
openqaworker1:~ # podman exec -i openqaworker1_container_102 su -P
su: failed to create pseudo-terminal: Operation not permitted
openqaworker1:~ # podman exec -i openqaworker1_container_103 su -P
Error: exec failed: unable to start container process: error adding pid 10265 to cgroups: failed to write 10265: openat2 /sys/fs/cgroup/unified/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope/cgroup.procs: no such file or directory: OCI runtime attempted to invoke a command that was not found

I didn't know where to begin looking, so I started at the bottom and used systemtap to trace where openat gets its error code from. I traced the permission issue down to the devices cgroup (v1).
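A quick way to capture the devices-cgroup state of all four containers at once is to script the check; a minimal sketch, assuming the v1 devices hierarchy is mounted at /sys/fs/cgroup/devices and that podman exposes the cgroup path under .State.CgroupPath (as the inspect output below suggests):

    # dump the devices allowlist of each worker container's cgroup
    for c in openqaworker1_container_10{1..4}; do
        cg=$(podman inspect "$c" --format '{{.State.CgroupPath}}')
        echo "== $c ($cg)"
        cat "/sys/fs/cgroup/devices${cg}/devices.list"
    done

In the good state the list still contains the broad entries (e.g. "c 136:* rwm" covering /dev/pts); in the broken state only the narrow /dev/char/* entries remain.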
Some information about the relevant cgroups: container 101 (restarted, works): openqaworker1:~ # podman inspect openqaworker1_container_101 | grep CgroupPath "CgroupPath": "/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope" openqaworker1:~ # cat /sys/fs/cgroup/devices/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope/devices.list c 10:200 rwm c 5:2 rwm c 5:0 rwm c 1:9 rwm c 1:8 rwm c 1:7 rwm c 1:5 rwm c 1:3 rwm b *:* m c *:* m c 136:* rwm openqaworker1:~ # podman inspect openqaworker1_container_101 | grep -i pid "Pid": 9426, "ConmonPid": 9413, "ConmonPidFile": "/var/run/containers/storage/btrfs-containers/955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812/userdata/conmon.pid", "PidFile": "", "PidMode": "private", "PidsLimit": 2048, openqaworker1:~ # cat /proc/9426/cgroup 13:blkio:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 12:perf_event:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 11:devices:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 10:pids:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 9:memory:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 8:rdma:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 7:misc:/ 6:cpuset:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 5:net_cls,net_prio:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 4:freezer:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 3:cpu,cpuacct:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 2:hugetlb:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 1:name=systemd:/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope 0::/machine.slice/libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope container 102 (permission issue, device entries missing): openqaworker1:~ # podman inspect openqaworker1_container_102 | grep CgroupPath "CgroupPath": "/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope" openqaworker1:~ # cat /sys/fs/cgroup/devices/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope/devices.list c 1:3 rwm c 1:5 rwm c 1:7 rwm c 1:8 rwm c 1:9 rwm c 5:0 rwm c 5:2 rwm c 10:200 rwm openqaworker1:~ # cat /proc/17525/cgroup 13:blkio:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 12:perf_event:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 11:devices:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 10:pids:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 9:memory:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 8:rdma:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 7:misc:/ 6:cpuset:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 
5:net_cls,net_prio:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 4:freezer:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 3:cpu,cpuacct:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 2:hugetlb:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 1:name=systemd:/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope 0::/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope container 103 (cgroup missing?): openqaworker1:~ # podman inspect openqaworker1_container_103 | grep -i pid "Pid": 18167, "ConmonPid": 18154, "ConmonPidFile": "/var/run/containers/storage/btrfs-containers/6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0/userdata/conmon.pid", "PidFile": "", "PidMode": "private", "PidsLimit": 2048, openqaworker1:~ # cat /proc/18167/cgroup 13:blkio:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 12:perf_event:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 11:devices:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 10:pids:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 9:memory:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 8:rdma:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 7:misc:/ 6:cpuset:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 5:net_cls,net_prio:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 4:freezer:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 3:cpu,cpuacct:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 2:hugetlb:/machine.slice/libpod-6c5af02df66206caa2b364013a6ef4f8a6add7206beed39d7a2d85bfd0bfc1c0.scope 1:name=systemd:/system.slice/container-openqaworker1_container_103.service 0::/system.slice/container-openqaworker1_container_103.service It's visible that in the case of container 102, the device cgroup entries "b *:* m" and "c *:* m" got removed somehow and for container 103, the cgroup changed completely (system.slice/*.service instead of machine.slice/libpod-*). bug 1178775 sounds similar. I guess systemd is somehow interfering with podman? -- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1202821

Fabian Vogt <fvogt@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                CC |        |mkoutny@suse.com
             Flags |        |needinfo?(mkoutny@suse.com)
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c1

Michal Koutný <mkoutny@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Flags |needinfo?(mkoutny@suse.com) |

--- Comment #1 from Michal Koutný <mkoutny@suse.com> ---
(In reply to Fabian Vogt from comment #0)
On openqaworker1, there are four (pretty much identical) containers set up as services which run openQA tests.
What do you mean by containers set up as services? (I assume it's from project [1], right?)

Q1) Is there a difference in how the containers are started among the four? (I.e. from a user session vs. from a systemd service.)
Sometimes a container has a different issue regarding cgroups, where it looks like the assigned cgroup somehow disappeared.
What do you mean here? Is it the system.slice vs machine.slice discrepancy? Or anything else?
openqaworker1:~ # podman exec -i openqaworker1_container_102 su -P su: failed to create pseudo-terminal: Operation not permitted
This looks suspiciously similar to bug 1178775, except that this is Leap 15.4 with systemd v249 (where that should already be fixed).
openqaworker1:~ # podman exec -i openqaworker1_container_103 su -P Error: exec failed: unable to start container process: error adding pid 10265 to cgroups: failed to write 10265: openat2
If process 10265 had terminated before it could be migrated to the scope's cgroup, we'd get ESRCH. This is ENOENT, so the cgroup doesn't exist, i.e. the *c1c0.scope doesn't exist from systemd's PoV.
It's visible that in the case of container 102, the device cgroup entries "b *:* m" and "c *:* m" got removed somehow and for container 103, the cgroup changed completely (system.slice/*.service instead of machine.slice/libpod-*).
Q2) What cgroup "driver" does podman use for these containers? (cgroupManager in podman lingo, cgroupfs vs systemd.)
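For reference, this can be queried directly; a minimal sketch (the Go-template field name is an assumption):

    podman info --format '{{.Host.CgroupManager}}'                      # host-wide default
    podman inspect openqaworker1_container_101 | grep -i cgroupmanager  # recorded per container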
bug 1178775 sounds similar. I guess systemd is somehow interfering with podman?
Or podman is interfering with systemd. :-)

1)
Q3) Would it be too bold to ask to switch the host to the unified mode (system.cgroup_unified_hierarchy=1 to kernel cmdline)? (Issues with device controller and maintaining parallel hierarchies with systemd (and container runtime) would likely be gone with just the unified hierarchy.)

[1] https://github.com/openSUSE/containers-systemd
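A quick way to verify which mode the host actually ended up in after such a change (a sketch; "cgroup2fs" indicates the unified hierarchy, "tmpfs" the hybrid/legacy layout):

    stat -fc %T /sys/fs/cgroup
    grep cgroup /proc/mounts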
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c2

--- Comment #2 from Fabian Vogt <fvogt@suse.com> ---
(In reply to Michal Koutný from comment #1)
(In reply to Fabian Vogt from comment #0)
On openqaworker1, there are four (pretty much identical) containers set up as services which run openQA tests.
What do you mean by containers set up as services? (I assume it's from project [1], right?)
"podman generate-systemd" in this case.
Q1) Is there a difference in how the containers are started among the four? (I.e. from a user session vs. from a systemd service.)
All of them use practically identical systemd units and FWICT it's random which of the containers fails in which way...
Sometimes a container has a different issue regarding cgroups, where it looks like the assigned cgroup somehow disappeared.
What do you mean here? Is it the system.slice vs machine.slice discrepancy? Or anything else?
That the cgroup assigned to the container (resp. the other way around) disappeared, resulting in ENOENT.
openqaworker1:~ # podman exec -i openqaworker1_container_102 su -P su: failed to create pseudo-terminal: Operation not permitted
This looks suspiciously similar to bug 1178775, except that this is Leap 15.4 with systemd v249 (where that should already be fixed).
openqaworker1:~ # podman exec -i openqaworker1_container_103 su -P Error: exec failed: unable to start container process: error adding pid 10265 to cgroups: failed to write 10265: openat2
If process 10265 had terminated before it could be migrated to the scope's cgroup, we'd get ESRCH. This is ENOENT, so the cgroup doesn't exist, i.e. the *c1c0.scope doesn't exist from systemd's PoV.
It's visible that in the case of container 102, the device cgroup entries "b *:* m" and "c *:* m" got removed somehow and for container 103, the cgroup changed completely (system.slice/*.service instead of machine.slice/libpod-*).
Q2) What cgroup "driver" does podman use for these containers? (cgroupManager in podman lingo, cgroupfs vs systemd.)
All of them have "CgroupManager": "systemd"
bug 1178775 sounds similar. I guess systemd is somehow interfering with podman?
Or podman is interfering with systemd. :-)
1)
Q3) Would it be too bold to ask to switch the host to the unified mode (system.cgroup_unified_hierarchy=1 to kernel cmdline)? (Issues with device controller and maintaining parallel hierarchies with systemd (and container runtime) would likely be gone with just the unified hierarchy.)
I tried that (without typo, "systemd.unified_cgroup_hierarchy=1"). The kernel parameter is used, but it looks like cgroupv1 is still used by systemd, at least for devices. Is that expected? openqaworker1:~ # cat /proc/cmdline BOOT_IMAGE=/boot/vmlinuz-5.14.21-150400.24.18-default root=UUID=ff1922d2-d2e4-4860-9634-acac681dd0f9 resume=/dev/md0 nospec console=tty0 console=ttyS1,115200n resume=/dev/disk/by-uuid/10264dd9-ba3d-4ef1-8db9-0d74df0d43f1 splash=silent quiet showopts nospec spectre_v2=off pti=off systemd.cgroup_unified_hierarchy=1 openqaworker1:~ # ls /sys/fs/cgroup/devices/*.slice/ /sys/fs/cgroup/devices/machine.slice/: cgroup.clone_children cgroup.procs devices.allow devices.deny devices.list notify_on_release tasks /sys/fs/cgroup/devices/openqa.slice/: cgroup.clone_children cgroup.procs devices.allow devices.deny devices.list notify_on_release tasks /sys/fs/cgroup/devices/system.slice/: auditd.service cron.service mdmonitor.service postfix.service systemd-journald.service wickedd-auto4.service boot-grub2-i386\x2dpc.mount dbus.service notify_on_release rebootmgr.service systemd-logind.service wickedd-dhcp4.service boot-grub2-x86_64\x2defi.mount devices.allow nscd.service root.mount systemd-udevd.service wickedd-dhcp6.service cgroup.clone_children devices.deny openqa-worker-cacheservice-minion.service rsyslog.service system-getty.slice wickedd-nanny.service cgroup.procs devices.list openqa-worker-cacheservice.service smartd.service system-modprobe.slice wickedd.service chronyd.service firewalld.service opt.mount srv.mount system-serial\x2dgetty.slice \x2esnapshots.mount container-openqaworker1_container_101.service haveged.service os-autoinst-openvswitch.service sshd.service tasks container-openqaworker1_container_102.service home.mount ovsdb-server.service sysroot-etc.mount tmp.mount container-openqaworker1_container_103.service irqbalance.service ovs-vswitchd.service sysroot.mount usr-local.mount container-openqaworker1_container_104.service mcelog.service polkit.service sysroot-var.mount var-lib-openqa.mount /sys/fs/cgroup/devices/user.slice/: cgroup.clone_children cgroup.procs devices.allow devices.deny devices.list notify_on_release tasks Also, this is not the default, so IMO even if it works with unified it should still be fixed with cgv1 or the default changed.
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c3

--- Comment #3 from Michal Koutný <mkoutny@suse.com> ---
(In reply to Fabian Vogt from comment #2)
All of them use practically identical systemd units and FWICT it's random which of the containers fails in which way...
To deal with the randomness, I'd suggest enabling debug logging of systemd and capturing the journal logs when the first issue occurs (a few periods back, where I understand a period ~ a single container lifetime). Could you collect such data? (Or possibly just share the journal data that you have for the current instance (without debug level).)
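A sketch of one way to do that on systemd v249 (the signal-based toggle is the documented fallback if systemd-analyze is unavailable):

    # raise the manager's log level at runtime
    systemd-analyze log-level debug     # or: kill -SIGRTMIN+22 1 (SIGRTMIN+23 reverts to info)

    # after the failure window, collect PID 1's messages plus the container unit's journal
    journalctl -b -o short-precise _PID=1 > systemd-debug.log
    journalctl -b -u container-openqaworker1_container_102.service >> systemd-debug.log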
I tried that (without typo, "systemd.unified_cgroup_hierarchy=1").
Sorry about that, it hits me all the time. `man systemd` is correct.
The kernel parameter is used, but it looks like cgroupv1 is still used by systemd, at least for devices. Is that expected?
No, that's suspicious. The device controller functionality is replaced with BPF programs in unified mode. (Isn't it still the typo? ':-)) What does `grep cgroup /proc/mounts` say on such a system?
Also, this is not the default, so IMO even if it works with unified it should still be fixed with cgv1 or the default changed.
Understood.
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c4

--- Comment #4 from Fabian Vogt <fvogt@suse.com> ---
(In reply to Michal Koutný from comment #3)
(In reply to Fabian Vogt from comment #2)
All of them use practically identical systemd units and FWICT it's random which of the containers fails in which way...
To deal with the randomness, I'd suggest enabling debug logging of systemd and capturing the journal logs when the first issue occurs (a few periods back, where I understand a period ~ a single container lifetime). Could you collect such data? (Or possibly just share the journal data that you have for the current instance (without debug level).)
I can try, but it's not easy to tell when it breaks as we only know that it's broken when a test starts (a couple times a day). So there's always a window of a few hours. We could try to set up a "podman exec ... su -P" loop or something. Should we focus on that or testing with cgroups v2?
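Such a loop could look roughly like this (a minimal sketch; interval and log path are arbitrary):

    while sleep 60; do
        for c in openqaworker1_container_10{1..4}; do
            podman exec -i "$c" su -P -c true >/dev/null 2>&1 \
                || echo "$(date -Is) $c: pts allocation failed" >> /var/log/pts-check.log
        done
    done

This would pin the journal timestamp of the first failure down to about a minute.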
I tried that (without typo, "systemd.unified_cgroup_hierarchy=1").
Sorry about that, it hits me all the time. `man systemd` is correct.
The kernel parameter is used, but it looks like cgroupv1 is still used by systemd, at least for devices. Is that expected?
No, that's suspicious. The device controller functionality is replaced with BPF programs with unified mode. (Isn't it still the typo? ':-))
I hope my last comment shows the correct name in /proc/cmdline...
What does `grep cgroup /proc/mounts` say on such a system?
Both hierarchies are mounted:

openqaworker1:~ # findmnt -R /sys
TARGET                              SOURCE     FSTYPE     OPTIONS
/sys                                sysfs      sysfs      rw,nosuid,nodev,noexec,relatime
├─/sys/kernel/security              securityfs securityfs rw,nosuid,nodev,noexec,relatime
├─/sys/fs/cgroup                    tmpfs      tmpfs      ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,inode64
│ ├─/sys/fs/cgroup/unified          cgroup2    cgroup2    rw,nosuid,nodev,noexec,relatime,nsdelegate
│ ├─/sys/fs/cgroup/systemd          cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,xattr,name=systemd
│ ├─/sys/fs/cgroup/cpuset           cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,cpuset
│ ├─/sys/fs/cgroup/cpu,cpuacct      cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,cpu,cpuacct
│ ├─/sys/fs/cgroup/freezer          cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,freezer
│ ├─/sys/fs/cgroup/blkio            cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,blkio
│ ├─/sys/fs/cgroup/memory           cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,memory
│ ├─/sys/fs/cgroup/pids             cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,pids
│ ├─/sys/fs/cgroup/net_cls,net_prio cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,net_cls,net_prio
│ ├─/sys/fs/cgroup/perf_event       cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,perf_event
│ ├─/sys/fs/cgroup/hugetlb          cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,hugetlb
│ ├─/sys/fs/cgroup/misc             cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,misc
│ ├─/sys/fs/cgroup/rdma             cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,rdma
│ └─/sys/fs/cgroup/devices          cgroup     cgroup     rw,nosuid,nodev,noexec,relatime,devices
├─/sys/fs/pstore                    pstore     pstore     rw,nosuid,nodev,noexec,relatime
├─/sys/fs/bpf                       none       bpf        rw,nosuid,nodev,noexec,relatime,mode=700
├─/sys/kernel/tracing               tracefs    tracefs    rw,nosuid,nodev,noexec,relatime
├─/sys/kernel/debug                 debugfs    debugfs    rw,nosuid,nodev,noexec,relatime
│ └─/sys/kernel/debug/tracing       tracefs    tracefs    rw,nosuid,nodev,noexec,relatime
├─/sys/fs/fuse/connections          fusectl    fusectl    rw,nosuid,nodev,noexec,relatime
└─/sys/kernel/config                configfs   configfs   rw,nosuid,nodev,noexec,relatime
Also, this is not the default, so IMO even if it works with unified it should still be fixed with cgv1 or the default changed.
Understood.
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c5

--- Comment #5 from Fabian Vogt <fvogt@suse.com> ---
I checked the man page. It's systemd.unified_cgroup_hierarchy, not systemd.cgroup_unified_hierarchy... I'll change that.
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c7

--- Comment #7 from Michal Koutný <mkoutny@suse.com> ---
(In reply to Fabian Vogt from comment #4)
Should we focus on that or testing with cgroups v2?
v2 should be a workaround if you need the worker up 'n running quickly (hopefully). For SP4, when no v2-specific feature [1] is required, we should fix this (as you wrote). So tracking it down with a reproducer is still helpful.

[1] FTR, I'd count unprivileged containers among those too (do the source services have User!=root?).
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c8

Michal Koutný <mkoutny@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
             Flags |        |needinfo?(dennis@glindhart.dk)

--- Comment #8 from Michal Koutný <mkoutny@suse.com> ---
(In reply to Dennis Glindhart from comment #6)
I had a similar problem with podman containers started via systemd-services after installation of some newer version of systemd (in Tumbleweed).
Tumbleweed uses v2 by default. Do you override this default? (Otherwise the device controller hierarchy would not exist at all.)
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c10

--- Comment #10 from Michal Koutný <mkoutny@suse.com> ---
Dennis, I'd suggest filing another bug for TW with details (what issue you saw, what service file). TY (Closing as dup never hurts.)
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c12

--- Comment #12 from Fabian Vogt <fvogt@suse.com> ---
(In reply to Dennis Glindhart from comment #6)
I had a similar problem with podman containers started via systemd-services after installation of some newer version of systemd (in Tumbleweed).
I found I could reproduce the problem by executing *systemctl daemon-reload* - after which the problem would occur. I guess some cgroups were bound to the systemd daemon somehow.
Can that help getting a reliable reproduce?
Yep, even with unified cgroups v2!

openqaworker1:~ # podman exec -i openqaworker1_container_102 su -P
openqaworker1_container:/ # exit
openqaworker1:~ # systemctl daemon-reload
openqaworker1:~ # podman exec -i openqaworker1_container_102 su -P
su: failed to create pseudo-terminal: Operation not permitted
openqaworker1:~ #

(In reply to Michal Koutný from comment #7)
(In reply to Fabian Vogt from comment #4)
Should we focus on that or testing with cgroups v2?
v2 should be a workaround if you need the worker up 'n running quickly (hopefully).
For SP4 when no v2-specific feature [1] is required, we should fix that (as you wrote). So tracking it down with a reproducer is still helpful.
[1] FTR, I'd count unprivileged containers among those too (have the source services User!=root?).
Nope, started as root. The process inside the container runs as non-root though.
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c13

Michal Koutný <mkoutny@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                CC |        |systemd-maintainers@suse.de

--- Comment #13 from Michal Koutný <mkoutny@suse.com> ---
v2 case:

(I looked at the affected worker machine)

strace of `su -P`:
19830 ioctl(3, TIOCSPTLCK, [0] <unfinished ...>
19830 <... ioctl resumed>) = 0
19830 ioctl(3, TCGETS <unfinished ...>
19830 <... ioctl resumed>, {B38400 opost isig icanon echo ...}) = 0
19830 ioctl(3, TIOCGPTN <unfinished ...>
19830 <... ioctl resumed>, [4]) = 0
19830 stat("/dev/pts/4", <unfinished ...>
19830 <... stat resumed>{st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x4), ...}) = 0
19830 openat(AT_FDCWD, "/dev/pts/4", O_RDWR|O_NOCTTY <unfinished ...>
19830 <... openat resumed>) = -1 EPERM (Operation not permitted)
this command is run in the context of container .scope unit:
/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope
That scope, among other things, specifies:
...c894430f72beb3eb.scope.d/50-DeviceAllow.conf
[Scope]
DeviceAllow=
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
and
...c894430f72beb3eb.scope.d/50-DevicePolicy.conf
# /run/systemd/transient/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f0>
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DevicePolicy=strict
IOW, the unit is configured (by podman [1]) in such a way that it allows only the listed devices; 136:4 (/dev/pts/4) is not among them. The bug here is rather the inverse: the BPF rules are not properly applied until `systemctl daemon-reload` is invoked. (I guess it might be related to the fact that .scope creation runs "concurrently" with ExecStart= of the service.)

[1] The comment about `systemctl set-property` is slightly misleading, as it means the properties were defined via the DBus API.

v1 case:

I believe it's similar (wrt device access, not the non-existent cgroup). The device controller's strict rules aren't applied until something causes systemd to re-realize the cgroup settings (like daemon-reload), and then `su -P` fails.

---

So, you (containers/openqa) may want to check why libpod scopes have strict device policy and me (systemd, +cc systemd-maintainers) may want to check why device rules are not properly applied.
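For comparison only (not something proposed in this report): an allowance covering the whole pts range would be expressed towards systemd roughly like this, using the "char-pts" device-group specifier from systemd.resource-control(5); <scope> stands for the container's libpod-*.scope unit name:

    systemctl set-property --runtime <scope> 'DeviceAllow=char-pts rwm'

That is the kind of entry missing from the 50-DeviceAllow.conf drop-in above.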
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c14

Michal Koutný <mkoutny@suse.com> changed:

           What    |Removed |Added
----------------------------------------------------------------------------
                CC |        |fvogt@suse.com
             Flags |        |needinfo?(fvogt@suse.com)

--- Comment #14 from Michal Koutný <mkoutny@suse.com> ---
When the system is in the state that allows `podman exec -i $cont su -P`, could you please collect `systemd-analyze dump`? (I'm interested in the sections of the respective libpod-*.scope, machine.slice and -.slice.)
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c15

Fabian Vogt <fvogt@suse.com> changed:

           What    |Removed                   |Added
----------------------------------------------------------------------------
             Flags |needinfo?(fvogt@suse.com) |

--- Comment #15 from Fabian Vogt <fvogt@suse.com> ---
(In reply to Michal Koutný from comment #13)
v2 case:
(I looked at the affected worker machine)
strace of `su -P`:
19830 ioctl(3, TIOCSPTLCK, [0] <unfinished ...>
19830 <... ioctl resumed>) = 0
19830 ioctl(3, TCGETS <unfinished ...>
19830 <... ioctl resumed>, {B38400 opost isig icanon echo ...}) = 0
19830 ioctl(3, TIOCGPTN <unfinished ...>
19830 <... ioctl resumed>, [4]) = 0
19830 stat("/dev/pts/4", <unfinished ...>
19830 <... stat resumed>{st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x4), ...}) = 0
19830 openat(AT_FDCWD, "/dev/pts/4", O_RDWR|O_NOCTTY <unfinished ...>
19830 <... openat resumed>) = -1 EPERM (Operation not permitted)
this command is run in the context of container .scope unit:
/machine.slice/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f02c894430f72beb3eb.scope
That scope, among other things, specifies:
...c894430f72beb3eb.scope.d/50-DeviceAllow.conf
[Scope]
DeviceAllow=
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
and
...c894430f72beb3eb.scope.d/50-DevicePolicy.conf
# /run/systemd/transient/libpod-bb7f7bed785fed4e77244d316cea6ee21ba9e6a26b609f0>
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DevicePolicy=strict
IOW, the unit is configured (by podman [1]) in such a way that it allows only the listed devices; 136:4 (/dev/pts/4) is not among them.
I assume this libpod scope is created by podman's systemd cgroup controller?
The bug here is rather inverse, the BPF rules are not properly applied until `systemctl daemon-reload` is invoked.
The question is whether the bug is that the scope is too restrictive or that podman's own default is too lenient. I don't know where the default set of allowed device nodes is currently specified.
(I guess it might be related to the fact that .scope creation is run "concurrently" with ExecStart= of the service.)
The issue is reproducible even when using "podman start" manually instead of "systemctl start container-openqaworker1_container_101.service".
[1] The comment about `systemctl set-property` is slightly misleading as it means the properties were defined via DBus API.
v1 case:
I believe, it's similar (wrt device access, not non-existent cgroup). The device controller strict rules aren't applied until something causes systemd to re-realize cgroup settings (like daemon-reload) and then `su -P` fails.
---
So, you (containers/openqa) may want to check why libpod scopes have strict device policy and me (systemd, +cc systemd-maintainers) may want to check why device rules are not properly applied.
Yep, I'll try to have a look.

(In reply to Michal Koutný from comment #14)
When the system is in the state that allows `podman exec -i $cont su -P`, could you please collect `systemd-analyze dump`? (I'm interested in the sections of the respective libpod-*.scope, machine.slice and -.slice.)
Attachment incoming. Container 101 is working, the others are broken. FTR, you can easily get back into the working state with "systemctl restart container-openqaworker1_container_101.service".
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c16

--- Comment #16 from Fabian Vogt <fvogt@suse.com> ---
Created attachment 861197
  --> https://bugzilla.suse.com/attachment.cgi?id=861197&action=edit
systemd-analyze dump (container 101 working, others broken)
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c17

Fabian Vogt <fvogt@suse.com> changed:

           What       |Removed                     |Added
----------------------------------------------------------------------------
           Status     |NEW                         |RESOLVED
           Resolution |---                         |FIXED
           Assignee   |containers-bugowner@suse.de |fvogt@suse.com

--- Comment #17 from Fabian Vogt <fvogt@suse.com> ---
(In reply to Michal Koutný from comment #13)
So, you (containers/openqa) may want to check why libpod scopes have strict device policy
While searching through the podman and runc code to figure that out, I saw that the latest commit in runc was "Merge pull request #3559 from kolyshkin/fix-dev-pts". Indeed, this was a recent regression in runc 1.1.3, which was fixed in 1.1.4, released just five days ago. I updated our runc package to that version and the issue is gone. Submitted to SLE-15:Update and TW.
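For reference, a sketch of how the fix can be verified on the worker (package and unit names as used earlier in this report):

    rpm -q runc        # expect 1.1.4 or newer
    systemctl restart container-openqaworker1_container_10{1..4}.service
    systemctl daemon-reload    # this used to break pts access afterwards
    podman exec -i openqaworker1_container_102 su -P -c true && echo "pts still OK"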
https://bugzilla.suse.com/show_bug.cgi?id=1202821
https://bugzilla.suse.com/show_bug.cgi?id=1202821#c18

--- Comment #18 from Michal Koutný <mkoutny@suse.com> ---
(In reply to Michal Koutný from comment #13)
me (systemd, +cc systemd-maintainers) may want to check why device rules are not properly applied.
From the dump:
-> Unit libpod-955b615b984df586c92fdc7177ab4a8338bdbad109c3b9fc151ec90e7f420812.scope:
...
CGroup realized: yes
CGroup realized mask: cpu cpuset io memory pids bpf-firewall bpf-devices bpf-foreign
...
DeviceAllow: /dev/char/10:200 rwm
DeviceAllow: /dev/char/5:2 rwm
DeviceAllow: /dev/char/5:0 rwm
DeviceAllow: /dev/char/1:9 rwm
DeviceAllow: /dev/char/1:8 rwm
DeviceAllow: /dev/char/1:7 rwm
DeviceAllow: /dev/char/1:5 rwm
DeviceAllow: /dev/char/1:3 rwm
This shows that systemd realized bpf-devices (i.e. BPF programs are attached) and the DeviceAllow list does not include the pts wildcard. /dev/pts devices should not be accessible at this moment. They are allowed though, because runc modifies the BPF predicates (thanks Fabian for checking with bpftool) but doesn't tell systemd about that. After `systemctl daemon-reload`, PID 1 just applies what it was told about; that's correct behavior.
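The bpftool check mentioned above can be repeated roughly like this (a sketch; the <id> placeholders need to be filled in, and on the hybrid layout shown earlier the cgroup path lives under /sys/fs/cgroup/unified instead):

    # list BPF programs attached to the container's cgroup; the "device" attach
    # type is the one enforcing DevicePolicy=/DeviceAllow=
    bpftool cgroup show /sys/fs/cgroup/machine.slice/libpod-<id>.scope

    # dump a specific device program to inspect the allow-list predicates runc rewrote
    bpftool prog dump xlated id <prog id>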
https://bugzilla.suse.com/show_bug.cgi?id=1202821 https://bugzilla.suse.com/show_bug.cgi?id=1202821#c21 --- Comment #21 from Swamp Workflow Management <swamp@suse.de> --- SUSE-RU-2022:3435-1: An update that has one recommended fix can now be installed. Category: recommended (important) Bug References: 1202821 CVE References: JIRA References: Sources used: openSUSE Leap Micro 5.2 (src): runc-1.1.4-150000.33.4 openSUSE Leap 15.4 (src): runc-1.1.4-150000.33.4 openSUSE Leap 15.3 (src): runc-1.1.4-150000.33.4 SUSE Linux Enterprise Module for Containers 15-SP4 (src): runc-1.1.4-150000.33.4 SUSE Linux Enterprise Module for Containers 15-SP3 (src): runc-1.1.4-150000.33.4 SUSE Linux Enterprise Micro 5.2 (src): runc-1.1.4-150000.33.4 SUSE Linux Enterprise Micro 5.1 (src): runc-1.1.4-150000.33.4 SUSE Enterprise Storage 7.1 (src): runc-1.1.4-150000.33.4 SUSE Enterprise Storage 7 (src): runc-1.1.4-150000.33.4 NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination. -- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1202821 https://bugzilla.suse.com/show_bug.cgi?id=1202821#c22 --- Comment #22 from Swamp Workflow Management <swamp@suse.de> --- SUSE-RU-2022:3927-1: An update that has two recommended fixes can now be installed. Category: recommended (moderate) Bug References: 1202021,1202821 CVE References: JIRA References: Sources used: openSUSE Leap Micro 5.2 (src): runc-1.1.4-150000.36.1 openSUSE Leap 15.4 (src): runc-1.1.4-150000.36.1 openSUSE Leap 15.3 (src): runc-1.1.4-150000.36.1 SUSE Manager Server 4.1 (src): runc-1.1.4-150000.36.1 SUSE Manager Retail Branch Server 4.1 (src): runc-1.1.4-150000.36.1 SUSE Manager Proxy 4.1 (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Server for SAP 15-SP2 (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Server for SAP 15-SP1 (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Server for SAP 15 (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Server 15-SP2-LTSS (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Server 15-SP2-BCL (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Server 15-SP1-LTSS (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Server 15-SP1-BCL (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Server 15-LTSS (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Module for Containers 15-SP4 (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Module for Containers 15-SP3 (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Micro 5.3 (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Micro 5.2 (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise Micro 5.1 (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise High Performance Computing 15-SP2-LTSS (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise High Performance Computing 15-SP2-ESPOS (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise High Performance Computing 15-SP1-LTSS (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise High Performance Computing 15-SP1-ESPOS (src): runc-1.1.4-150000.36.1 SUSE Linux Enterprise High Performance Computing 15-LTSS (src): runc-1.1.4-150000.36.1 SUSE Enterprise Storage 7.1 (src): runc-1.1.4-150000.36.1 SUSE Enterprise Storage 7 (src): runc-1.1.4-150000.36.1 SUSE Enterprise Storage 6 (src): runc-1.1.4-150000.36.1 SUSE CaaS Platform 4.0 (src): runc-1.1.4-150000.36.1 NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination. -- You are receiving this mail because: You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1202821 https://bugzilla.suse.com/show_bug.cgi?id=1202821#c23 --- Comment #23 from Swamp Workflow Management <swamp@suse.de> --- SUSE-RU-2022:3944-1: An update that has two recommended fixes can now be installed. Category: recommended (moderate) Bug References: 1202021,1202821 CVE References: JIRA References: Sources used: SUSE Linux Enterprise Module for Containers 12 (src): runc-1.1.4-16.24.1 NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination. -- You are receiving this mail because: You are on the CC list for the bug.
participants (1)
-
bugzilla_noreply@suse.com