[Bug 1033098] New: systemctl start instance service failed every time
http://bugzilla.suse.com/show_bug.cgi?id=1033098

- Bug ID: 1033098
- Summary: systemctl start instance service failed every time
- Classification: openSUSE
- Product: openSUSE Distribution
- Version: Leap 42.1
- Hardware: Other
- OS: Other
- Status: NEW
- Severity: Normal
- Priority: P5 - None
- Component: Basesystem
- Assignee: bnc-team-screening@forge.provo.novell.com
- Reporter: zlliu@suse.com
- QA Contact: qa-bugs@suse.de
- Found By: ---
- Blocker: ---

kernel version: 4.1.12-1-default
OS: openSUSE Leap 42.1

Details: I tried to reshape a RAID1 array to RAID5 using mdadm, and the systemd service that mdadm starts fails every time; it looks as if systemctl does not escape the argument in the service file.

```
zlliu:~/mdadm # cat /usr/lib/systemd/system/mdadm-grow-continue@.service
# This file is part of mdadm.
#
# mdadm is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.

[Unit]
Description=Manage MD Reshape on /dev/%I
DefaultDependencies=no

[Service]
ExecStart=/sbin/mdadm --grow --continue /dev/%I
StandardInput=null
StandardOutput=null
StandardError=null
KillMode=none
```

Test steps:

1. Create backing files and loop devices:

   ```
   # dd if=/dev/zero of=test1 bs=1M count=100
   # dd if=/dev/zero of=test2 bs=1M count=100
   # dd if=/dev/zero of=test3 bs=1M count=100
   # losetup /dev/loop0 test1
   # losetup /dev/loop1 test2
   # losetup /dev/loop2 test3
   ```

2. Create a RAID1 array and grow it to RAID5:

   ```
   # mdadm -CR /dev/md1 -b internal -l1 -n2 /dev/loop[0-1]
   # mdadm --grow /dev/md1 -l5 -n3 -a /dev/loop2
   ```

You can then check "systemctl status mdadm-grow-continue@md1.service" and "journalctl -xn".

The same scenario (same mdadm source code and same test steps) cannot be reproduced on SLES 12 SP2 or Tumbleweed.

Reproduce: Always

-- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.suse.com/show_bug.cgi?id=1033098

Chenzi Cao <chcao@suse.com> changed:

- Assignee: bnc-team-screening@forge.provo.novell.com → nfbrown@suse.com
http://bugzilla.suse.com/show_bug.cgi?id=1033098#c1

--- Comment #1 from Neil Brown <nfbrown@suse.com> ---

Please run

```
strace -o /tmp/strace -s 100 -f -p 1
```

while attempting the reshape, then attach /tmp/strace. That should give some hint of how far mdadm is getting, and why it is failing.
http://bugzilla.suse.com/show_bug.cgi?id=1033098#c2

--- Comment #2 from zhilong liu <zlliu@suse.com> ---

Created attachment 720650 --> http://bugzilla.suse.com/attachment.cgi?id=720650&action=edit strace

Detailed steps:

```
zlliu:~/mdadm # ./mdadm -CR /dev/md2 -b internal -l1 -n2 /dev/loop[0-1] --assume-clean
zlliu:~/mdadm # cat /proc/mdstat
Personalities : [multipath] [raid1] [raid6] [raid5] [raid4]
md2 : active raid1 loop1[1] loop0[0]
      19968 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
```

Open another terminal:

```
zlliu:~/mdadm # dmesg -c
zlliu:~ # strace -o /tmp/strace -s 100 -f -p 1
Process 1 attached
Process 1914 attached    (this line appeared after the grow command was executed)
```

Back in the first terminal:

```
zlliu:~/mdadm # ./mdadm --grow /dev/md2 -l5 -n3 -a /dev/loop2
mdadm: level of /dev/md2 changed to raid5
mdadm: added /dev/loop2
mdadm: Need to backup 128K of critical section..
zlliu:~/mdadm # cat /proc/mdstat
Personalities : [multipath] [raid1] [raid6] [raid5] [raid4]
md2 : active raid5 loop2[2] loop1[1] loop0[0]
      19968 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      [>....................]  reshape =  0.0% (0/19968) finish=956.8min speed=0K/sec
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
zlliu:~/mdadm # ps -ef | grep raid
root  1676     2  0 10:15 ?      00:00:00 [raid5wq]
root  1905     2  0 10:31 ?      00:00:00 [md2_raid5]
root  1926  1201  0 10:31 pts/0  00:00:00 grep --color=auto raid
zlliu:~/mdadm # cat /proc/1676/stack
[<ffffffff810814bb>] rescuer_thread+0x27b/0x310
[<ffffffff81086581>] kthread+0xc1/0xe0
[<ffffffff8165f462>] ret_from_fork+0x42/0x70
[<ffffffffffffffff>] 0xffffffffffffffff
zlliu:~/mdadm # cat /proc/1905/stack
[<ffffffffa032c72b>] md_thread+0xeb/0x120 [md_mod]
[<ffffffff81086581>] kthread+0xc1/0xe0
[<ffffffff8165f462>] ret_from_fork+0x42/0x70
[<ffffffffffffffff>] 0xffffffffffffffff
zlliu:~/mdadm # cat /proc/1914/stack
cat: /proc/1914/stack: No such file or directory
zlliu:~/mdadm # dmesg -c
[ 1164.436377] md/raid:md2: device loop1 operational as raid disk 1
[ 1164.436381] md/raid:md2: device loop0 operational as raid disk 0
[ 1164.436591] md/raid:md2: allocated 2250kB
[ 1164.444087] md/raid:md2: raid level 5 active with 2 out of 2 devices, algorithm 2
[ 1164.444089] RAID conf printout:
[ 1164.444090]  --- level:5 rd:2 wd:2
[ 1164.444091]  disk 0, o:1, dev:loop0
[ 1164.444092]  disk 1, o:1, dev:loop1
[ 1164.444105] md/raid456: discard support disabled due to uncertainty.
[ 1164.444106] Set raid456.devices_handle_discard_safely=Y to override.
[ 1164.662276] md: bind<loop2>
[ 1165.664048] RAID conf printout:
[ 1165.664053]  --- level:5 rd:3 wd:3
[ 1165.664055]  disk 0, o:1, dev:loop0
[ 1165.664056]  disk 1, o:1, dev:loop1
[ 1165.664058]  disk 2, o:1, dev:loop2
[ 1165.668863] md: reshape of RAID array md2
[ 1165.668866] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[ 1165.668867] md: using maximum available idle IO bandwidth (but not more than 2000 KB/sec) for reshape.
[ 1165.668869] md: using 128k window, over a total of 19968k.
zlliu:~/mdadm # journalctl -xn
-- Logs begin at Fri 2017-04-07 15:47:05 CST, end at Tue 2017-04-11 10:31:07 CST. --
Apr 11 10:31:06 zlliu kernel: md: bind<loop2>
Apr 11 10:31:07 zlliu kernel: RAID conf printout:
Apr 11 10:31:07 zlliu kernel:  --- level:5 rd:3 wd:3
Apr 11 10:31:07 zlliu kernel:  disk 0, o:1, dev:loop0
Apr 11 10:31:07 zlliu kernel:  disk 1, o:1, dev:loop1
Apr 11 10:31:07 zlliu kernel:  disk 2, o:1, dev:loop2
Apr 11 10:31:07 zlliu kernel: md: reshape of RAID array md2
Apr 11 10:31:07 zlliu kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Apr 11 10:31:07 zlliu kernel: md: using maximum available idle IO bandwidth (but not more than 2000 KB/sec) for reshape.
Apr 11 10:31:07 zlliu kernel: md: using 128k window, over a total of 19968k.
zlliu:~/mdadm # systemctl status mdadm-grow-continue@md2.service
mdadm-grow-continue@md2.service - Manage MD Reshape on /dev/md2
   Loaded: loaded (/usr/lib/systemd/system/mdadm-grow-continue@.service; static)
   Active: failed (Result: exit-code) since Tue 2017-04-11 10:31:07 CST; 5min ago
  Process: 1914 ExecStart=/sbin/mdadm --grow --continue /dev/%I (code=exited, status=2)
 Main PID: 1914 (code=exited, status=2)
```

Thanks,
-Zhilong
http://bugzilla.suse.com/show_bug.cgi?id=1033098#c3

Neil Brown <nfbrown@suse.com> changed:

- Status: NEW → CONFIRMED

--- Comment #3 from Neil Brown <nfbrown@suse.com> ---

Look in the strace for where process 1691 exits; just before that we see:

```
open("/sys/block//md//sync_max", O_RDWR)
```

Clearly there is something wrong with that file name. It should be /sys/block/md0/md//sync_max.

Looking at earlier calls in the strace, and comparing with the code in Grow_continue_command(), it is quite easy to identify

```
1691  ioctl(3, GET_ARRAY_INFO, 0x7fff1181b040) = 0
```

then

```
1691  ioctl(3, GET_DISK_INFO, 0x7fff1181b020) = 0
```

and then the strace shows /dev/ being scanned by map_dev(). This continues to

```
1691  open("/dev/loop0", O_RDONLY|O_DIRECT) = 4
```

which matches

```
fd2 = dev_open(dv, O_RDONLY);
```

in the code. A little later is the call

```
if (verify_reshape_position(content, content->array.level) < 0) {
```

and it is verify_reshape_position() which tries to read sync_max, and fails. It expects to find the array name (md0) in 'info', which is passed in from the variable 'content', which was set with

```
content = &array;
```

But array.sys_name is never set. It should be set by a call to sysfs_init(). The else branch of "if (st->ss->external == 0) {" in Grow_continue_command() contains a call to sysfs_init(), but the 'then' branch doesn't.

Can you try adding a call like

```
sysfs_init(content, fd, NULL);
```

after "content = &array;" in Grow_continue_command(), and see if that fixes the problem?
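The effect of the missing sysfs_init() can be sketched outside mdadm. mdadm builds sysfs attribute paths roughly as /sys/block/<sys_name>/md/<dir>/<attr> (the exact format string below is an assumption based on the paths visible in the strace, not a quote from mdadm's source); with sys_name left empty, the path has no device component and the open() can only fail:

```shell
# Illustrative sketch only: mimic how an mdadm-style helper assembles the
# sysfs path. With sys_name empty (never initialized), the result is the
# broken "/sys/block//md//sync_max" seen in the strace.
build_sysfs_path() {
    sys_name="$1"; dir="$2"; attr="$3"
    printf '/sys/block/%s/md/%s/%s\n' "$sys_name" "$dir" "$attr"
}

build_sysfs_path ""  "" sync_max   # -> /sys/block//md//sync_max  (the bug)
build_sysfs_path md2 "" sync_max   # -> /sys/block/md2/md//sync_max (with sys_name set)
```

Note that the trailing "md//sync_max" double slash is harmless (empty path components collapse during path resolution); only the missing device name breaks the lookup.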
http://bugzilla.suse.com/show_bug.cgi?id=1033098#c4

--- Comment #4 from zhilong liu <zlliu@suse.com> ---

Sorry for the late response.

```
# Grow.c
static int reshape_array()
...
	if (!forked && !check_env("MDADM_NO_SYSTEMCTL"))
		if (continue_via_systemd(container ?: sra->sys_name)) {
			free(fdlist);
			free(offsets);
			sysfs_free(sra);
			return 0;
		}
```

Neither Grow_continue() nor Grow_continue_command() is ever invoked directly; they only run when mdadm-grow-continue@%s.service works correctly in the background. In this case, mdadm sent "systemctl start mdadm-grow-continue@%s.service" correctly, but the service failed with exit code 2, while the mdadm process itself had already exited with status 0 by that time.

The root cause has been identified: the ExecStart of the service uses /sbin/mdadm, and the mdadm package in Leap 42.1 is too old:

```
# rpm -qa | grep mdadm
mdadm-3.3.1-8.1.x86_64
```

The reshape works well after I replace /sbin/mdadm with a build from the latest source code (v4.0, January 2017). However, there is one new bug related to mdadm commit 3a77acd7170199adc690332ded37c41f067c720e ("udev-md-raid-assembly.rules: Skip non-ready devices"). To make that easier to track, I will open a bug against SLES12 SP3 and link it to this one.

Please feel free to change this bug's status (e.g. to INVALID), or to submit a request for an mdadm package update against the Leap 42.1 maintenance update.

Thanks,
-Zhilong
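For reference, the hand-off described above is an ordinary templated-unit start, and the %I specifier simply carries the instance name through to ExecStart. A plain-shell illustration of that expansion (this mimics systemd's behaviour for a simple instance name like "md2"; it is not systemd itself):

```shell
# Illustration only: show how the instance name reaches ExecStart via %I
# in the mdadm-grow-continue@.service template. systemd substitutes the
# unescaped instance name for %I.
instance=md2
unit="mdadm-grow-continue@${instance}.service"          # what mdadm starts
template_exec='/sbin/mdadm --grow --continue /dev/%I'   # from the unit file
expanded=$(printf '%s\n' "$template_exec" | sed "s/%I/${instance}/")
echo "$unit"       # -> mdadm-grow-continue@md2.service
echo "$expanded"   # -> /sbin/mdadm --grow --continue /dev/md2
```

For a simple name like "md2" no escaping is needed, which is consistent with the finding that the failure was in the old mdadm binary, not in systemctl's argument handling.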
http://bugzilla.suse.com/show_bug.cgi?id=1033098

zhilong liu <zlliu@suse.com> changed:

- See Also: http://bugzilla.suse.com/show_bug.cgi?id=1033730
http://bugzilla.suse.com/show_bug.cgi?id=1033098#c5

--- Comment #5 from Neil Brown <nfbrown@suse.com> ---

I was curious, so I used git-bisect to find out when this was fixed. It was fixed by commit 8e7ddc5f50af ("Grow: fix problem with --grow --continue") in mdadm 3.3.3.

We like to keep Leap 42.1 the same as SLE12-SP1, so it might be good to backport this patch to SLE12-SP1 and then submit that to Leap 42.1. Maybe.
http://bugzilla.suse.com/show_bug.cgi?id=1033098#c6

--- Comment #6 from zhilong liu <zlliu@suse.com> ---

(In reply to Neil Brown from comment #5)
> I was curious, so I used git-bisect to find out when this was fixed. It was fixed by commit 8e7ddc5f50af ("Grow: fix problem with --grow --continue") in mdadm 3.3.3.
> We like to keep Leap 42.1 the same as SLE12-SP1, so it might be good to backport this patch to SLE12-SP1 and then submit that to Leap 42.1. Maybe.

I will verify this commit on Leap 42.1 today. In addition, I'm writing an email to ask for help with the --size feature.

Thanks,
-Zhilong
http://bugzilla.suse.com/show_bug.cgi?id=1033098#c7

--- Comment #7 from zhilong liu <zlliu@suse.com> ---

OS: 4.1.12-1-default (Leap 42.1)

The reshape still doesn't work after I applied commit 8e7ddc5f50af.

Steps:

1. osc checkout and branch the project to home:zlliu:branches:openSUSE:Leap:42.1:Update/mdadm.
2. Add commit 8e7ddc5f50af and rebuild.
3. Replace the mdadm package.

```
cluster11:~/mdadm # mdadm -CR /dev/md0 -b internal -l1 -n2 /dev/loop1 /dev/loop2
cluster11:~/mdadm # mdadm --grow /dev/md0 -l5 -n3 -a /dev/loop3
mdadm: level of /dev/md0 changed to raid5
mdadm: added /dev/loop3
mdadm: Need to backup 128K of critical section..
mdadm: Cannot start reshape for /dev/md0
mdadm: aborting level change
unfreeze
cluster11:~/mdadm # dmesg -c
[  334.339269] md/raid:md0: device loop2 operational as raid disk 1
[  334.339273] md/raid:md0: device loop1 operational as raid disk 0
[  334.339452] md/raid:md0: allocated 2250kB
[  334.344096] md/raid:md0: raid level 5 active with 2 out of 2 devices, algorithm 2
[  334.344098] RAID conf printout:
[  334.344099]  --- level:5 rd:2 wd:2
[  334.344100]  disk 0, o:1, dev:loop1
[  334.344101]  disk 1, o:1, dev:loop2
[  334.344113] md/raid456: discard support disabled due to uncertainty.
[  334.344114] Set raid456.devices_handle_discard_safely=Y to override.
[  334.580298] md: bind<loop3>
[  334.729490] md: could not open unknown-block(7,3).
[  334.729527] md: md_import_device returned -16
[  335.056265] md/raid1:md0: active with 2 out of 2 mirrors
[  335.056692] md: could not open unknown-block(7,3).
[  335.056729] md: md_import_device returned -16
[  335.238767] RAID1 conf printout:
[  335.238772]  --- wd:2 rd:2
[  335.238774]  disk 0, wo:0, o:1, dev:loop1
[  335.238775]  disk 1, wo:0, o:1, dev:loop2
```

Thanks,
-Zhilong
http://bugzilla.suse.com/show_bug.cgi?id=1033098#c8

Neil Brown <nfbrown@suse.com> changed:

- CC: added zlliu@suse.com
- Flags: added needinfo?(zlliu@suse.com)

--- Comment #8 from Neil Brown <nfbrown@suse.com> ---

(Sorry for the 6 month delay.)

The "-16" error is -EBUSY. That means md is trying to get exclusive access to the device, but something else already has that access.

mdadm opens the device with O_EXCL to get exclusive access, but closes it just before adding the device to the array:

```
close(nfd);
if (ioctl(fd, ADD_NEW_DISK, &info.disk) != 0) {
	pr_err("Cannot add new disk to this array\n");
```

As nfd was open for write access, the close will cause udev to trigger a 'change' event. That could cause some process to run and open the device, possibly with O_EXCL. If this process races with the kernel acting on ADD_NEW_DISK, one of them will lose the race. The error you see is the kernel losing the race.

We could possibly call

```
system("udevadm control --stop-exec-queue");
```

before closing the fd, and

```
system("udevadm control --start-exec-queue");
```

after ADD_NEW_DISK completes, but that feels rather clumsy.

It is almost certainly "mdadm -I" which gets run by udev when the fd is closed. We can prevent that using map_lock(). I.e. declare

```
struct map_ent *map = NULL;
```

Then before close(nfd), run

```
map_lock(&map);
```

and after the ioctl, run

```
map_unlock(&map);
```

Can you try making that change by hand (assuming you can still reproduce the problem), or would you like me to provide a patch?
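The map_lock() fix works by serializing against the udev-spawned "mdadm -I": the incremental-assembly attempt blocks on the lock until ADD_NEW_DISK has finished, so it can no longer hold the device open when the kernel needs exclusive access. A toy sketch of that serialization pattern using flock(1) on a scratch file (purely illustrative; mdadm's map_lock() takes a lock associated with its own map file, and the file names and sleep durations here are fabricated):

```shell
lock=$(mktemp)
log=$(mktemp)

# Holder 1 plays the role of "mdadm --grow": take the lock (map_lock),
# do the work that must not be raced (ADD_NEW_DISK), then release it.
(
    flock 9
    echo "add-disk: start" >> "$log"
    sleep 1                  # critical window the udev racer must not enter
    echo "add-disk: done" >> "$log"
) 9>"$lock" &

sleep 0.2                    # let holder 1 acquire the lock first

# Holder 2 plays the role of the udev-triggered "mdadm -I": it blocks on
# the same lock and only proceeds once the critical section is over.
(
    flock 9
    echo "incremental-assembly: ran" >> "$log"
) 9>"$lock"

wait
cat "$log"
```

The log always shows "add-disk: done" before "incremental-assembly: ran", which is exactly the ordering the proposed map_lock()/map_unlock() pair enforces around the ioctl.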
http://bugzilla.suse.com/show_bug.cgi?id=1033098#c9

Neil Brown <nfbrown@suse.com> changed:

- Status: CONFIRMED → RESOLVED
- Resolution: --- → WORKSFORME

--- Comment #9 from Neil Brown <nfbrown@suse.com> ---

This doesn't seem to be going anywhere, so I'm going to close it. Feel free to re-open if new information becomes available.
participants (1)
-
bugzilla_noreply@novell.com