[Bug 760274] New: Default load behaviour of ch module (scsi-changer driver) harmful: causes udev timeouts and emergency-mode boot
https://bugzilla.novell.com/show_bug.cgi?id=760274

https://bugzilla.novell.com/show_bug.cgi?id=760274#c0

           Summary: Default load behaviour of ch module (scsi-changer driver)
                    harmful: causes udev timeouts and emergency-mode boot
    Classification: openSUSE
           Product: openSUSE 12.1
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 12.1
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
        AssignedTo: kernel-maintainers@forge.provo.novell.com
        ReportedBy: pkeller@globalphasing.com
         QAContact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Created an attachment (id=489116)
 --> (http://bugzilla.novell.com/attachment.cgi?id=489116)
Output of 'dmesg' immediately after an unsuccessful boot

User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20100101 Firefox/12.0

Observed on a system with a 16-slot tape library, and LVM logical volumes
created on an md software RAID. With just two tapes in the library, the system
boots successfully, with some timeouts from udev. With 10 tapes in the library,
there are numerous udev timeouts, filesystems on the LVM logical volumes are
not found/mounted during boot, and the system drops into emergency mode.

This has been discussed on the openSUSE forums here:

http://forums.opensuse.org/english/get-technical-help-here/install-boot-logi...

Hardware details:

Motherboard: Intel S5520HCR
CPU: Intel Xeon E5620
Tape library: Quantum Superloader 3/LTO-5 drive, connected to an LSI SAS
9200-8e HBA
RAID: Seagate ST33000650NS disks (8 active; 2 spare), connected internally to
an LSI SAS 9201-16i HBA
Disk containing OS install: Intel SSDSA2C208 (connected to motherboard)

Even on a successful boot, dmesg output contains some lines such as:

[ 74.592812] udevd[557]: timeout: killing '/sbin/modprobe -bv scsi:t-0x01' [741]
[ 74.600206] udevd[582]: timeout: killing '/sbin/modprobe -bv scsi:t-0x08' [736]
[ 74.611878] udevd[588]: timeout: killing '/sbin/modprobe -bv scsi:t-0x05' [786]

On an unsuccessful boot, we observe around 50 such lines for each module. In
addition:

[ 96.363335] systemd[1]: Job dev-disk-by\x2dlabel-amanda_regular.device/start timed out.
[ 96.366740] systemd[1]: Job remote-fs-pre.target/start failed with result 'dependency'.
[ 96.366746] systemd[1]: Job local-fs.target/start failed with result 'dependency'.
[ 96.366750] systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
[ 96.370341] systemd[1]: Job mnt-data-irregular.mount/start failed with result 'dependency'.
[ 96.373989] systemd[1]: Job fsck@dev-disk-by\x2dlabel-data_irregular.service/start failed with result 'dependency'.
[ 96.377759] systemd[1]: Job mnt-data-regular.mount/start failed with result 'dependency'.
[ 96.381638] systemd[1]: Job fsck@dev-disk-by\x2dlabel-data_regular.service/start failed with result 'dependency'.
[ 96.381978] systemd[1]: Job mnt-amanda_holding_regular.mount/start failed with result 'dependency'.
[ 96.381982] systemd[1]: Job dev-disk-by\x2dlabel-amanda_regular.device/start failed with result 'timeout'.
[ 98.202460] systemd[1]: md.service: control process exited, code=exited status=3
[ 98.254235] systemd[1]: Unit md.service entered failed state.
[ 98.339054] boot.md[1039]: Not shutting down MD RAID - reboot/halt scripts do this...missing

and:

[ 132.413738] ch0: ... finished
[ 132.413743] ch 5:0:0:1: Attached scsi changer ch0
[ 132.414401] st 5:0:0:0: Attached scsi tape st0
[ 132.414404] st 5:0:0:0: st0: try direct i/o: yes (alignment 4 B)
[ 132.414552] udevd[582]: '/sbin/modprobe -bv scsi:t-0x08' [736] terminated by signal 9 (Killed)
[ 132.418085] udevd[557]: '/sbin/modprobe -bv scsi:t-0x01' [741] terminated by signal 9 (Killed)
[ 132.421387] udevd[588]: '/sbin/modprobe -bv scsi:t-0x05' [786] terminated by signal 9 (Killed)
[ 132.744011] systemd[1]: Startup finished in 6s 13ms 233us (kernel) + 2min 6s 727ms 5us (userspace) = 2min 12s 740ms 238us.

and the system starts up in emergency mode.

Note that we changed the sysconfig parameter MDADM_DEVICE_TIMEOUT from 60 to
240; otherwise there were additional messages from md.service that the udev
event queue was not empty. This change was not sufficient to solve the problem,
though. Also, as can be seen on the forum post, attempts to solve the problem
using udev-settle.service were unsuccessful.

During booting, the tape library would operate: mechanical operating sounds
were heard, and the tape library management interface showed that
move/calibrate operations were being done. This implicated the scsi-changer
driver (/lib/modules/3.1.10-1.9-default/kernel/drivers/scsi/ch.ko). 'modinfo ch'
shows that the module by default does an initialization on load, which has a
timeout of 1 hour. Adding the following line to /etc/modprobe.d/99-local.conf
allows the system to boot normally:

options ch init=0

According to discussion on the forum post referenced above, no kernel module
should take more than 60s to load, and udev is now enforcing this. Other
drivers have been rewritten to do load-time initialisation asynchronously,
which suggests that the same should be done for scsi-changer.

Loading with 'init=0' is a workaround, although after boot the module may need
to be unloaded and reloaded with 'init=1'. Blacklisting it would be another
workaround as long as it is not needed (e.g. the Amanda backup system uses the
generic scsi driver to manipulate the changer, not scsi-changer).

The nature of the interaction between the loading of ch and the hardware
underlying the LVM system has not been investigated, but it may be relevant
that, since both HBAs are from LSI, they use the same driver (mpt2sas: the
version included in openSUSE 12.1 is being used).

Reproducible: Always

Steps to Reproduce:
1.
2.
3.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
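For anyone comparing a good boot against a bad one, the udevd timeout lines quoted in the report can be tallied per modalias. The following is only a minimal sketch: the function name count_udev_timeouts and the SAMPLE text are hypothetical, the regular expression assumes the exact message format shown in the dmesg excerpts above, and the type names in the comment come from the SCSI peripheral device type codes rather than from the report itself.

```python
import re
from collections import Counter

# Assumed message format, taken from the dmesg excerpts in the report:
# [ 74.592812] udevd[557]: timeout: killing '/sbin/modprobe -bv scsi:t-0x01' [741]
# SCSI peripheral device types: 0x01 = tape, 0x05 = CD/DVD, 0x08 = medium changer.
TIMEOUT_RE = re.compile(
    r"udevd\[\d+\]: timeout: killing '/sbin/modprobe -bv (?P<alias>[^']+)'"
)


def count_udev_timeouts(dmesg_text):
    """Return a Counter mapping each modalias to its number of timeout kills."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("alias")] += 1
    return counts


# Hypothetical sample built from lines quoted in the report.
SAMPLE = """\
[ 74.592812] udevd[557]: timeout: killing '/sbin/modprobe -bv scsi:t-0x01' [741]
[ 74.600206] udevd[582]: timeout: killing '/sbin/modprobe -bv scsi:t-0x08' [736]
[ 74.611878] udevd[588]: timeout: killing '/sbin/modprobe -bv scsi:t-0x05' [786]
[ 132.414552] udevd[582]: '/sbin/modprobe -bv scsi:t-0x08' [736] terminated by signal 9 (Killed)
"""

if __name__ == "__main__":
    for alias, n in sorted(count_udev_timeouts(SAMPLE).items()):
        print(alias, n)
```

Feeding it the saved output of 'dmesg' from each boot would give a rough per-module count; the "terminated by signal 9" lines are deliberately not counted, since they report the outcome of a kill that was already logged.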
https://bugzilla.novell.com/show_bug.cgi?id=760274#c1
--- Comment #1 from Peter Keller
https://bugzilla.novell.com/show_bug.cgi?id=760274#c2
--- Comment #2 from Peter Keller
https://bugzilla.novell.com/show_bug.cgi?id=760274#c3
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=760274#c4
--- Comment #4 from Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=760274#c5
--- Comment #5 from Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=760274#c6
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=760274#c7
Hannes Reinecke
https://bugzilla.novell.com/show_bug.cgi?id=760274#c9
--- Comment #9 from Peter Keller
https://bugzilla.novell.com/show_bug.cgi?id=760274#c10
--- Comment #10 from Hannes Reinecke
https://bugzilla.novell.com/show_bug.cgi?id=760274#c11
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=760274#c14
--- Comment #14 from Bernhard Wiedemann
https://bugzilla.novell.com/show_bug.cgi?id=760274#c15
Benjamin Brunner
https://bugzilla.novell.com/show_bug.cgi?id=760274#c16
--- Comment #16 from Swamp Workflow Management