[Bug 495224] New: Improper handling of multipath volumes?
http://bugzilla.novell.com/show_bug.cgi?id=495224

Summary: Improper handling of multipath volumes?
Classification: openSUSE
Product: openSUSE 11.1
Version: Final
Platform: All
OS/Version: openSUSE 11.1
Status: NEW
Severity: Major
Priority: P5 - None
Component: Basesystem
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: nice@titanic.nyme.hu
QAContact: qa@suse.de
Found By: ---

Created an attachment (id=285908) --> (http://bugzilla.novell.com/attachment.cgi?id=285908)
Information about my disk system

User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; hu-HU; rv:1.9.0.8) Gecko/2009032600 SUSE/3.0.8-1.1.1 Firefox/3.0.8

I have a relatively complex disk layout, with many multipathed volumes for Xen domains (see the attachment). The bootloader tools in openSUSE 10.3 were able to handle this situation, while SLES 10 SP2 was not, and now openSUSE 11.1 is unable to handle it as well. For example, when I start YaST's bootloader configuration tool, it hangs in the "Checking boot loader phase..." When I run mkinitrd, it creates the RAM disks correctly but then hangs FOREVER, because one of its child processes (update-bootloader) hangs with perpetual 100% CPU usage. This means that I cannot even run a kernel update, because that process will hang forever. The only workaround is to restart the machine with the fibrechannel optics unplugged (using only the local /dev/sda disk, which contains the dom0) in order to update the kernel or the bootloader configuration.

All of my fibrechannel multipath volumes (serving as disks for the domUs) are partitioned; many of them contain bootable (active) partitions, and some of them even contain installed GRUB bootloaders.

In addition to the yast2 modules hanging, our storage system (Sun StorageTek 6140) sends error reports while yast2 is struggling. These messages mean that the server's multipath subsystem is unnecessarily changing the paths used to reach the fibrechannel volumes. Moreover, we always get these error messages during server startup!

My guess is that the root of this problem is that some disk-related subsystems (maybe the yast2 partitioner, the boot loader configuration, mkinitrd, and something during startup) try to access the /dev files representing the separate paths, which is incorrect behaviour. For example, I observed that accessing a partition on a separate path leads to an error message (the device is busy or exclusively used, or something like that).

Reproducible: Always

--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c1
--- Comment #1 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c2
--- Comment #2 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c3
--- Comment #3 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
Leon Wang
http://bugzilla.novell.com/show_bug.cgi?id=495224
User jeffm@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c4
Jeff Mahoney
http://bugzilla.novell.com/show_bug.cgi?id=495224
User jreidinger@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c5
Josef Reidinger
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c6
Tamás Németh
Do you have user-friendly names enabled for multipath? (From the topology it looks like you do.) If so, this is a duplicate of bug 470109. Please try disabling user-friendly names and see whether that helps (the fix is waiting for a maintenance update).
I disabled friendly names. Mkinitrd succeeded, but it took an unreasonably long time (~5 minutes), and yast bootloader took much longer (I didn't have the time to wait for it; maybe it hung forever). But:

- I still get the warning messages from our storage system during system reboot and, for example, during yast2 bootloader startup: https://bugzilla.novell.com/show_bug.cgi?id=493327#c7
- /etc/init.d/boot.multipath is still unable to remove multipaths; sometimes the kernel even does an immediate machine reset during the attempts: https://bugzilla.novell.com/show_bug.cgi?id=468826
- Some /etc/init.d startup scripts start running while the kernel is still collecting multipath data (see the image attached later).
- The YaST2 partition manager wants to manage separate paths to multipath devices, not only the combined devices (see the image attached later).

This whole multipath system seems weird to me.

http://bugzilla.novell.com/show_bug.cgi?id=470109
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c7
--- Comment #7 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c8
--- Comment #8 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User jreidinger@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c9
Josef Reidinger
http://bugzilla.novell.com/show_bug.cgi?id=495224
User aschnell@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c10
Arvin Schnell
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c11
Tamás Németh
What is wrong with the screenshot from comment #8?
sda is not part of any multipath device, so it's displayed with all its partitions. sdaa is included in a multipath device and thus only the disk without partitions is shown. That is the expected way. By double-clicking on the disks and selecting the Overview you can check that sda is not used by any device and that sdaa is used by a multipath device.
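The same distinction can be checked outside YaST. The following is only a rough sketch, assuming the kernel exposes a `holders` directory for each block device under `/sys/block` (the function names are hypothetical and not part of any YaST or multipath-tools API). It looks at the whole disk only, so a disk whose partitions are claimed individually would still be reported as unclaimed.

```python
import os


def holders(device, sysfs_block="/sys/block"):
    """Return the names of the block devices stacked on top of `device`.

    Assumption: sysfs lists them as entries of <sysfs_block>/<device>/holders.
    An unknown device or a missing directory yields an empty list.
    """
    holders_dir = os.path.join(sysfs_block, device, "holders")
    if not os.path.isdir(holders_dir):
        return []
    return sorted(os.listdir(holders_dir))


def is_claimed(device, sysfs_block="/sys/block"):
    """True if some stacked device (for example a dm-N multipath map) uses `device`."""
    return bool(holders(device, sysfs_block))


if __name__ == "__main__":
    # Device names taken from the comment above; adjust for the local system.
    for dev in ("sda", "sdaa"):
        used_by = holders(dev)
        print(dev, "is used by", used_by if used_by else "no stacked device")
```

If the assumption holds, a path device such as sdaa should list a dm-N entry, while the local sda should list nothing.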
OK, my /etc/multipath.conf now looks like this:

    defaults {
        user_friendly_names "no"
    }
    blacklist {
        devnode "^sda$"
    }

However, the LUN WWN / friendly name mappings are still present in /etc/multipath_bindings.

Yesterday I wanted to start the yast2 partitioner, but it had still not started after about twelve hours, so I restarted the whole machine, and today the yast2 partitioner was finally able to start. If you take a look at the attached image, you can see that three devices, which are otherwise paths to multipath volumes, can actually be managed! The three devices are /dev/sdce, /dev/sdcf and /dev/sdcg. They are not shown in the multipath topology; instead, 'multipath -d' displays this:

    carrier4:~ # multipath -d
    reload: 3600a0b80004808d200000b5548931bee n/a SUN,CSM200_R
    [size=600G][features=0][hwhandler=1 rdac][n/a]
    \_ round-robin 0 [prio=8][undef]
     \_ 7:0:0:9 sdx 65:112 [active][ghost]
     \_ 5:0:1:9 sdbg 67:160 [active][ghost]
    \_ round-robin 0 [prio=6][undef]
     \_ 5:0:0:9 sdr 65:16 [active][ready]
     \_ 7:0:1:9 sdce 69:32 [undef][ready]
    reload: 3600a0b80004808d200000b70489bf7f8 n/a SUN,CSM200_R
    [size=20G][features=0][hwhandler=1 rdac][n/a]
    \_ round-robin 0 [prio=12][undef]
     \_ 7:0:0:10 sdaa 65:160 [active][ready]
     \_ 5:0:1:10 sdbh 67:176 [active][ready]
    \_ round-robin 0 [prio=2][undef]
     \_ 5:0:0:10 sdt 65:48 [active][ghost]
     \_ 7:0:1:10 sdcf 69:48 [undef][ghost]
    reload: 3600a0b80004807f8000008cb489c01b3 n/a SUN,CSM200_R
    [size=300G][features=0][hwhandler=1 rdac][n/a]
    \_ round-robin 0 [prio=12][undef]
     \_ 5:0:0:11 sdu 65:64 [active][ready]
     \_ 7:0:1:11 sdcg 69:64 [undef][ready]
    \_ round-robin 0 [prio=2][undef]
     \_ 7:0:0:11 sdab 65:176 [active][ghost]
     \_ 5:0:1:11 sdbi 67:192 [active][ghost]

These three multipaths seem to be degraded somehow. We also use this fibrechannel storage system with Windows, without any problem.
The multipath topology also shows these affected volumes in an unusual state (consisting of only three devices instead of four):

    create: 3600a0b80004808d200000b5548931bee dm-67 SUN,CSM200_R
    [size=600G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
    \_ round-robin 0 [prio=8][active]
     \_ 7:0:0:9 sdx 65:112 [active][ghost]
     \_ 5:0:1:9 sdbg 67:160 [active][ghost]
    \_ round-robin 0 [prio=3][enabled]
     \_ 5:0:0:9 sdr 65:16 [active][ready]
    create: 3600a0b80004808d200000b70489bf7f8 dm-70 SUN,CSM200_R
    [size=20G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
    \_ round-robin 0 [prio=12][active]
     \_ 7:0:0:10 sdaa 65:160 [active][ready]
     \_ 5:0:1:10 sdbh 67:176 [active][ready]
    \_ round-robin 0 [prio=1][enabled]
     \_ 5:0:0:10 sdt 65:48 [active][ghost]
    create: 3600a0b80004807f8000008cb489c01b3 dm-79 SUN,CSM200_R
    [size=300G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
    \_ round-robin 0 [prio=6][enabled]
     \_ 5:0:0:11 sdu 65:64 [active][ready]
    \_ round-robin 0 [prio=2][enabled]
     \_ 7:0:0:11 sdab 65:176 [active][ghost]
     \_ 5:0:1:11 sdbi 67:192 [active][ghost]

For comparison, here is a volume in normal state:

    3600a0b80004808d200000b8948b281a6 dm-81 SUN,CSM200_R
    [size=20G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
    \_ round-robin 0 [prio=12][active]
     \_ 7:0:0:12 sdae 65:224 [active][ready]
     \_ 5:0:1:12 sdbj 67:208 [active][ready]
    \_ round-robin 0 [prio=2][enabled]
     \_ 5:0:0:12 sdv 65:80 [active][ghost]
     \_ 7:0:1:12 sdch 69:80 [active][ghost]
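With this many maps, spotting the degraded ones by eye is tedious. Below is a minimal sketch of a parser for the output layout quoted in this comment. It is not part of multipath-tools: the regular expressions, the function names, and the expected path count of four are assumptions taken from the listings above, and other multipath-tools versions print a different layout.

```python
import re
from collections import OrderedDict

# A path line in the layout quoted above looks like:
#   \_ 7:0:1:9 sdce 69:32 [undef][ready]
PATH_RE = re.compile(
    r"\\_\s+(\d+:\d+:\d+:\d+)\s+(\w+)\s+\d+:\d+\s+\[(\w+)\]\[(\w+)\]"
)
# Assumption: every map header carries a 33-character WWID starting with '3'.
WWID_RE = re.compile(r"\b(3[0-9a-f]{32})\b")


def parse_topology(text):
    """Return {wwid: [(hctl, device, dm_state, path_state), ...]}."""
    maps = OrderedDict()
    current = None
    for line in text.splitlines():
        path = PATH_RE.search(line)
        if path and current is not None:
            maps[current].append(path.groups())
            continue
        wwid = WWID_RE.search(line)
        if wwid:
            current = wwid.group(1)
            maps.setdefault(current, [])
    return maps


def short_maps(maps, expected=4):
    """WWIDs of maps that have fewer than `expected` paths."""
    return [wwid for wwid, paths in maps.items() if len(paths) < expected]


def undef_paths(maps):
    """(wwid, device) pairs whose device-mapper state is printed as 'undef'."""
    return [(wwid, dev) for wwid, paths in maps.items()
            for _, dev, dm_state, _ in paths if dm_state == "undef"]


# Excerpt assembled from the listings in this comment.
SAMPLE = r"""
create: 3600a0b80004808d200000b5548931bee dm-67 SUN,CSM200_R
[size=600G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=8][active]
 \_ 7:0:0:9 sdx 65:112 [active][ghost]
 \_ 5:0:1:9 sdbg 67:160 [active][ghost]
\_ round-robin 0 [prio=3][enabled]
 \_ 5:0:0:9 sdr 65:16 [active][ready]
reload: 3600a0b80004808d200000b70489bf7f8 n/a SUN,CSM200_R
[size=20G][features=0][hwhandler=1 rdac][n/a]
\_ round-robin 0 [prio=12][undef]
 \_ 7:0:0:10 sdaa 65:160 [active][ready]
 \_ 5:0:1:10 sdbh 67:176 [active][ready]
\_ round-robin 0 [prio=2][undef]
 \_ 5:0:0:10 sdt 65:48 [active][ghost]
 \_ 7:0:1:10 sdcf 69:48 [undef][ghost]
3600a0b80004808d200000b8948b281a6 dm-81 SUN,CSM200_R
[size=20G][features=1 queue_if_no_path][hwhandler=1 rdac][rw]
\_ round-robin 0 [prio=12][active]
 \_ 7:0:0:12 sdae 65:224 [active][ready]
 \_ 5:0:1:12 sdbj 67:208 [active][ready]
\_ round-robin 0 [prio=2][enabled]
 \_ 5:0:0:12 sdv 65:80 [active][ghost]
 \_ 7:0:1:12 sdch 69:80 [active][ghost]
"""

if __name__ == "__main__":
    parsed = parse_topology(SAMPLE)
    print("maps with fewer than 4 paths:", short_maps(parsed))
    print("paths in 'undef' state:", undef_paths(parsed))
```

Feeding it the saved output of the current topology and of 'multipath -d' separately keeps the two views apart, since both listings use the same WWIDs.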
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c12
--- Comment #12 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User aschnell@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c13
Arvin Schnell
http://bugzilla.novell.com/show_bug.cgi?id=495224
User hare@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c14
Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c15
--- Comment #15 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c16
--- Comment #16 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c17
Tamás Németh
Please attach the output of 'lsscsi' and '/var/log/messages'.
Plus you'll have to ensure that the 'host type' of the storage array is set to 'Linux'. Otherwise weird things will happen.
I've attached the files you asked for. You seem to know this Sun StorageTek 6140 (rebranded LSI Engenio 3994) storage system (the system integrator who brought it to us had no real experts ;-). In fact, I set the host type for every Linux host to 'AIX failover' in 2008. My reason for doing this was that I thought the storage system wouldn't let the host change paths in order to achieve round-robin functionality or failover. I thought that the host type 'Linux' was for hosts without multipath capability and with a single path to the storage system. What will change if I set the host types to Linux? Won't I experience performance degradation and lose failover capability?
http://bugzilla.novell.com/show_bug.cgi?id=495224
User hare@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c18
Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c19
--- Comment #19 from Tamás Németh
The suggestion here is to set the host type to 'Linux' (if you are not using the 'rdac' hardware handler) or 'Solaris' (if you are using the 'rdac' hardware handler). Otherwise unexpected behaviour will occur as you've just seen.
For this reason I would suggest using 'Solaris' as host type.
We do need the 'rdac' hardware handler here, though, as multipath has to send switch-over commands.
Which one of these do you suggest:

- Solaris (with Traffic Manager)
- Solaris (with Veritas DMP or other)
http://bugzilla.novell.com/show_bug.cgi?id=495224
User hare@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c20
--- Comment #20 from Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c21
--- Comment #21 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User hare@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c22
--- Comment #22 from Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c23
--- Comment #23 from Tamás Németh
Yes, please always use hosttype 'Linux'. This actually enables 'AVT', so it's the preferred choice here. And miles better than AIX in any configuration. AIX has a really weird SCSI behaviour which should be avoided here.
So far, I was able to figure out that the 'Linux' host type disables the AVT mode, which means it expects the host to do explicit failover (e.g. RDAC, which seems to be supported by openSUSE 11.1's kernel) instead of implicit failover. 'Solaris (with Veritas DMP or other)' seems to be the right choice for implicit failover on Linux (older systems). Maybe I should try that instead of 'AIX failover', because when I set the host type to 'Linux', my hosts are almost always unable to finish the boot process!
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c24
Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c25
--- Comment #25 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c26
--- Comment #26 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c27
--- Comment #27 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c28
--- Comment #28 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c29
--- Comment #29 from Tamás Németh
http://bugzilla.novell.com/show_bug.cgi?id=495224
User hare@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c30
Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c31
Tamás Németh
Have you tried to press 'Control-C' when the boot process stalled?
Yes, pressing Ctrl-C made the system finish the boot process (unusually quickly), but it behaved as if some of the filesystems were missing, or something like that. I was even unable to log in, because it complained about some missing service component. Please take a look at the attached image.
http://bugzilla.novell.com/show_bug.cgi?id=495224
User hare@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c32
Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c33
Tamás Németh
Please look in /dev/.udev/failed. It looks as if some events couldn't be processed before the timeout.
Sorry, I'm not a udev expert, so I don't know what /dev/.udev/failed is for. All I can see there is a broken symbolic link:

    carrier4:/dev/.udev/failed # ls -l /dev/.udev/failed
    total 0
    lrwxrwxrwx 1 root root 46 2009-07-03 13:01 \x2fdevices\x2fplatform\x2fmicrocode\x2ffirmware\x2fmicrocode -> /devices/platform/microcode/firmware/microcode
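The odd-looking file name in that listing is just the device path with every '/' written as the escape `\x2f`, so the entry names the device whose event failed. As a small illustration (a sketch with a hypothetical helper name, not a udev API), the escapes can be decoded like this:

```python
import re

# Matches escapes of the form \xNN, as seen in the listing above.
_ESCAPE_RE = re.compile(r"\\x([0-9a-fA-F]{2})")


def decode_udev_name(name):
    """Turn '\\x2fdevices\\x2fplatform' style names back into '/devices/platform'."""
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), name)


if __name__ == "__main__":
    entry = r"\x2fdevices\x2fplatform\x2fmicrocode\x2ffirmware\x2fmicrocode"
    print(decode_udev_name(entry))
```

The decoded name is identical to the link target shown by ls, and its length matches the size of 46 reported for the symlink.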
http://bugzilla.novell.com/show_bug.cgi?id=495224
User hare@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c34
Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c35
Tamás Németh
... And which is exactly what I was looking for. So there was a microcode uevent which couldn't be handled.
Is this why multipath volumes are handled improperly?
http://bugzilla.novell.com/show_bug.cgi?id=495224
User hare@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c36
Hannes Reinecke
Finally I set the host type to 'Linux' (non-AVT, RDAC supported), but the xen kernel always(?) fails to load the system. It stalls at a certain point (see an image attached later). The default kernel also stalls sometimes (see another attached image), but sometimes succeeds in loading the system and then:
And this doesn't have anything to do with multipathing.
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c37
--- Comment #37 from Tamás Németh
No, but it's the reason for
Finally I set the host type to 'Linux' (non-AVT, RDAC supported), but the xen kernel always(?) fails to load the system. It stalls at a certain point (see an image attached later). The default kernel also stalls sometimes (see another attached image), but sometimes succeeds in loading the system and then:
And this doesn't have anything to do with multipathing.
Do you mean that the CPU microcode is somehow influenced by the configuration of the external (fibrechannel) storage system?
http://bugzilla.novell.com/show_bug.cgi?id=495224
User hare@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c38
--- Comment #38 from Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=495224
User nice@titanic.nyme.hu added comment
http://bugzilla.novell.com/show_bug.cgi?id=495224#c39
--- Comment #39 from Tamás Németh
No, I mean the issue of a system stalling during boot and any fibrechannel configuration problems are totally unrelated. Hence the former warrants a new bugzilla entry.
OK, I understand, but as you can read, the total boot stalling only happens with a certain (external) storage setting (and with the xen kernel). With a different storage setting even the xen kernel is able to boot the system. My problem is that multipath handling is unreliable, as described in this report and in others also submitted by me.
https://bugzilla.novell.com/show_bug.cgi?id=495224
https://bugzilla.novell.com/show_bug.cgi?id=495224#c40
Hannes Reinecke