[Bug 524018] New: [multipathd] Race condition when new devices are discovered, but SysFS entries don't have all the data.
http://bugzilla.novell.com/show_bug.cgi?id=524018

Summary: [multipathd] Race condition when new devices are discovered, but SysFS entries don't have all the data.
Classification: openSUSE
Product: openSUSE 10.3
Version: Final
Platform: All
OS/Version: openSUSE 10.3
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Basesystem
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: konrad@virtualiron.com
QAContact: qa@suse.de
Found By: ---

Created an attachment (id=306674)
 --> (http://bugzilla.novell.com/attachment.cgi?id=306674)
Back-port from SLES10. Re-introduce add_wait.

User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.112) Gecko/20080201 Firefox/2.0.0.12

Here are the details from the e-mail correspondence on device-mapper mailing list with me and Hannes Reinecke:
Could also be a race condition that is present in SLES10 + RHEL5 kernels, where the SysFS directories are created (and the udev event is sent out), but the kernel hasn't populated the SysFS directories. So when multipathd tries to read them it finds no pertinent information and shoves the device off to the 'orphan' state.
Really? With SLES10? Have you actually observed this?
With SLES10 SP2 to be exact. It wasn't an issue with SLES10 since the initial patch was there. The equipment I used to test this was an AX150FC with failed batteries (so no cache writes) and with a failed controller so it would run extra slow.
We're running multipath _after_ udev has processed the event.
Right, the one where the SysFS directory is created. Then multipathd reads the data. I remember posting it here and mentioning that this problem exists on SLES10SP2 and RHEL5 but not on the upstream kernels.
And udev already waited for sysfs, so we should be safe there.
Not so. udev gets the SCSI creation uevent, creates /dev/sdX, and so on. But the kernel hasn't yet fully populated the SysFS entries (so /sys/block/sdX/device/vendor does exist, but has no data in it).
It might be applicable to mainline multipath-tools, but
It really depends on how the SysFS directories are populated and how slow the SCSI target is.
the SLES10 one ... I'd be surprised.
Well, reasonably surprised. multipath keeps on throwing an amazing number of issues still.
Do you have more information here?
Here is the patch along with a detailed description.
The "multipath-tools-add-wait" patch is a backport/rewrite of the wait_for_file routine used in the sysfs_get_[vendor|model|rev] macros. SLES10 SP2 back-ported a lot of the upstream multipath features, and one of those changes got rid of this function. I haven't yet found out why it was deleted. It looks like a mistake, as the upstream kernel _should_ cause the same set of problems with multipath. [update: Upstream kernel has this fixed]
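To make the idea of such a wait concrete, here is a minimal sketch of a bounded retry around a SysFS attribute read. It is not the code from the attached patch and not the original wait_for_file routine: the function name read_attr_with_wait, its parameters, and the retry policy are all made up for illustration. The only behaviour taken from the discussion is that an attribute file can exist and still be empty for a short while.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper (NOT the multipath-tools wait_for_file()):
 * read the first line of a sysfs attribute, retrying while the file
 * is missing or still empty. Returns 0 once data is read, or -1 if
 * the attribute never filled in within 'retries' extra attempts.
 */
static int read_attr_with_wait(const char *path, char *buf, size_t len,
                               int retries, long delay_ms)
{
    assert(path != NULL && buf != NULL && len > 1);

    struct timespec delay = {
        .tv_sec  = delay_ms / 1000,
        .tv_nsec = (delay_ms % 1000) * 1000000L
    };

    for (int attempt = 0; attempt <= retries; attempt++) {
        FILE *fp = fopen(path, "r");
        if (fp != NULL) {
            char *line = fgets(buf, (int)len, fp);
            fclose(fp);
            if (line != NULL) {
                buf[strcspn(buf, "\n")] = '\0';  /* strip trailing newline */
                if (buf[0] != '\0')
                    return 0;                    /* attribute has data */
            }
        }
        if (attempt < retries)
            nanosleep(&delay, NULL);             /* give the kernel time */
    }

    buf[0] = '\0';
    return -1;  /* caller decides what to do, e.g. orphan the path */
}
```

A caller would point this at something like /sys/block/sdX/device/vendor with a small retry count; how long to wait on very slow storage is a tuning question this sketch does not answer.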
The reason a wait is necessary is due to the way the kernel sends the event. When a SCSI device is added the SCSI subsystem pursues this path:
_sysfs_add_sdev:
  calls device_add ... [ '/devices/platform/host16/session6/target16:0:0/16:0:0:17' ] uevent
    bus_attach_device
      bus_for_each_drv
        driver_probe_device
          sd_probe [ '/class/scsi_disk/16:0:0:17' ] uevent
            add_disk [ '/block/sdai' ]   [ Here multipath starts its job ]
  calls class_device_add ... [ '/class/scsi_device/16:0:0:17' ] uevent
    sg_add: [ '/class/scsi_generic/sg35' ] uevent
  done with device_add, and now we add the attributes:
  --> scsi_sysfs_sdev_attrs[i].vendor, model, rev <-- THIS is the problem.
[Multipathd, at the 'block/sdai' event, has already started analyzing the data. It reads the SysFS, but 'vendor' and 'model' have no data yet, so multipathd discards them and orphans the devices. That data only becomes available once 'device_add' is finished.]
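The distinction that matters here is between an attribute file that does not exist and one that exists but holds no data yet. The following sketch classifies a path into those states. The enum attr_state and the function probe_attr are hypothetical names for illustration and do not come from multipath-tools; treating a whitespace-only file as empty is also an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical states for a sysfs attribute as seen by a reader. */
enum attr_state {
    ATTR_MISSING,  /* file is not there (or not readable) */
    ATTR_EMPTY,    /* file exists but has no non-blank data yet */
    ATTR_READY     /* file contains at least one non-blank character */
};

/* Illustrative probe: report which of the three states 'path' is in. */
static enum attr_state probe_attr(const char *path)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return ATTR_MISSING;

    enum attr_state state = ATTR_EMPTY;
    int c;
    while ((c = fgetc(fp)) != EOF) {
        if (!isspace(c)) {      /* any real character means data arrived */
            state = ATTR_READY;
            break;
        }
    }
    fclose(fp);
    return state;
}
```

In the scenario described above, a reader at the 'block/sdai' event would see the second state for 'vendor' and 'model', which is the case where retrying is worthwhile instead of orphaning the path right away.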
Ah. Hmm. Seems you are correct. I'll have to apply the patch, then. Fancy opening a bugzilla for it?

Reproducible: Always

Steps to Reproduce:
1. Use a very slow SCSI storage
2. Observe.

Actual Results:
1. See machine go in flames!

--
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugzilla.novell.com/show_bug.cgi?id=524018
Hannes Reinecke
http://bugzilla.novell.com/show_bug.cgi?id=524018
User hare@novell.com added comment
http://bugzilla.novell.com/show_bug.cgi?id=524018#c1
Hannes Reinecke