https://bugzilla.novell.com/show_bug.cgi?id=795738

https://bugzilla.novell.com/show_bug.cgi?id=795738#c0

Summary: [NTAP-358169] SLES 10.4 with LifeKeeper 8.0 cluster shuts down iSCSI connections before cluster
Classification: openSUSE
Product: openSUSE 11.4
Version: Final
Platform: x86-64
OS/Version: SLES 10
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Other
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: luis.salmeron@netapp.com
QAContact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created an attachment (id=518102)
 --> (http://bugzilla.novell.com/attachment.cgi?id=518102)
syslogs for the original volume owner

User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C)

Four SLES 10 SP4 hosts are running in a clustered environment with SteelEye LifeKeeper 8.0. They are using MPP as a failover driver and iSCSI as a protocol. A fifth host is running IO to the cluster across an NFS connection.

When trying to gracefully fail a node (executing a reboot command), IO will fail because of a read or write error. The preferred behavior is that the IO will wait up to 10 minutes for the resources to become enabled on another cluster node. The logs on the host being rebooted show that the connections to the devices are being disconnected first, then IO is being rejected to the devices, and LifeKeeper never attempts to shut down.

This behavior only occurs when the nodes are being gracefully shut down. When the nodes are non-gracefully shut down (the host is power cycled), this error does not occur. This error is also avoided when a 'service lifekeeper stop' is issued before the reboot command is executed.

Reproducible: Always

Steps to Reproduce:
1. Create a LifeKeeper 8.0 cluster using 4 hosts with iSCSI connections
2. Each cluster resource should be shared across all nodes and should have NFS and HA qualities.
   The resources should all have a virtual IP address assigned to them
3. On a fifth host, create an NFS connection to the cluster virtual IP addresses and run IO to the cluster resources
4. On one of the cluster nodes, issue a reboot command

Actual Results:
The host that is rebooted rejects IO to the volumes on the storage array before LifeKeeper or the NFS connections go down, which causes read/write errors on the host running the IO script.

Expected Results:
The resource ownership should change to a different cluster node, and the IO host should wait for a 10-minute timeout before failing IO. The resources should fail over before the timeout, and IO should resume from the secondary host. The IO should not hit any read or write errors, because the connection to the volumes on the first host should not go down before NFS or LifeKeeper does.

There are two workarounds that we have found. One is to issue the 'service lifekeeper stop' command before executing a reboot. The other is to power cycle the host instead of issuing a reboot command.

The LifeKeeper packages installed on the hosts are:
steeleye-lkRAW-8.0.0-5104
steeleye-lkSuSE-8.0.0-5104
steeleye-lkCCISS-8.0.0-5104
steeleye-lk-8.0.0-5104
steeleye-lkIP-8.0.0-5104
steeleye-lkDMMP-7.3.0-2
steeleye-lkapi-client-8.0.0-5104
steeleye-lkLIC-8.0.0-5104
steeleye-lkGUI-8.0.0-5104
steeleye-lkNFS-7.5.0-3640
steeleye-lkapi-8.0.0-5104
steeleye-lkDR-8.0.0-5104
steeleye-lkMAN-8.0.0-5104
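
For reference, the first workaround (stopping LifeKeeper before the reboot) could be wrapped in a small script so the order is always enforced. This is only a sketch under assumptions: the function name `graceful_reboot` and the injectable `run` parameter are hypothetical, and the only commands used are the two named in this report ('service lifekeeper stop' and 'reboot').

```python
import subprocess

# Order matters: stop LifeKeeper first so that it has a chance to release
# its resources, and only then reboot the node.
# Both commands are taken from the workaround described in this report.
GRACEFUL_REBOOT_STEPS = [
    ["service", "lifekeeper", "stop"],
    ["reboot"],
]


def graceful_reboot(run=subprocess.check_call, steps=GRACEFUL_REBOOT_STEPS):
    """Run each step in order and return the list of commands executed.

    `run` is a hypothetical injection point (defaults to
    subprocess.check_call) so the sequence can be dry-run or tested.
    check_call raises on a non-zero exit status, so if stopping
    LifeKeeper fails, the reboot command is never issued.
    """
    executed = []
    for cmd in steps:
        run(cmd)
        executed.append(cmd)
    return executed
```

Calling `graceful_reboot(run=print)` only prints the two commands in order, which is a safe way to check the sequence; calling `graceful_reboot()` as root on a cluster node would actually stop LifeKeeper and reboot the host.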