On Tuesday 10 October 2023, Martin Wilck via openSUSE Factory wrote:
On Tue, 2023-10-10 at 07:53 +1300, Michael Hamilton wrote:
The main thing that bit me was that the order of /dev/sd* is no longer fixed, which caused problems for my boot configuration and drive configuration scripts (hdparm/smartctl post-boot scripts).
That shouldn't have anything to do with the recent change. The order of /dev/sd* has been non-deterministic at least since kernel commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for asynchronous probing"), which was in upstream kernel 5.3, 4 years ago. This kernel has been in Leap since 15.2. But even before that, the order has never been truly deterministic, even if it might have seemed to be the case on some systems. The order has always been dependent on the order of SCSI drivers loaded and the order of controllers on the PCI bus.
Bottom line: don't use /dev/sd* for referring to specific devices in configuration files, ever.
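To see which kernel name each persistent name currently points at, something like the following sketch can help. It only assumes that the persistent names are symlinks under /dev/disk/by-id; the function name and the optional directory argument are made up for illustration.

```shell
# list_persistent_names [DIR]
# Print every symlink in DIR (default: /dev/disk/by-id) together with the
# device node it resolves to on this boot. Hypothetical helper, not part
# of udev or sg3_utils.
list_persistent_names() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        [ -L "$link" ] || continue      # skip anything that is not a symlink
        printf '%s -> %s\n' "$link" "$(readlink -f "$link")"
    done
}
```

Running `list_persistent_names` after a few boots shows whether the /dev/sd* target of a given by-id name has moved; the by-id name itself is the one to put into configuration files.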
0) My symptom is being unable to boot (I think people reporting slow boots are experiencing a different problem).
1) Make sure /dev/sd* isn't used in /etc/fstab.
2) Check /etc/default/grub, make sure GRUB_DISABLE_LINUX_UUID is commented out (normal case).
3) Check /boot/grub2/grub.cfg and make sure it doesn't reference /dev/sd* in other ways.
4) Track down any scripts that set parameters for any /dev/sd* devices.
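Steps 1 to 3 above could be roughly automated with a grep over the files mentioned. This is only a sketch: the function name is hypothetical, and the pattern simply looks for /dev/sdX on lines where it is not preceded by a '#', so it may miss or over-report unusual cases.

```shell
# find_sd_refs FILE...
# Print file:line:content for every line that mentions /dev/sd<letter>
# before any '#' comment marker. Returns non-zero if nothing is found.
find_sd_refs() {
    grep -nHE '^[^#]*/dev/sd[a-z]' "$@"
}
```

For example, `find_sd_refs /etc/fstab /etc/default/grub /boot/grub2/grub.cfg` should print nothing on a system that refers to its disks only by UUID or other persistent names.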
Again, the recent sg3_utils change has nothing to do with /dev/sd*. It only removes certain symlinks under /dev/disk/by-id.
A lot of scripts/configs on TW have been working fine for years, so whatever has perturbed the allocation of /dev/sd* is recent, but it might not have anything to do with sg3_utils. It was just the most recent major announcement that mentioned a similar topic.
When dealing with #4 above, I'm not sure what hdparm and smartctl accept, so for the moment I'm using code that extracts the /dev/sd name from the output of the lsscsi command.
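An alternative to parsing lsscsi output, sketched under the assumption that the tools in question simply open whatever device path they are given (so a /dev/disk/by-id symlink should do): enumerate the whole-disk symlinks and pass those along. The function name is hypothetical, and the ata-/nvme- prefixes and -partN suffix are the usual udev naming conventions rather than anything specific to this thread.

```shell
# whole_disks [DIR]
# List by-id symlinks in DIR (default: /dev/disk/by-id) that refer to
# whole disks, i.e. ata-* and nvme-* names without a -partN suffix.
whole_disks() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/ata-* "$dir"/nvme-*; do
        [ -L "$link" ] || continue          # unmatched glob or not a symlink
        case "$link" in
            *-part[0-9]*) continue ;;       # partition, not a whole disk
        esac
        printf '%s\n' "$link"
    done
}

# Example use (needs root and real hardware, so left commented out):
# whole_disks | while IFS= read -r disk; do smartctl -H "$disk"; done
```

A script that needs one particular drive would hard-code that drive's by-id name instead of looping, which avoids depending on the /dev/sd* order at all.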
There is a discussion going on in the forums about how to restore stability to the ordering (for those that don't want to change or want a short term fix), see:
https://forums.opensuse.org/t/problem-with-disks-order-after-snapshot-202309...
Which suggests adding scsi_mod.async_probe=0 to the bootline. Would this be the same for nvme, or does scsi cover that? The suggestion seems to work.
Strange. This is wrong syntax. The syntax of the parameter is "scsi_mod.disable_async_probing=<driver>". See https://www.suse.com/support/kb/doc/?id=000018449
Thanks. I've found that the stable assignment did not survive a cold boot (maybe I was just "lucky"). So, yes, this doesn't work.

I've tried the syntax you suggested. I didn't know what to put for the driver; after Googling, I used:

    % udevadm info -a -n /dev/sda | grep -oP 'DRIVERS?=="\K[^"]+'
    sd
    ahci
    pcieport

And added the following to the boot line:

    scsi_mod.disable_async_probing=sd,ahci,pcieport

This appeared to work for a couple of cold boots; the drives all got mapped as they used to be. But it failed to work on a reboot without powering down, and then it worked again on the following cold boot. So I don't think this works, or at least not reliably.
This parameter exists only on Leap, not on TW (and probably not on ALP). nvme is totally different, there is no corresponding parameter.
As you suggested, adding udev.scsi_symlink_src=LTVS udev.scsi_id_serial_src=LTVS to the bootline also seems to result in the old ordering being used.
This finding is even stranger. The /dev/sd* names are assigned by the kernel. sg3_utils or udev have nothing to do with it. The only reason I can think of is that the timing during boot is somehow affected by the different udev rules.
Probably things were stable because I was not cold booting, or was "lucky".
With both suggestions above, I haven't done enough boots to be sure of stability of the /dev/sd* assignments. But either is worth a go to get back into the system and fix the /dev/sd* references.
Sorry, that won't happen, see above. I'm surprised that you haven't run into issues with this approach years ago. The fact that this hits you now must be some weird coincidence that must be inspected further. Please open a bug and assign it to me.
Bug: https://bugzilla.opensuse.org/show_bug.cgi?id=1216070 I didn't attribute the bug to sg3_utils, I don't think we're sure of that.
Thanks, Martin
Thanks for explaining all the above. I will see if I can try an older kernel (just out of interest, I've already expunged /dev/sd from being referenced persistently). Michael