SCSI device identification and SCSI symlink generation in sg3_utils 1.48
With the forthcoming update of sg3_utils to v1.48 (currently in Base:System, to be submitted to Factory soon), SCSI device identification and SCSI symlink generation by udev will change.

TL;DR: unreliable device identifiers will not be used by default any more, and /dev/disk/by-id symlinks for such identifiers will not be created by default, either. The policy can be changed by editing the udev rules in 00-scsi-sg3_config.rules.

Please file bugs and / or notify me if you have trouble with this new identification scheme. Of course, while this is still in Base:System, we can also discuss the general approach for Factory, and the defaults to be applied.

Long version:

It is well known that referring to disk devices by their device node (like /dev/sda) is unreliable. In most situations, it is recommended to refer to devices by their content, i.e. using file system UUIDs or labels. Sometimes, however, the devices must be referred to by hardware properties. For this purpose, udev maintains various symlinks under /dev/disk/by-id. This post is specifically about SCSI symlinks, which have the format /dev/disk/by-id/scsi-XYZ.

If you don't use dm-multipath, and don't use hardware IDs to refer to any disks or partitions anywhere, you can stop reading here.

SCSI devices can be identified in different ways by the contents of SCSI VPD (Vital Product Data) pages. In VPD 83 (device identification), the following identifiers may occur (see SCSI spec SPC-6 §7.7.6 for details):

 a) NAA registered extended: 36... (+ 31 hex digits)
 b) NAA registered:          35... (+ 15 hex digits)
 c) NAA extended:            32... (+ 15 hex digits)
 d) EUI-64:                  2...  (+ 16 hex digits)
 e) SCSI name string:        8...  (+ utf-8 string)
 f) NAA local ("L"):         33... (+ 15 hex digits)
 g) T10 vendor-ID ("T"):     1...  (+ string)
 h) vendor-specific ("V"):   0...  (+ string)

Furthermore, we have

 i) serial number ("S"):     S...  (+ vendor/product/serial string)

The list above is sorted roughly top-down by the reliability (uniqueness) of these identifiers.
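To make the prefix scheme concrete, here is a minimal shell sketch that classifies a scsi-* symlink name by its leading characters, following the list above. The function name classify_scsi_id is made up for illustration and is not part of sg3_utils; the "reliable" / "unreliable" labels simply mirror the a)-e) versus f)-i) split used in this post.

```shell
#!/bin/sh
# Hypothetical helper (not part of sg3_utils): map the name of a
# /dev/disk/by-id/scsi-* symlink to the identifier type from the list above.
classify_scsi_id() {
    name=${1##*/}          # strip any directory part
    name=${name%-part*}    # strip a partition suffix such as "-part2"
    case $name in
        scsi-36*) echo "a) NAA registered extended (reliable)" ;;
        scsi-35*) echo "b) NAA registered (reliable)" ;;
        scsi-32*) echo "c) NAA extended (reliable)" ;;
        scsi-33*) echo "f) NAA local, L (unreliable)" ;;
        scsi-2*)  echo "d) EUI-64 (reliable)" ;;
        scsi-8*)  echo "e) SCSI name string (reliable)" ;;
        scsi-1*)  echo "g) T10 vendor-ID, T (unreliable)" ;;
        scsi-0*)  echo "h) vendor-specific, V (unreliable)" ;;
        scsi-S*)  echo "i) serial number, S (unreliable)" ;;
        *)        echo "unknown"; return 1 ;;
    esac
}
```

Running it over the links on a given system, for example with "for l in /dev/disk/by-id/scsi-*; do classify_scsi_id "$l"; done", gives a quick overview of which identifier types are actually in use there.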
The udev rules select the most reliable available identifier from the above list to set the ID_SERIAL udev property, which is used e.g. by dm-multipath to identify SCSI devices which represent different paths to the same physical volume.

Furthermore, the udev rules create /dev/disk/by-id/scsi-$ID symlinks for all identifiers from the above list that are defined for a given device. This makes sure that whatever identifier a user has chosen for referring to her disks will be available as symlink under /dev/disk/by-id.

However, this also has severe disadvantages. The identifiers f)-i) are not reliable; in other words, there can be multiple devices in the same system that use the same identifier, although they don't represent the same physical disk. This applies to i) in particular: some large storage arrays use the same serial number for all LUNs they expose. Using such identifiers for multipath can cause fatal data corruption. Moreover, the creation of these symlinks during boot can slow down booting a lot, because thousands of devices may be contending for the same symlink. Lastly, not creating useless symlinks will tidy up /dev/disk/by-id and make it easier to read.

Therefore sg3_utils 1.48 changes the policy for SCSI device identification. The policy can be customized in the udev rule file "00-scsi-sg3_config.rules" (to change the policy, copy this file from /usr/lib/udev/rules.d to /etc/udev/rules.d first).

For determining ID_SERIAL, only a)-e) above are tried. If users need to use additional identifiers from the list above, they can set the udev property .SCSI_ID_SERIAL_SRC (leading "." is intentional) to any combination of the capital letters in quotes in the list above. E.g. "ST" would mean to take g) and i) into account. To restore the previous behavior, set ENV{.SCSI_ID_SERIAL_SRC}="LTVS". The current upstream default is "T", so that g) above is still enabled. To disable all "unreliable" IDs, set it to the empty string.
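The copy step described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name copy_sg3_config is made up, the optional directory arguments exist purely so it can be tried on scratch directories, and the exact contents of the shipped rules file are not shown here, so the edit itself is left to an editor.

```shell
#!/bin/sh
# Sketch: copy the packaged rules file to /etc so that local edits
# override the packaged policy. An existing local copy is never overwritten.
copy_sg3_config() {
    src_dir=${1:-/usr/lib/udev/rules.d}
    dst_dir=${2:-/etc/udev/rules.d}
    file=00-scsi-sg3_config.rules
    if [ -e "$dst_dir/$file" ]; then
        echo "$dst_dir/$file already exists, leaving it alone" >&2
        return 0
    fi
    cp -- "$src_dir/$file" "$dst_dir/$file"
}

# Typical use, as root:
#   copy_sg3_config
#   then edit /etc/udev/rules.d/00-scsi-sg3_config.rules, e.g. to set
#     ENV{.SCSI_ID_SERIAL_SRC}="LTVS"
#   and make udev re-read its rules and re-process block devices:
#   udevadm control --reload
#   udevadm trigger --subsystem-match=block --action=change
```

Whether a reload and trigger is enough on a given system, or whether a reboot is needed for the new policy to apply everywhere, is an assumption worth checking; a reboot is the simplest way to be sure.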
If udev encounters a device for which ID_SERIAL can't be assigned, it prints a warning to the journal.

Symlink generation is controlled by the variable .SCSI_SYMLINK_SRC. It can also be set to any combination of "LTVS", with the same meaning of the letters as above, and controls which scsi-... symlinks udev will create under /dev/disk/by-id/, *in addition* to the symlink /dev/disk/by-id/scsi-$ID_SERIAL, which will always be created if ID_SERIAL could be determined. Again, to restore the previous behavior of the udev rules, set ENV{.SCSI_SYMLINK_SRC}="LTVS". The default is the empty string.

Note 1: The symlink generation applies to both disks and partitions. Partition symlinks have a "-part$P" string appended to the name of the symlink of their parent disk.

Note 2: Modern SCSI devices usually support one or more of the "reliable" identifiers a)-e) above. I expect that only very few old devices really need one of f)-i). Inform me if you find counter-examples.

Note 3: ATA and USB devices have additional identifiers generated by the transport (/dev/disk/by-id/ata-* and /dev/disk/by-id/usb-*) which are unaffected by this change. These devices often don't expose proper SCSI IDs, but they actually don't need them because the ATA or USB identifiers are unique.

Regards,
Martin
On Fri, 2023-09-22 at 00:22 +0200, Martin Wilck via openSUSE Factory wrote:
With the forthcoming update of sg3_utils to v1.48 (currently in Base:System, to be submitted to Factory soon), SCSI device identification and SCSI symlink generation by udev will change.
TL;DR: unreliable device identifiers will not be used by default any more, and /dev/disk/by-id symlinks for such identifiers will not be created by default, either. The policy can be changed by editing the udev rules in 00-scsi-sg3_config.rules.
To make sure this information isn't buried in the mailing list, I've created https://en.opensuse.org/SDB:SCSI_Device_Identification

I've added kernel boot parameters as a means for rescue, in case this should cause regressions (I don't expect it will, but well...). The sg3_utils 1.48 update for Factory is planned for week 41.

Regards
Martin
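Later in this thread the rescue parameters are quoted as udev.scsi_symlink_src=LTVS and udev.scsi_id_serial_src=LTVS. As a minimal sketch, the following shell function checks whether such a parameter is present on the kernel command line of the running system. The function name cmdline_value is made up for illustration; the optional second argument only exists so the function can be tried on a sample string.

```shell
#!/bin/sh
# Sketch: print the value of a name=value kernel parameter.
# Reads /proc/cmdline unless a command line string is passed as $2.
cmdline_value() {
    name=$1
    cmdline=${2-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            "$name"=*)
                printf '%s\n' "${word#*=}"
                return 0
                ;;
        esac
    done
    return 1    # parameter not present
}

# Usage on a running system:
#   cmdline_value udev.scsi_symlink_src || echo "not set"
```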
On Friday 29 September 2023, Martin Wilck via openSUSE Factory wrote:
On Fri, 2023-09-22 at 00:22 +0200, Martin Wilck via openSUSE Factory wrote:
With the forthcoming update of sg3_utils to v1.48 (currently in Base:System, to be submitted to Factory soon), SCSI device identification and SCSI symlink generation by udev will change.
TL;DR: unreliable device identifiers will not be used by default any more, and /dev/disk/by-id symlinks for such identifiers will not be created by default, either. The policy can be changed by editing the udev rules in 00-scsi-sg3_config.rules.
To make sure this information isn't buried in the mailing list, I've created https://en.opensuse.org/SDB:SCSI_Device_Identification
I've added kernel boot parameters as a means for rescue, in case this should cause regressions (I don't expect it will, but well...). The sg3_utils 1.48 update for Factory is planned for week 41.
Regards Martin
Bumping this important thread.

I think the chickens are coming home to roost on this change; there seem to be quite a few people reporting boot issues.

Michael
Yep. I'm hitting problems with multi-booting with Leap 15.6 now running grub/os-prober . . . selecting one line item in grub and getting another one booted. Or, line item systems booting to emergency mode . . . . Multiple running of grub2-mkconfig seems to not bring consistent and orderly booting from among only 7 installs listed--some work, some do not--in the same drive, etc. Where would I add my name to a bug report? Or, has Michael filed one already that I can add my vote to???
On Sunday 08 October 2023, Fritz Hudnut wrote:
Yep. I'm hitting problems with multi-booting with Leap 15.6 now running grub/os-prober . . . selecting one line item in grub and getting another one booted. Or, line item systems booting to emergency mode . . . .
Multiple running of grub2-mkconfig seems to not bring consistent and orderly booting from among only 7 installs listed--some work, some do not--in the same drive, etc.
Where would I add my name to a bug report? Or, has Michael filed one already that I can add my vote to???
When looking at my own TW system and at the described change, I didn't think this was a bug. I had long since switched to labels in /etc/fstab. For me, only grub2 was causing any issues. That was due to a setting I had changed that forced it to use /dev/sda2 instead of the UUID, so that's down to me.

So I haven't raised a bug. I haven't looked to see if anyone else has; I would be surprised if someone hasn't, given all the fallout reported here.

Today my boot-time script that runs hdparm and smartctl tried to set the wrong drives to idle. I had hardcoded /dev/sd*, so I guess I'm not out of the woods yet.

Michael
On Sun, 2023-10-08 at 00:09 +0000, Fritz Hudnut wrote:
Yep. I'm hitting problems with multi-booting with Leap 15.6 now running grub/os-prober . . . selecting one line item in grub and getting another one booted. Or, line item systems booting to emergency mode . . . .
Multiple running of grub2-mkconfig seems to not bring consistent and orderly booting from among only 7 installs listed--some work, some do not--in the same drive, etc.
Where would I add my name to a bug report? Or, has Michael filed one already that I can add my vote to???
The changes that I've been talking about haven't been submitted to Leap 15.6. You can open a new bug; if it turns out to be already reported, it will be marked as a duplicate.

Regards
Martin
@Martin: I don't know if my, now numerous, issues with TW and Leap relate to this problem. It's just that a huge ration of problems have seemed to beset my various openSUSE installs, all in overlapping succession . . . that it is becoming problematic to regular use of those systems. I just spent several weeks trying to get TW to simply "log in" in a reasonable time . . . finally via this list-serve finding out that "sddm-qt6" needed to be installed. Concurrent with that grub handling went awry in TW, so I moved it to Leap . . . and the problem followed over there. And, now Leap 15.6 won't revive from suspend . . . what is supposed to be "rock solid" and "tested" is now . . . coming up lame. Sounds like my personal problems with Leap are not related to this thread, but possibly the problems with again TW booting to ER mode might be??
On Mon, 2023-10-09 at 16:02 +0000, Fritz Hudnut wrote:
Sounds like my personal problems with Leap are not related to this thread, but possibly the problems with again TW booting to ER mode might be??
I need more concrete / detailed information to judge that. There's a simple test using command line parameters to figure out if this is the cause, see my previous post in this thread or the Wiki page. I am sorry to hear that you've had lots of issues with openSUSE, but the only way to get out of that mess is to report them one by one. Martin
On Sun, 2023-10-08 at 08:03 +1300, Michael Hamilton wrote:
Bumping this important thread.
I think the chickens are coming home to roost on this change; there seem to be quite a few people reporting boot issues.
Do you have anything concrete? If yes, please provide links or hints. I'm looking myself, but as there are multiple ways in which people can report issues, I might overlook something. Thanks, Martin
On Tuesday 10 October 2023, Martin Wilck via openSUSE Factory wrote:
On Sun, 2023-10-08 at 08:03 +1300, Michael Hamilton wrote:
Bumping this important thread.
I think the chickens are coming home to roost on this change; there seem to be quite a few people reporting boot issues.
Do you have anything concrete? If yes, please provide links or hints. I'm looking myself, but as there are multiple ways in which people can report issues, I might overlook something.
Thanks, Martin
The main thing that bit me was that the order of /dev/sd* is no longer fixed, which caused problems for my boot configuration and drive configuration scripts (hdparm/smartctl post-boot scripts).

 0) My symptom is being unable to boot (I think people reporting slow boots are experiencing a different problem).
 1) Make sure /dev/sd* isn't used in /etc/fstab.
 2) Check /etc/default/grub, make sure GRUB_DISABLE_LINUX_UUID is commented out (normal case).
 3) Check /boot/grub2/grub.cfg and make sure it doesn't reference /dev/sd* in other ways.
 4) Track down any scripts that set parameters for any /dev/sd* devices.

When dealing with #4 above, I'm not sure what hdparm and smartctl accept, so for the moment I'm using code that extracts the /dev/sd name from the output of the lsscsi command.

There is a discussion going on in the forums about how to restore stability to the ordering (for those that don't want to change or want a short term fix), see:

https://forums.opensuse.org/t/problem-with-disks-order-after-snapshot-202309...

Which suggests adding scsi_mod.async_probe=0 to the bootline. Would this be the same for nvme or does scsi cover that? The suggestion seems to work.

As you suggested, adding udev.scsi_symlink_src=LTVS udev.scsi_id_serial_src=LTVS to the bootline also seems to result in the old ordering being used.

With both suggestions above, I haven't done enough boots to be sure of stability of the /dev/sd* assignments. But either is worth a go to get back into the system and fix the /dev/sd* references.

Cheers,
Michael
On Tue, 2023-10-10 at 07:53 +1300, Michael Hamilton wrote:
The main thing that bit me was that the order of /dev/sd* is no longer fixed, which caused problems for my boot configuration and drive configuration scripts (hdparm/smartctl post-boot scripts).
That shouldn't have anything to do with the recent change. The order of /dev/sd* has been non-deterministic at least since kernel commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for asynchronous probing"), which was in upstream kernel 5.3, 4 years ago. This kernel has been in Leap since 15.2. But even before that, the order has never been truly deterministic, even if it might have seemed to be the case on some systems. The order has always been dependent on the order of SCSI drivers loaded and the order of controllers on the PCI bus. Bottom line: don't use /dev/sd* for referring to specific devices in configuration files, ever.
 0) My symptom is being unable to boot (I think people reporting slow boots are experiencing a different problem).
 1) Make sure /dev/sd* isn't used in /etc/fstab.
 2) Check /etc/default/grub, make sure GRUB_DISABLE_LINUX_UUID is commented out (normal case).
 3) Check /boot/grub2/grub.cfg and make sure it doesn't reference /dev/sd* in other ways.
 4) Track down any scripts that set parameters for any /dev/sd* devices.
Again, the recent sg3_utils change has nothing to do with /dev/sd*. It only removes certain symlinks under /dev/disk/by-id.
When dealing with #4 above, I'm not sure what hdparm and smartctl accept, so for the moment I'm using code that extracts the /dev/sd name from the output of the lsscsi command.
There is a discussion going on in the forums about how to restore stability to the ordering (for those that don't want to change or want a short term fix), see:
https://forums.opensuse.org/t/problem-with-disks-order-after-snapshot-202309...
Which suggests adding scsi_mod.async_probe=0 to the bootline. Would this be the same for nvme or does scsi cover that? The suggestion seems to work.
Strange. This is wrong syntax. The syntax of the parameter is "scsi_mod.disable_async_probing=<driver>". See https://www.suse.com/support/kb/doc/?id=000018449 This parameter exists only on Leap, not on TW (and probably not on ALP). nvme is totally different, there is no corresponding parameter.
As you suggested, adding udev.scsi_symlink_src=LTVS udev.scsi_id_serial_src=LTVS to the bootline also seems to result in the old ordering being used.
This finding is even stranger. The /dev/sd* names are assigned by the kernel. sg3_utils or udev have nothing to do with it. The only reason I can think of is that the timing during boot is somehow affected by the different udev rules.
With both suggestions above, I haven't done enough boots to be sure of stability of the /dev/sd* assignments. But either is worth a go to get back into the system and fix the /dev/sd* references.
Sorry, that won't happen, see above. I'm surprised that you haven't run into issues with this approach years ago. The fact that this hits you now must be some weird coincidence that must be inspected further. Please open a bug and assign it to me. Thanks, Martin
On Tuesday 10 October 2023, Martin Wilck via openSUSE Factory wrote:
On Tue, 2023-10-10 at 07:53 +1300, Michael Hamilton wrote:
The main thing that bit me was that the order of /dev/sd* is no longer fixed, which caused problems for my boot configuration and drive configuration scripts (hdparm/smartctl post-boot scripts).
That shouldn't have anything to do with the recent change. The order of /dev/sd* has been non-deterministic at least since kernel commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for asynchronous probing"), which was in upstream kernel 5.3, 4 years ago. This kernel has been in Leap since 15.2. But even before that, the order has never been truly deterministic, even if it might have seemed to be the case on some systems. The order has always been dependent on the order of SCSI drivers loaded and the order of controllers on the PCI bus.
Bottom line: don't use /dev/sd* for referring to specific devices in configuration files, ever.
 0) My symptom is being unable to boot (I think people reporting slow boots are experiencing a different problem).
 1) Make sure /dev/sd* isn't used in /etc/fstab.
 2) Check /etc/default/grub, make sure GRUB_DISABLE_LINUX_UUID is commented out (normal case).
 3) Check /boot/grub2/grub.cfg and make sure it doesn't reference /dev/sd* in other ways.
 4) Track down any scripts that set parameters for any /dev/sd* devices.
Again, the recent sg3_utils change has nothing to do with /dev/sd*. It only removes certain symlinks under /dev/disk/by-id.
A lot of scripts/configs on TW have been working fine for years, so whatever has perturbed the allocation of /dev/sd* is recent, but it might not have anything to do with sg3_utils. It was just the most recent major announcement that mentions a similar topic.
When dealing with #4 above, I'm not sure what hdparm and smartctl accept, so for the moment I'm using code that extracts the /dev/sd name from the output of the lsscsi command.
There is a discussion going on in the forums about how to restore stability to the ordering (for those that don't want to change or want a short term fix), see:
https://forums.opensuse.org/t/problem-with-disks-order-after-snapshot-202309...
Which suggests adding scsi_mod.async_probe=0 to the bootline. Would this be the same for nvme or does scsi cover that? The suggestion seems to work.
Strange. This is wrong syntax. The syntax of the parameter is "scsi_mod.disable_async_probing=<driver>". See https://www.suse.com/support/kb/doc/?id=000018449
Thanks. I've found that the stable assignment did not survive a cold boot (maybe I was just "lucky"). So, yes, this doesn't work.

I've tried the syntax you suggested. I didn't know what to put for driver; after Googling, I used:

  % udevadm info -a -n /dev/sda | grep -oP 'DRIVERS?=="\K[^"]+'
  sd
  ahci
  pcieport

And added the following to the boot line:

  scsi_mod.disable_async_probing=sd,ahci,pcieport

This appeared to work for a couple of cold boots; the drives all got mapped as they used to be. But it failed to work on a reboot without powering down. But then it worked on a following cold boot. So I don't think this works, or at least not reliably.
This parameter exists only on Leap, not on TW (and probably not on ALP). nvme is totally different, there is no corresponding parameter.
As you suggested, adding udev.scsi_symlink_src=LTVS udev.scsi_id_serial_src=LTVS to the bootline also seems to result in the old ordering being used.
This finding is even stranger. The /dev/sd* names are assigned by the kernel. sg3_utils or udev have nothing to do with it. The only reason I can think of is that the timing during boot is somehow affected by the different udev rules.
Probably things were stable because I was not cold booting, or was "lucky".
With both suggestions above, I haven't done enough boots to be sure of stability of the /dev/sd* assignments. But either is worth a go to get back into the system and fix the /dev/sd* references.
Sorry, that won't happen, see above. I'm surprised that you haven't run into issues with this approach years ago. The fact that this hits you now must be some weird coincidence that must be inspected further. Please open a bug and assign it to me.
Bug: https://bugzilla.opensuse.org/show_bug.cgi?id=1216070 I didn't attribute the bug to sg3_utils, I don't think we're sure of that.
Thanks, Martin
Thanks for explaining all the above. I will see if I can try an older kernel (just out of interest, I've already expunged /dev/sd from being referenced persistently). Michael
On Tue, 2023-10-10 at 12:55 +1300, Michael Hamilton wrote:
There is a discussion going on in the forums about how to restore stability to the ordering (for those that don't want to change or want a short term fix), see:
https://forums.opensuse.org/t/problem-with-disks-order-after-snapshot-202309...
Which suggests adding scsi_mod.async_probe=0 to the bootline. Would this be the same for nvme or does scsi cover that? The suggestion seems to work.
Strange. This is wrong syntax. The syntax of the parameter is "scsi_mod.disable_async_probing=<driver>". See https://www.suse.com/support/kb/doc/?id=000018449
Thanks. I've found that the stable assignment did not survive a cold boot (maybe I was just "lucky"). So, yes, this doesn't work.
I've tried the syntax you suggested. I didn't know what to put for driver, after Googling, I used:
  % udevadm info -a -n /dev/sda | grep -oP 'DRIVERS?=="\K[^"]+'
  sd
  ahci
  pcieport
And added the following to the boot line: scsi_mod.disable_async_probing=sd,ahci,pcieport
The driver to specify is the SCSI host adapter driver. It's "ahci" in your case. To find out, walk up the hierarchy using "udevadm info -a", as you already did, and look out for the first parent of the SCSI host ("host$N") which has a non-empty "DRIVERS" attribute.
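The lookup described above can be sketched as a filter over "udevadm info -a" output. This is a minimal sketch, assuming the usual output layout in which each parent device block contains a KERNELS=="..." line followed by a DRIVERS=="..." line; the function name scsi_host_driver is made up for illustration.

```shell
#!/bin/sh
# Sketch: read "udevadm info -a" output on stdin and print the first
# non-empty DRIVERS value that appears after the SCSI host (hostN) entry.
scsi_host_driver() {
    awk '
        /KERNELS=="host[0-9]+"/ { seen_host = 1; next }
        seen_host && match($0, /DRIVERS=="[^"]+"/) {
            # The prefix DRIVERS==" is 10 characters long; drop it
            # together with the closing quote.
            print substr($0, RSTART + 10, RLENGTH - 11)
            exit
        }
    '
}

# Usage:
#   udevadm info -a -n /dev/sda | scsi_host_driver
```

On a system like the one described in this message, the expectation is that this skips "sd" (which belongs to the device below the host) and stops at the host adapter driver rather than continuing up to "pcieport".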
This appeared to work for a couple of cold boots, the drives all got mapped as they used to be. But it failed to work on a reboot without powering down. But then it worked on a following cold boot. So I don't think this works, or at least not reliably.
`scsi_mod.disable_async_probing=ahci` affects the ordering for a given SCSI host (i.e. SATA port) using the ahci driver. This still doesn't guarantee a stable global ordering, because the ordering of SCSI hosts remains unreliable. It depends on the order in which drivers are loaded (e.g. USB devices may be detected first). You can try to enforce a loading order for SCSI drivers using modprobe.d, like this for usb:

  softdep usb_storage pre: ahci

Furthermore, the SCSI host ordering depends on the discovery of the upper level devices such as PCI devices and SATA ports, which may or may not be deterministic.
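For anyone who wants to try a kernel parameter such as scsi_mod.disable_async_probing=ahci persistently instead of editing the boot line by hand, here is a minimal sketch that appends it to GRUB_CMDLINE_LINUX_DEFAULT. The function name add_kernel_param is made up; the sketch assumes GNU sed, a double-quoted value on a single line, and a parameter that contains no "|" or "&" characters.

```shell
#!/bin/sh
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub (or in the file given as $2), unless already present.
add_kernel_param() {
    param=$1
    file=${2:-/etc/default/grub}
    # Nothing to do if the parameter is already on that line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing double quote.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" "$file"
}

# Usage, as root:
#   add_kernel_param scsi_mod.disable_async_probing=ahci
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```

As explained above, this parameter alone does not guarantee a stable global ordering, so treat it as something to experiment with rather than a fix.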
Probably things were stable because I was not cold booting, or was "lucky".
Yes ;-)
Bug:
Thanks. Martin
On Tuesday 10 October 2023, Martin Wilck via openSUSE Factory wrote:
....
`scsi_mod.disable_async_probing=ahci` affects the ordering for a given SCSI host (i.e. SATA port) using the ahci driver. This still doesn't guarantee a stable global ordering, because the ordering of SCSI hosts remains unreliable. It depends on the order in which drivers are loaded (e.g. USB devices may be detected first). You can try to enforce a loading order for SCSI drivers using modprobe.d, like this for usb:
softdep usb_storage pre: ahci
Furthermore, the SCSI host ordering depends on the discovery of the upper level devices such as PCI devices and SATA ports, which may or may not be deterministic.
I don't work with booting very often, so to confirm the procedure, it should be something like (as root):

  % echo 'softdep usb_storage pre: ahci' >/etc/modprobe.d/10-ahci-scsi.conf
  % dracut -f --regenerate-all

I first just ran dracut -f and the boot order still varied. But after adding the --regenerate-all both a hot and cold boot reverted to the old ordering. Others reading this might also want to note that mkinitrd is no longer available, so do use dracut.

I know the order shouldn't be relied on, but I think for many normal desktop PC's it's probably worked OK since v0.10 in 1992, so this is bound to trip up a few people, at least until all the examples on the web and in various config files get updated (probably never going to happen). It's good to have a documented work around.

Thanks for sorting this out.

Michael
On Wednesday 11 October 2023, Michael Hamilton wrote:
On Tuesday 10 October 2023, Martin Wilck via openSUSE Factory wrote:
....
`scsi_mod.disable_async_probing=ahci` affects the ordering for a given SCSI host (i.e. SATA port) using the ahci driver. This still doesn't guarantee a stable global ordering, because the ordering of SCSI hosts remains unreliable. It depends on the order in which drivers are loaded (e.g. USB devices may be detected first). You can try to enforce a loading order for SCSI drivers using modprobe.d, like this for usb:
softdep usb_storage pre: ahci
Furthermore, the SCSI host ordering depends on the discovery of the upper level devices such as PCI devices and SATA ports, which may or may not be deterministic.
I don't work with booting very often, so to confirm the procedure, it should be something like (as root):
  % echo 'softdep usb_storage pre: ahci' >/etc/modprobe.d/10-ahci-scsi.conf
  % dracut -f --regenerate-all
I first just ran dracut -f and the boot order still varied. But after adding the --regenerate-all both a hot and cold boot reverted to the old ordering. Others reading this might also want to note that mkinitrd is no longer available, so do use dracut.
I know the order shouldn't be relied on, but I think for many normal desktop PC's it's probably worked OK since v0.10 in 1992, so this is bound to trip up a few people, at least until all the examples on the web and in various config files get updated (probably never going to happen). It's good to have a documented work around.
Thanks for sorting this out.
Michael
For anyone interested in this thread, there is now a bug report in which Martin Wilck has provided a more definitive workaround:

https://bugzilla.opensuse.org/show_bug.cgi?id=1216070#c14

Basically, as root:

  echo sd_mod > /etc/modules-load.d/sd_mod.conf
  dracut -f --regenerate-all

Loading sd_mod earlier seems to revert the detection order. This workaround has worked for every boot (at least 10 boots) since I put it in place.

The proper solution is to remove permanent references to /dev/sd* from wherever they might be found. In my case /etc/fstab was already mostly using labels (I just had to label swap), but grub wasn't using UUIDs. I've also had to track down scripts using hdparm and smartctl, which I've changed to refer to whole disks by /dev/disk/by-id/*, for example:

  find /dev/disk/by-id/ -type l \! -name '*-part[0-9]' | grep SATA

I'm still going to use the workaround. The ordering has been fixed/deterministic on my desktops for 30+ years. I prefer to see a more fixed output when I do df, lsblk, and similar.

Michael
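As a small extension of the find command above, the following sketch lists each whole-disk symlink together with the device node it currently resolves to, which can help when converting scripts away from hardcoded /dev/sd* names. The function name whole_disk_links is made up; unlike the pattern above, this one uses '*-part[0-9]*' so that partitions numbered 10 and higher are skipped as well.

```shell
#!/bin/sh
# Sketch: print "link -> device node" for every whole-disk symlink in a
# directory (default /dev/disk/by-id), skipping partition links.
whole_disk_links() {
    dir=${1:-/dev/disk/by-id}
    find "$dir" -type l ! -name '*-part[0-9]*' | sort |
    while IFS= read -r link; do
        printf '%s -> %s\n' "$link" "$(readlink -f "$link")"
    done
}

# Usage:
#   whole_disk_links | grep SATA
```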
[QUOTE]
I don't work with booting very often, so to confirm the procedure, it should be something like (as root):

  % echo 'softdep usb_storage pre: ahci' >/etc/modprobe.d/10-ahci-scsi.conf
  % dracut -f --regenerate-all

I first just ran dracut -f and the boot order still varied. But after adding the --regenerate-all both a hot and cold boot reverted to the old ordering. Others reading this might also want to note that mkinitrd is no longer available, so do use dracut.
[/QUOTE]

That did not appear to help my Leap 15.6 grub ordering . . . copy/pasted those two commands and dracut ran through a long list of items. After that I ran grub2-mkconfig and the "sdb" and "sdc" drives were still "reversed." Ran "update-bootloader" . . . then there were 325 packages with new devel kernel 6.5.7 -lp155 . . . ran them through, again ran "update-bootloader" and shut down. On cold boot the systems in sdb were showing in sdc and vice versa . . . .

Looking for the silver bullet on this grub "disordering" problem.

F
On 09.10.2023 22:21, Martin Wilck via openSUSE Factory wrote:
That shouldn't have anything to do with the recent change. The order of /dev/sd* has been non-deterministic at least since kernel commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for asynchronous probing"), which was in upstream kernel 5.3, 4 years ago.
Is not (S)ATA probing done by ATA layer? Is this change relevant for the consumer systems without true SCSI controllers at all? If yes, which driver should be referenced in this case?
This kernel has been in Leap since 15.2. But even before that, the order has never been truly deterministic, even if it might have seemed to be the case on some systems. The order has always been dependent on the order of SCSI drivers loaded and the order of controllers on the PCI bus.
Yes. And the problem in the forum thread referenced here was caused by making SCSI drivers modules which made them be loaded after USB. https://lists.opensuse.org/archives/list/kernel@lists.opensuse.org/thread/ML...
On Tue, 2023-10-10 at 07:09 +0300, Andrei Borzenkov wrote:
On 09.10.2023 22:21, Martin Wilck via openSUSE Factory wrote:
That shouldn't have anything to do with the recent change. The order of /dev/sd* has been non-deterministic at least since kernel commit f049cf1a7b67 ("scsi: sd: Rely on the driver core for asynchronous probing"), which was in upstream kernel 5.3, 4 years ago.
Is not (S)ATA probing done by ATA layer? Is this change relevant for the consumer systems without true SCSI controllers at all? If yes, which driver should be referenced in this case?
If there's just one SCSI device attached to any given ATA port, it might indeed not matter. But in general, the probing policy for the "sd_mod" driver affects ordering of sd devices attached to any given SCSI host, and SATA ports correspond to SCSI hosts.
This kernel has been in Leap since 15.2. But even before that, the order has never been truly deterministic, even if it might have seemed to be the case on some systems. The order has always been dependent on the order of SCSI drivers loaded and the order of controllers on the PCI bus.
Yes. And the problem in the forum thread referenced here was caused by making SCSI drivers modules which made them be loaded after USB.
https://lists.opensuse.org/archives/list/kernel@lists.opensuse.org/thread/ML...
Right. And this change has been effective since kernel 6.5.4, which reached tumbleweed 230921, one day before I made the "SCSI device identification" announcement which started this thread. Martin
Hello,

On Monday, 9 October 2023, 20:53:34 CEST, Michael Hamilton wrote:
The main thing that bit me was that the order of /dev/sd* is no longer fixed, which caused problems for my boot configuration and drive configuration scripts (hdparm/smartctl post-boot scripts).
 0) My symptom is being unable to boot (I think people reporting slow boots are experiencing a different problem).
 1) Make sure /dev/sd* isn't used in /etc/fstab.
I'd recommend to also check /etc/crypttab.

I had some random mount failures for my encrypted data partition in the last weeks, and now realized that it was listed as /dev/sdb1 in /etc/crypttab. (I now changed it to UUID=... - but since the mount failures only happened rarely, it will take a while to confirm that this was really the issue.)

Regards,
Christian Boltz
--
Looks like I got the wrong address and took my car to the hairdresser. [Niki Kovacs in https://bugzilla.opensuse.org/show_bug.cgi?id=1200583]
On Mon, 2023-10-09 at 21:35 +0200, Christian Boltz wrote:
Hello,
On Monday, 9 October 2023, 20:53:34 CEST, Michael Hamilton wrote:
The main thing that bit me was that the order of /dev/sd* is no longer fixed, which caused problems for my boot configuration and drive configuration scripts (hdparm/smartctl post-boot scripts).
 0) My symptom is being unable to boot (I think people reporting slow boots are experiencing a different problem).
 1) Make sure /dev/sd* isn't used in /etc/fstab.
I'd recommend to also check /etc/crypttab
Right. I'd be grateful for a list of common places where such references occur. Quick brainstorm:
- /etc/fstab
- /etc/crypttab
- /etc/lvm/lvm.conf
- /etc/mdadm.conf
- /etc/smartd.conf
- /etc/hdparm.conf (does anyone use this? seems to be Debian-only?)
- /etc/default/grub and /boot/grub2/grub.cfg
- libvirt XML related to storage
Did I forget anything important? Note that in almost all cases (especially in fstab), file system labels or UUIDs would be preferred.
Martin
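To make the brainstorm list above easier to act on, here is a minimal sketch that searches those files for raw /dev/sdX references. The function names (find_sd_references, scan_files) and the regular expression are illustrative assumptions, not part of sg3_utils or udev. libvirt XML is not covered because its location varies, and the pattern may produce false positives.

```python
import re
from pathlib import Path

# Matches kernel-assigned SCSI disk nodes such as /dev/sda or /dev/sdb1.
SD_NODE = re.compile(r"/dev/sd[a-z]+\d*\b")

# Files named in the brainstorm list; extend as needed for your system.
CANDIDATE_FILES = [
    "/etc/fstab",
    "/etc/crypttab",
    "/etc/lvm/lvm.conf",
    "/etc/mdadm.conf",
    "/etc/smartd.conf",
    "/etc/hdparm.conf",
    "/etc/default/grub",
    "/boot/grub2/grub.cfg",
]


def find_sd_references(text):
    """Return (line_number, line) pairs that mention a /dev/sdX node.

    Lines starting with '#' are treated as comments and skipped.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        if SD_NODE.search(stripped):
            hits.append((number, stripped))
    return hits


def scan_files(paths=CANDIDATE_FILES):
    """Print every non-comment /dev/sdX reference found in the given files."""
    for name in paths:
        try:
            text = Path(name).read_text(errors="replace")
        except OSError:
            continue  # file missing or unreadable: nothing to report
        for number, line in find_sd_references(text):
            print(f"{name}:{number}: {line}")


if __name__ == "__main__":
    scan_files()
```

Any line it prints is a candidate for replacement with a UUID, a label, or a /dev/disk/by-id name; some files, such as grub.cfg, are usually only readable by root.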
On Mon, 2023-10-09 at 21:48 +0200, Martin Wilck wrote:
Right. I'd be grateful for a list of common places where such references occur. Quick brainstorm:
- /etc/fstab
- /etc/crypttab
- /etc/lvm/lvm.conf
- /etc/mdadm.conf
- /etc/smartd.conf
- /etc/hdparm.conf (does anyone use this? seems to be Debian-only?)
- /etc/default/grub and /boot/grub2/grub.cfg
- libvirt XML related to storage
Did I forget anything important?
I have put an extended list here: https://en.opensuse.org/SDB:SCSI_Device_Identification#Where_to_look_for_the... Notify me if something else comes to your mind. Martin
On 10/9/23 11:00, Martin Wilck via openSUSE Factory wrote:
On Sun, 2023-10-08 at 08:03 +1300, Michael Hamilton wrote:
Bumping this important thread.
I think the chickens are coming home to roost on this change; there seem to be quite a few people reporting boot issues.
Do you have anything concrete? If so, please provide links or hints. I'm looking myself, but as there are multiple ways in which people can report issues, I might overlook something.
Thanks, Martin
Hi Martin,
I did not encounter any boot-related issues because I use UUID in grub and LABEL in fstab. However, when I saw your original message about this, I did not expect it to have any effect on my system, since I no longer have any SCSI drives. I have 4 SATA SSDs and 1 SATA HDD.
What I did find is that the /dev/sdX mappings changed for my devices, which is the first time they have done so since I started running TW several years ago. It did affect some stuff I wrote which sorted by /dev/sdX and expected the devices to map to SATA ports 1-5 (port 1 = /dev/sda), as they always had in the past, even though that behavior was not guaranteed.
--
Regards, Joe
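For scripts that sort disks by controller port rather than by /dev/sdX, one possible approach is to parse the names under /dev/disk/by-path. The sketch below assumes that whole-disk links for ATA devices end in "-ata-N" or "-ata-N.M"; that naming, and the helper names ata_port and sort_by_ata_port, are assumptions to verify on your own system. The number is the one the kernel assigns to the port and need not match the labels printed on the mainboard.

```python
import re

# Assumed by-path naming for whole ATA disks, e.g. "pci-0000:00:17.0-ata-1"
# or "pci-0000:00:17.0-ata-1.0". Partition links ("...-part1") do not match.
ATA_LINK = re.compile(r"-ata-(\d+)(?:\.\d+)?$")


def ata_port(name):
    """Return the ATA port number encoded in a by-path name, or None."""
    match = ATA_LINK.search(name)
    return int(match.group(1)) if match else None


def sort_by_ata_port(names):
    """Return (port, name) pairs for whole ATA disks, sorted by port number."""
    pairs = [(ata_port(name), name) for name in names]
    return sorted(pair for pair in pairs if pair[0] is not None)
```

Feeding it the output of os.listdir("/dev/disk/by-path") gives an ordering that does not depend on which sdX name each disk received at boot.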
On Mon, 2023-10-09 at 15:23 -0400, Joe Salmeri wrote:
Hi Martin,
I did not encounter any boot-related issues because I use UUID in grub and LABEL in fstab. However, when I saw your original message about this, I did not expect it to have any effect on my system, since I no longer have any SCSI drives. I have 4 SATA SSDs and 1 SATA HDD.
What I did find is that the /dev/sdX mappings changed for my devices which is the first time they have done this since I started running TW several years ago.
Hm. As I said in my reply to Michael, I am puzzled. The /dev/sd* assignments are made by the kernel before udev even gets to see the devices. I have no clue how changing the udev rules could affect the /dev/sd* device node names. I'd need to see detailed boot logs (possibly with "udev.log_priority=debug") and "udevadm info -e" output to figure out what's going on. (But that should be done in bugzilla or on the forum; I've just started reading through https://forums.opensuse.org/t/problem-with-disks-order-after-snapshot-202309... )
As I said in the original post, (S)ATA drives should have /dev/disk/by-id/ata-* identifiers, which are stable and unaffected by the recent change in sg3_utils.
Regards
Martin
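To see which stable names are available for (S)ATA drives, the ata-* links under /dev/disk/by-id can be resolved to the device nodes they currently point to. This is a minimal sketch; the function name stable_names and its parameters are illustrative, not an existing API.

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id", prefix="ata-"):
    """Map each symlink in by_id_dir starting with prefix to its target node.

    Returns an empty dict if the directory does not exist or is unreadable.
    """
    mapping = {}
    try:
        names = sorted(os.listdir(by_id_dir))
    except OSError:
        return mapping
    for name in names:
        link = os.path.join(by_id_dir, name)
        if name.startswith(prefix) and os.path.islink(link):
            # realpath follows the symlink to the current kernel device node
            mapping[name] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for name, node in stable_names().items():
        print(f"{name} -> {node}")
```

The names on the left stay the same across boots, while the /dev/sdX targets on the right may change.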
@Joe: The change in the grub mapping or "ordering" is where we are sharing the problem. The SCSI drive part I wasn't getting; that goes back to, I think, one of my '00-era Macs that has that kind of drive connector? In my recent problems with TW and now Leap and grub "ordering", that machine is SATA only. So, is my understanding correct that whatever Martin is doing or discussing would not relate to my personal problems with openSUSE?
Fritz Hudnut composed on 2023-10-09 21:00 (UTC):
The changing in the grub mapping or "ordering" is where we are sharing the problem. The SCSI drive part I wasn't getting, that goes back to I think one of my '00 era Macs that has that kind of drive connector??
# lsmod | egrep 'ata|scsi'
libata                380928  2 libahci,ahci
scsi_mod              282624  6 sd_mod,usb_storage,uas,libata,sg,sr_mod
#
SCSI and ATA have been intertwined, IIRC, for more than 20 years. Libata, IIRC, was the means by, or corollary to which, disk device names were changed from hdX (hard disk) to sdX (SCSI disk). The SCSI per-disk partition count had been limited to 16, and the switch caused a disruptive application of the same limit to ATA disks with more than 16 partitions, which took quite some time for some people, particularly multibooters, to overcome.
--
Evolution as taught in public schools is, like religion, based on faith, not based on science.
Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!
Felix Miata
@ et al: Running that command in my Leap 15.6 system, which is now the grub "controller" for my '12 MacPro, I get:
# lsmod | egrep 'ata|scsi'
libata                512000  2 libahci,ahci
scsi_dh_rdac           16384  0
scsi_dh_emc            12288  0
scsi_dh_alua           24576  0
scsi_mod              348160  8 scsi_dh_emc,sd_mod,dm_multipath,scsi_dh_alua,libata,sg,scsi_dh_rdac,sr_mod
scsi_common            16384  5 scsi_mod,sd_mod,libata,sg,sr_mod
I still don't know whether my problems with booting different grub line items relate to this thread. I'll be following along with it for a while.
On Sun, 2023-10-08 at 08:03 +1300, Michael Hamilton wrote:
On Friday 29 September 2023, Martin Wilck via openSUSE Factory wrote:
On Fri, 2023-09-22 at 00:22 +0200, Martin Wilck via openSUSE Factory wrote:
With the forthcoming update of sg3_utils to v1.48 (currently in Base:System, to be submitted to Factory soon), SCSI device identification and SCSI symlink generation by udev will change.
TL;DR: unreliable device identifiers will not be used by default any more, and /dev/disk/by-id symlinks for such identifiers will not be created by default, either. The policy can be changed by editing the udev rules in 00-scsi-sg3_config.rules.
To make sure this information isn't buried in the mailing list, I've created https://en.opensuse.org/SDB:SCSI_Device_Identification
I've added kernel boot parameters as a means of rescue, in case this should cause regressions (I don't expect it will, but well...). The sg3_utils 1.48 update for Factory is planned for week 41.
Regards Martin
Bumping this important thread.
I think the chickens are coming home to roost on this change; there seem to be quite a few people reporting boot issues.
Everybody: if you think you're hit by a regression caused by the sg3_utils update discussed in this thread, please try booting with the kernel parameters
    udev.scsi_symlink_src=LTVS udev.scsi_id_serial_src=LTVS
See also the Wiki article linked above. This change has only been submitted to Factory so far; no other (open)SUSE distribution (SLE/Leap/ALP) is affected.
Martin
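After rebooting with these parameters, it can be useful to confirm that they actually reached the kernel command line. The following is a minimal sketch that reads /proc/cmdline and reports the two settings named above; the function names are illustrative and not part of any existing tool.

```python
def udev_scsi_overrides(cmdline):
    """Extract the udev.scsi_* rescue settings from a kernel command line."""
    wanted = ("udev.scsi_symlink_src", "udev.scsi_id_serial_src")
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in wanted:
            found[key] = value
    return found


def current_overrides(path="/proc/cmdline"):
    """Report the overrides present on the running kernel's command line."""
    try:
        with open(path) as handle:
            return udev_scsi_overrides(handle.read())
    except OSError:
        return {}


if __name__ == "__main__":
    print(current_overrides() or "no udev.scsi_* overrides set")
```

An empty result means the system booted without the rescue parameters.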
@Martin: I'm posting on this thread so as not to spam the bug report. Thanks for mentioning my name over there in regard to my "grub2" issues and inviting me to file a new bug report. Up until this morning I thought my problems with grub2 were resolved. However, this morning I booted my TW install, which has the / directory on one drive and /home on another, and got a strange "hybrid" system that was "TW" but was also showing "ubuntu" data and failing to recognize "zypper", suggesting I could install it via "apt install zypper". I rebooted, picked the oldest installed kernel, and was able to run "zypper" to install 498 packages, after which the reboot seemed to bring up TW only, through to the GUI. That seemed like maybe an "sdX" misread problem, possibly relating to the now closed bug. But I'll monitor it for a bit to see if it recurs.
F
Participants (7): Andrei Borzenkov, Christian Boltz, Felix Miata, Fritz Hudnut, Joe Salmeri, Martin Wilck, Michael Hamilton