[Bug 304657] New: Yast2 repair doesn't check or repair system installed on MD Raid
https://bugzilla.novell.com/show_bug.cgi?id=304657

Summary: Yast2 repair doesn't check or repair system installed on MD Raid
Product: openSUSE 10.3
Version: Beta 2
Platform: i686
OS/Version: openSUSE 10.3
Status: NEW
Severity: Major
Priority: P5 - None
Component: YaST2
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: rccj@ricreig.com
QAContact: jsrain@novell.com
Found By: Beta-Customer

Created an attachment (id=159847) --> (https://bugzilla.novell.com/attachment.cgi?id=159847) Folder containing several supporting files and screenshots

I installed Beta 2 on a MDRaid system in its entirety IAW the HowTo on the OpenSUSE.org website, starting with Alpha 5. After upgrading to Beta 2, I wanted to verify the installation, so I tried the INSTALLATION REPAIR on the CD. It tried to repair everything EXCEPT the installed Beta 2 system. I had a 10.2 system on an IDE drive that it went after, and it ignored the MD partitions but tried to mount the drives making up the MD array. It also tried to modify the 10.2 GRUB and FSTAB.

I rebooted the 10.3 system and ran yast2-repair, which did the same thing, and I was able to document what it looked like as an attachment. I am attaching my partition tables (note the 10.2 drive was NOT mounted), the fstab of the running 10.3 system, the 10.3 grub files and various screen shots of the 'repair' sequence. They show that it initially knew that the md0,1,2,3 partitions existed and were mounted and that the sda partitions were not. Yet when it started doing fsck tests, it ran them on the wrong drives, including the drives associated with the wrong OS version, not the 10.3 version on the MD raid partitions.

The first time this happened, I did it in 'auto' and it 'fixed' everything "real good". I spent several days unfixing it. It ruined a perfectly good 10.2 installation. For this test I skipped or refused all repairs, but the repair program WILL NOT allow an ABORT, which I consider another bug.
-- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=304657#c1
Matej Horvath
https://bugzilla.novell.com/show_bug.cgi?id=304657#c2
--- Comment #2 from Richard Creighton
https://bugzilla.novell.com/show_bug.cgi?id=304657#c3
Richard Creighton
https://bugzilla.novell.com/show_bug.cgi?id=304657#c4
--- Comment #4 from Richard Creighton
https://bugzilla.novell.com/show_bug.cgi?id=304657
Andreas Jaeger
https://bugzilla.novell.com/show_bug.cgi?id=304657
Steffen Winterfeldt
https://bugzilla.novell.com/show_bug.cgi?id=304657
Katarina Machalkova
https://bugzilla.novell.com/show_bug.cgi?id=304657#c5
Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657#c6
Richard Creighton
--- Comment #5 from Jiří Suchomel
2007-08-29 00:21:01 MST --- Sorry, but not everything is clear from the report. You mention that repair broke your settings. However I doubt it did any real change without your confirmation.
The first time, it was run in automatic mode and it was talking about SDx drives, not HDx, which to me eliminated the IDE drive. I was not expecting it to be writing to the wrong drive, so I answered 'Y' like I have in the past for other repairs.
Did you always run repair system from the installation media or from the installed system? The correct way is to run it from media; before you finish it,
The attachment of comment 2 doesn't mention any run of repair. I do not understand from the report how you acquired it.
After the 'repair', neither the 10.3 install nor my original 10.2 system on the IDE drive, which should have been hda, would boot. (I wasn't aware that 10.3 renamed the IDE drives to 'SDn'.) One of my 'Y' answers to the automatic repair from the installation media had written changes to the GRUB configuration on the 10.2 installation and not on the 10.3 drive set on the MD file structure as I had assumed it would. Thus a KNOPPIX boot to a root session.

So, after almost a day of manually figuring out what had happened, discovering the HD to SD naming change, discovering the change to the 10.2 GRUB files, figuring out what they *should* have been and manually rewriting them from a KNOPPIX session, I then backed up the /boot partition "in case". I got the installation that had been installed on the MD raid to boot by booting from the install CD and selecting the option to boot the installed OS from the install menu options (thereby bypassing GRUB). The installation (10.3 b2) continued normally and I was able to log in and, with the exception of the drive renaming, all was apparently normal except for the inability to directly boot. I fixed the GRUB files (on the MD raid /boot) and verified I could boot either the 10.2 or the 10.3b2 OS, and I could (from BIOS) boot either OS first and, using the GRUB on that OS, choose which OS I wanted to boot. So, I can boot off of the IDE drive or the MD raid, which now leaves me ready to test YaST2 - repair.

Now, to answer your question. First, I backed up my 10.2 system /boot partition and re-ran the test from the install disk again, and it broke the same way again, but this time KNOPPIX and a copy of the backup fixed things rapidly, leaving the question of how to show you what was happening. What I did was to go ahead and boot 10.3 and run "yast2 repair" as root so I could screen print the session, which I couldn't figure out how to do from the installation disk repair session.
Thus, this truly is a YaST2 repair session running from a running, booted 10.3b2 MD raid installed system.
When yast2-repair detects more than one installation, it should offer you a selection of which one to check/repair. If it wasn't shown, it may be because RAID-based systems are not fully supported by the repair module.
Which is precisely the object of this report. Like you, I am trying to help make SuSE the best product possible. Installation of the entire OS on a MD raid system is documented as being possible as of 10.2 on the web site (I'll get you the openSUSE URL if you don't have it, but it isn't handy at the moment), and it does work: the 10.3b2 system boots and runs completely from the MD raid system (even with the IDE drive electrically disconnected). Given that the OS obviously does fully support running from a MD raid installation, the repair module should fully support that installation, which is the point of the report. I had to 'cheat' to get the pictures of the failure, but I assure you it looked essentially the same when it was happening without the benefit of the KDE screen print functions.
https://bugzilla.novell.com/show_bug.cgi?id=304657#c7
--- Comment #7 from Richard Creighton
When yast2-repair detects more than one installation, it should offer you a selection of which one to check/repair. If it wasn't shown, it may be because RAID-based systems are not fully supported by the repair module.
I overlooked the first part of your statement in my last reply. Yes, if it finds more than one valid (damaged or otherwise) installation, it should offer a choice of which one it should try to repair regardless of the media upon which it is installed, and it did not do this, which I hope the screen shots conveyed.
https://bugzilla.novell.com/show_bug.cgi?id=304657#c8
Jiří Suchomel
Given that the OS obviously does fully support running from a MD raid installation, the repair module should fully support that installation, which is the point of the report
Yes, there's obvious logic in this requirement, but in fact the repair module is regarded as a helper for common situations rather than a tool able to fix everything done in installation. I'll try to look into the logs and see what can be done.
https://bugzilla.novell.com/show_bug.cgi?id=304657#c9
--- Comment #9 from Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657#c10
--- Comment #10 from Richard Creighton
Well, it is true that "Repair" is able to do some nasty things to your system, which is the dark side of the fact that it is (at least in most common cases) able to repair it. And I understand your frustration after the broken repair, but there is no other option than asking the user to confirm a possibly dangerous action.
However, I still think that at least the bootloader configuration should be quite easy to restore: if you boot the installation CD of 10.3, you can select Repair right there (I think at the Installation Mode screen), and once the repair module is loaded, in Expert settings it is possible to run the bootloader configuration, which should be able to propose the correct setting. Expert settings also allow calling the partitioner module, and maybe it would be possible to recover your system from that point, but I'm not sure if it would be enough.
Yes, I agree; however, in this case I was never presented with that option. I was initially notified that there was my 10.2 drive (which it correctly identified as not mounted), and it discovered the MD0 to MD3 drives as well, which can be seen on one of the screen shots, but it does not mention file systems or offer repair options. It goes directly into FSCK of the raw devices and never does a FSCK or other check of the MD raid devices. During the FSCK of the devices comprising the MDx devices, it reports many errors, yet it 'knew' the type was part of a raid and should not have run the test in the first place, so that is one bug right there. The second bug is that it did NOT detect the 2nd FS and immediately went to work on the only one it did find, which was the 10.2 FS and not the desired 10.3 FS on the MD raid device. So not only did it not offer to work on the correct FS, it apparently did not even scan for it or detect it, even though it is obvious it knew the raid device existed because it presented the info in the screenshot.
Given that the OS obviously does fully support running from a MD raid installation, the repair module should fully support that installation, which is the point of the report
Yes, there's obvious logic in this requirement, but in fact the repair module is regarded as a helper for common situations rather than a tool able to fix everything done in installation.
I'll try to look into the logs and see what can be done.
I really appreciate the work you are doing. The repair function is primarily for those who are inexperienced, whose systems are malfunctioning and who need help, and when it makes things worse, it makes SuSE look bad as a product. I have been supporting Linux almost since it was first released and truly want it to succeed and provide an alternative to MS, but until the average non-technical person can safely expect 'REPAIR' to do just that on the install disk of their chosen distribution, I am afraid Linux/SuSE will remain a 3rd world wannabe, and that is a shame that I am working to overcome.
https://bugzilla.novell.com/show_bug.cgi?id=304657#c11
--- Comment #11 from Richard Creighton
https://bugzilla.novell.com/show_bug.cgi?id=304657#c12
Jiří Suchomel
Initially I received an UPDATE ...
If you want to report any other problem than this one about repair, file a separate bug.
I took screenshots but it does not save them on disk, but apparently in RAM, so not knowing that, they were lost.
See comment 9 about how to save the logfiles or screenshot from that session.
This screenshot is VERY interesting because it shows that it knew about MD1 and MD3 but immediately it also shows that it intended to do file system checks on sda1 and sdb2 among others which are NOT part of the installed Beta 3 system if you will refer to the PartTbl10.3.png. Those devices are part of the 10.2 file system shown also in Partn10.2.png.
How is that possible? PartTbl10.3.png shows that sdb1 is used by md1, sdb2 by md0 etc.

Thomas: Storage::GetTargetMap reports /dev/sdb2 in the devices list of /dev/md0:

$["chunk_size":32, "detected_fs":`swap, "device":"/dev/md0", "devices":["/dev/sdb2", "/dev/sdc2"], "fstopt":"defaults", "fstype":"MD Raid", "label":"swap", "mount":"swap", "mountby":`label, "name":"md0", "nr":0, "raid_type":"raid0", "size_k":2104320, "type":`sw_raid, "used_fs":`swap]
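The map above lists /dev/sdb2 and /dev/sdc2 as members of /dev/md0. As a minimal sketch (in Python rather than YCP, and not YaST code), a checker could use the "devices" lists to leave raw member partitions alone and consider only the array devices that carry a real filesystem. The function name check_candidates is hypothetical, YCP symbols such as `sw_raid are written as plain strings, and the md1 entry (its "ext3" filesystem and single member) is an assumption made up for illustration; only the md0 keys come from the report.

```python
def check_candidates(target_entries):
    """Return devices that could sensibly be handed to a filesystem check.

    target_entries: list of dicts shaped like the md0 map quoted above
    (keys "device", "type", "devices", "used_fs"). Hypothetical helper.
    """
    # Collect every partition that is a member of a software RAID array.
    raid_members = set()
    for entry in target_entries:
        if entry.get("type") == "sw_raid":
            raid_members.update(entry.get("devices", []))

    # Skip raw RAID members (they hold no filesystem of their own)
    # and skip swap, which has nothing for fsck to check.
    return [
        entry["device"]
        for entry in target_entries
        if entry["device"] not in raid_members and entry.get("used_fs") != "swap"
    ]


entries = [
    {"device": "/dev/sdb1", "type": "primary"},   # member of md1 per the report
    {"device": "/dev/sdb2", "type": "primary"},   # member of md0
    {"device": "/dev/sdc2", "type": "primary"},   # member of md0
    {"device": "/dev/md0", "type": "sw_raid", "used_fs": "swap",
     "devices": ["/dev/sdb2", "/dev/sdc2"]},      # values from the report
    {"device": "/dev/md1", "type": "sw_raid", "used_fs": "ext3",
     "devices": ["/dev/sdb1"]},                   # assumed values
]

print(check_candidates(entries))  # ['/dev/md1']
```

With these sample entries only /dev/md1 remains: the three sdX partitions are filtered out as array members and /dev/md0 is dropped because it is swap.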
https://bugzilla.novell.com/show_bug.cgi?id=304657#c13
--- Comment #13 from Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657#c14
--- Comment #14 from Richard Creighton
Created an attachment (id=164079) --> (https://bugzilla.novell.com/attachment.cgi?id=164079) [details] patch for /usr/share/YaST2/modules/OSRFstab.ycp
Richard, here is my first attempt: use this patch on /usr/share/YaST2/modules/OSRFstab.ycp and run 'ycpc -c /usr/share/YaST2/modules/OSRFstab.ycp'. Then run 'yast2 repair' and attach the log files from your "check session" (I do not advise you to do any repairs, but I hope you would not need them anyway)
Since we last 'spoke', I totally reinstalled 10.2 on that DMraid machine, then ran the 10.3 upgrade from the B3 DVD, so the environment has changed somewhat. The upgrade still failed, but I was afraid to try the repair because last time it almost ruined my 10.2 system on my IDE drive. This machine has 9 drives, 4 of which are only accessible by a raid controller that as yet won't run with kernels past 2.6.18. I had thought that drive safe previously (and under 10.2 it had been), but with the SATA/IDE rename thing in 10.3, everything has been subject to damage, including the MBR of that supposedly immune IDE drive.

Now, I am willing to risk it again now that I know I *can* recover, but I am NOT a programmer, so I am not real sure how to apply the patch you have provided. Am I correct in assuming that there is a program available to me as root called 'ycpc' that I can run and use the patch as an argument? Then re-run yast2 repair and capture the log file for you. If that is correct, I will be happy to oblige. Sorry for my stupidity... I am 64 years old and play with Linux but am not a serious system programmer, and I respect you that are.
https://bugzilla.novell.com/show_bug.cgi?id=304657#c15
--- Comment #15 from Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657#c16
--- Comment #16 from Richard Creighton
Index: OSRFstab.ycp
===================================================================
--- OSRFstab.ycp (revision 40176)
+++ OSRFstab.ycp (working copy)
@@ -267,6 +267,17 @@
                     linux_partition_list = add (linux_partition_list,
                         partition["device"]:"");
                 }
+                else if (partition["type"]:`primary == `sw_raid)
+                {
+                    boolean checked = true;
+                    foreach (string d, partition["devices"]:[], {
+                        if (!contains (checked_partitions, d))
+                            checked = false;
+                    });
+                    if (checked)
+                        linux_partition_list = add (linux_partition_list,
+                            partition["device"]:"");
+                }
             }
             else
             {
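For readers who do not know YCP, the branch added by the patch can be paraphrased roughly as follows. This is only a Python sketch of the hunk's visible logic under stated assumptions: the function name add_checked_raid is hypothetical, YCP symbols are written as strings, and nothing here shows how checked_partitions is filled in the rest of OSRFstab.ycp.

```python
def add_checked_raid(partition, checked_partitions, linux_partition_list):
    """Paraphrase of the patch's new 'else if' branch (illustration only).

    A software RAID device is appended to linux_partition_list only when
    every one of its member devices is already in checked_partitions.
    """
    # partition["type"]:`primary == `sw_raid  (default is `primary)
    if partition.get("type", "primary") == "sw_raid":
        members = partition.get("devices", [])
        # Equivalent of the foreach loop that clears the 'checked' flag.
        checked = all(d in checked_partitions for d in members)
        if checked:
            # YCP add() returns a new list; mirror that instead of mutating.
            return linux_partition_list + [partition.get("device", "")]
    return linux_partition_list


md0 = {"device": "/dev/md0", "type": "sw_raid",
       "devices": ["/dev/sdb2", "/dev/sdc2"]}

print(add_checked_raid(md0, ["/dev/sdb2"], []))               # []
print(add_checked_raid(md0, ["/dev/sdb2", "/dev/sdc2"], []))  # ['/dev/md0']
```

As written, the array is considered only after all of its members appear in checked_partitions. Whether being in that list implies that fsck was run on the raw members is not visible from this hunk alone.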
https://bugzilla.novell.com/show_bug.cgi?id=304657#c17
--- Comment #17 from Richard Creighton
https://bugzilla.novell.com/show_bug.cgi?id=304657#c18
Richard Creighton
https://bugzilla.novell.com/show_bug.cgi?id=304657#c19
--- Comment #19 from Richard Creighton
This screenshot is VERY interesting because it shows that it knew about MD1 and MD3 but immediately it also shows that it intended to do file system checks on sda1 and sdb2 among others which are NOT part of the installed Beta 3 system if you will refer to the PartTbl10.3.png. Those devices are part of the 10.2 file system shown also in Partn10.2.png.
How is that possible? PartTbl10.3.png shows that sdb1 is used by md1, sdb2 by md0 etc. Thomas: Storage::GetTargetMap reports /dev/sdb2 in the devices list of /dev/md0:
It took me a bit to understand the question, sorry. SDAx is part of the IDE drive, which is part of the 10.2 system and should not be subject to the repair/test. SDB to SDE are part of MD0 to MD3, which contain the actual filesystems. The SDB-E devices are only partitions/disks which themselves do not contain a filesystem of any kind and should not have fsck run against them, as it will always fail. The partition table clearly states that these devices are part of MDRaid devices and not regular EXT2/3 or FAT or other normal types of file systems that can be checked by some variant of FSCK. Thus, the attempt to run fsck /dev/sdbN is, I feel, part of the problem, as there certainly would be nothing to repair. Anything repairable would be within the filesystem contained in MD1 or 2, for instance, and that was never checked at any time, nor was it offered to be checked. If I understand your question, that is what I was trying to point out by comment #12 above.
https://bugzilla.novell.com/show_bug.cgi?id=304657#c20
Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657#c21
Richard Creighton
Richard, was your test attempt with the patch different from the previous ones without the patch? Were more/fewer partitions checked or reported as broken?
I submitted the logs from the test in #18, but in a word, it still tested the raw devices and not the MD arrays the devices were part of. My sense was 'same'. Something (not your patch) caused the installation to destruct (after an update) and resulted in having to completely reinstall B3, which I did by starting again as a functioning 10.2 system and doing an update install of B3 (which still failed), which did go more smoothly despite the failure. I can see the effects of all the work because most of the install came from FACTORY. GRUB still got grunged, repair still doesn't (this bug), repair wants to 'fix' a perfectly good IDE 10.2 installation (which I have now removed for protection), but overall I can see progress and I want you to know it is appreciated. Short answer: no effect from the patch was noted.
https://bugzilla.novell.com/show_bug.cgi?id=304657#c22
Thomas Fehr
https://bugzilla.novell.com/show_bug.cgi?id=304657#c23
--- Comment #23 from Richard Creighton
https://bugzilla.novell.com/show_bug.cgi?id=304657#c24
--- Comment #24 from Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657#c25
Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657#c26
Richard Creighton
https://bugzilla.novell.com/show_bug.cgi?id=304657#c27
--- Comment #27 from Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657#c28
--- Comment #28 from Richard Creighton
Thanks much for a test and a report! (Also the idea of clearing y2log was nice)
I'll try to do more, but unfortunately I do not expect to have a real fix for this issue ready for 10.3, as it is close to release and I have some more important tasks :-( But I hope it could get better in the next version.
My pleasure. Apparently we have a fan club. I have received several e-mails from several countries from people with MD raid installs who have wanted a solution for quite some time. I was one of the original authors of QuickBBS before the internet got going, so I understand what it is to have to troubleshoot remotely and the value of a good report. What you are doing exceeds by far the programming I used to do, but I haven't forgotten deadlines and the pressures of release dates. Technology has moved so fast that I feel like I'm moving backwards :) Remember the goal is to make Linux work for the average person. It needs to be smart so they don't have to be. This repair module is one of the most important tools on the whole DVD, IMNSHO, not just for MD RAID, but for all users.
https://bugzilla.novell.com/show_bug.cgi?id=304657#c29
Richard Creighton
https://bugzilla.novell.com/show_bug.cgi?id=304657
User jsuchome@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=304657#c30
Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657
User rccj@ricreig.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=304657#c31
Richard Creighton
Hi Richard, could you please try to install Beta1 of openSUSE11.0, do the scanning of your disks and attach the fresh YaST logfiles?
Do not try to repair anything, I haven't had time to work on the detection, so I expect the state will be very similar to the current one. But I'm requesting logs as a starting point of my work based on openSUSE 11...
Hi,

It will be a few days before I can set up a machine to do this. The machine I was using before is now a production machine. As I know you are interested in the repair functions in a MD RAID environment, I will have to set up a machine to do this again. In the FWIW dept., I have been unable to do a Beta1 install from the DVD (yes, the DVD verifies), but I can successfully UPGRADE from FACTORY, just not by the DVD. When the upgrade is complete, I seem to have a B1 installation, albeit not on a RAID machine, but otherwise on the same motherboard and memory, etc. Would it help to run the repair function on that machine until I can upgrade it to a MD raid configuration? Again, for the moment, I am unable to do an INSTALL or UPGRADE from the DVD, only from factory over the internet via YAST.

Richard
https://bugzilla.novell.com/show_bug.cgi?id=304657
User jsuchome@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=304657#c32
--- Comment #32 from Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657
User rccj@ricreig.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=304657#c33
--- Comment #33 from Richard Creighton
Well, I thought it would be easier for you. Now I think it would be better for you not to put time into this and to wait until I have at least a somewhat improved version available.
I really think this is one of the more important bugs to squash. If a Windoze convert has a problem and tries to 'repair' something and it boogers it worse, OpenSuSE will really get a bad reputation quick. I can't let that happen just because it is difficult or inconvenient or time consuming. I am retired and, other than a heap of doctor appointments, I have little else to do, so other than going out and getting a few disk drives and configuring them as a raid array in the new machine, I am happy to work with you. All I need to know is whether a baseline repair log on this machine would be useful to you or not, even without the raid configuration set up yet. I do intend to invest whatever time it takes to help you make this work. It is one of the more important bugs facing OpenSuSE, IMNSOHO. (In My Not So Humble Opinion) :)
https://bugzilla.novell.com/show_bug.cgi?id=304657
User jsuchome@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=304657#c34
--- Comment #34 from Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657
User rccj@ricreig.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=304657#c35
--- Comment #35 from Richard Creighton
No, it doesn't make sense without the raid setup.
I will try to have it set up sometime this weekend. I have cataract surgery scheduled, but I think that won't interfere. I'll get back to you in a few days when I have it set up. That will also give you some time to work on the code on your end. With luck, by early next week, I will have the drives installed and a MD raid working on that machine. It will be B1 with about 1TB of RAID 5 software (MD) raid, similar to the setup I used to have, and will use the identical motherboard, BIOS and memory. I liked that machine, so I duplicated it except for the raid :)
https://bugzilla.novell.com/show_bug.cgi?id=304657
User jsuchome@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=304657#c36
Jiří Suchomel
https://bugzilla.novell.com/show_bug.cgi?id=304657
User jsuchome@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=304657#c37
Jiří Suchomel