Reiser gone bad -- recovery how?
I think I'm in a world-of-hurt: I was "running out" of room on the partition that held the pages served by my webserver [running on SuSE 7.x, meaning they were "hosted" under /usr/local/httpd], so I decided to copy the "tree" to another partition, where I would then alter the .conf file to point to the new partition [which I was intending to call "/srv/www", since that seems to be the way 8.x/9.x does it].

HOWEVER, at some point in the copy process "things went bad", and the system locked up [we're talking flashing caps/scroll lock lights]. During the reboot and subsequent transaction replay [reiser], I got a null pointer exception and the system hung again -- rebooting THIS time returned "cannot find a valid reiserfs partition on 09:00".

OK, "rescue CD" time -- took a while to "remember" exactly how I had this system set up, but in a nutshell it's like this: two identical hard drives [10 GB apiece -- IBM Deskstars as I recall] partitioned the same down the line. /dev/hd[ab]5 is joined together as a "raid" device and is known as /dev/md0 -- this is my "root" directory. The rest of the partitions, /dev/hd[ab][6-8], are joined together as "volume groups" -- vg00/system is "/var", vg01/data is "/home", and vg02/backup is "/snapshot", which is where I was moving my webpages to [largest free space...].

reiserfsck reported it couldn't find a superblock, so I tried to "force" the creation of a new one with parameter "--rebuild-sb" -- here is where I might have shot myself in the foot: rebuild-sb suggested a block size of 4096, however I recalled that during the boot sequence, when it couldn't find anything on device 09:00, there were references to a 1024-byte block size, so I overrode "4096" with "1024".

I then found out/realized/whatever that I also had to perform a "--rebuild-tree" operation, and this is the scary part: the program reported THOUSANDS of "size (...) should be (...)" error messages where the first and second ellipses seemed to toggle between 0 and 1000.

NOW reiserfsck reports "FATAL corruption found" -- worse still, the "--rebuild-tree" operation fails with "no reiser metadata found whatsoever, have you repartitioned?" [or words to similar effect] and goes on to suggest a "quick fix" on how to find/identify the actual original superblock [the presumption being that you've re-partitioned and "moved" things around].

So, "am I hosed"? Where do I go from here? I know I haven't "repartitioned" the drive, so the hinted-at help in the reiserfsck program won't help.

-- Yet another Blog: http://osnut.homelinux.net
On Thursday 11 December 2003 02:11 am, Tom Emerson wrote:
HOWEVER, at some point in the copy process "things went bad", and the system locked up. [we're talking flashing caps/scroll lock lights] During the reboot and subsequent transaction replay [reiser], I got a null pointer exception and the system hung again -- rebooting THIS time returned "cannot find a valid reiserfs partition on 09:00"
I had the EXACT same problem, including your other symptoms below. It only happened when copying from one drive to another. As it turned out, my power supply was too small for my system. A bigger power supply solved the problem.
reiserfsck reported it couldn't find a superblock, so I tried to "force" the creation of a new one with parameter "--rebuild-sb" -- here is where I might have shot myself in the foot: rebuild-sb suggested a block size of 4096, however I recalled that during the boot sequence when it couldn't find anything on device 09:00 that there were references to a 1024-byte block size, so I overrode "4096" with "1024".
ALWAYS TAKE THE ADVICE OF THE REISERFS TOOLS, UNLESS YOU ARE A REISERFS DEVELOPER! I'm pretty sure that this is where you shot yourself in the foot.
I then found out/realized/whatever that I also had to perform a "--rebuild-tree" operation, and this is the scary part: the program reported THOUSANDS of "size (...) should be (...)" error messages where the first and second ellipses seemed to toggle between 0 and 1000.
This is where your drive *really* got hosed.

As I said above, I had the exact same problem. I accepted the reiserfsck suggestion in --rebuild-sb. After that, a simple --rebuild-tree fixed it. Actually, I had the same problem 5 or 6 times and went through this process each time before I started replacing everything and discovered the *real* problem. In the end I put all of my old hardware back together with a new power supply, and the computer is still running right now.

Did you back up the partition like reiserfsck told you? The chances are good that you'll never recover unless you made a backup.

You could try asking the ReiserFS developers for help. They *might* be able to help you.

--
James Oakley
Engineering - SolutionInc Ltd.
joakley@solutioninc.com
http://www.solutioninc.com
On Thursday 11 December 2003 12:51 pm, James Oakley wrote:
On Thursday 11 December 2003 02:11 am, Tom Emerson wrote:
HOWEVER, at some point in the copy process "things went bad", and the system locked up. ... rebooting returned "cannot find a valid reiserfs partition on 09:00"
I had the EXACT same problem, ... As it turned out, my power supply turned out to be too small for my system.
I wouldn't expect this to be the problem as this system has been running 24x7 for a couple of years [then again, I *have* had a couple of strange lockups in the last month or so -- it could be "going bad..."]
reiserfsck reported it couldn't find a superblock, so I [rebuilt it]... rebuild-sb suggested a block size of 4096, however ... during the boot there were references to a 1024-byte block size, so I overrode it
ALWAYS TAKE THE ADVICE OF THE REISERFS TOOLS, UNLESS YOU ARE A REISERFS DEVELOPER!
[that is generally what I do, but read on]
I'm pretty sure that this is where you shot yourself in the foot.
Actually, "the first time" I tried to rebuild the table I did choose 4096, but when it was "all done", it reported that it still couldn't find a superblock. When I rebooted, I monitored the messages a little more carefully and noticed the 1024 blocksize, which made me wonder. Another data point is that the system in question was built with SuSE 7.x, but the "rescue" CD is from the 9.0 installation disk -- I figured there had been a change in some of the defaults since that time.
... I also had to perform a "--rebuild-tree" ... the program reported THOUSANDS of "size (...) should be (...)" error messages
This is where your drive *really* got hosed.
I was afraid of that the instant the screen "filled up" with these messages -- unfortunately, once it starts, you can't really [safely] stop it :( [OTOH, since this is the first time I've ever done a "rebuild tree", it is entirely possible that this is/was the expected output -- I really have no way of knowing...]
Did you backup the partition like reiserfsck told you? The chances are good that you'll never recover unless you made a backup.
Equally unfortunate, I didn't have a partition "to back up TO", nor do I have a CD burner on this system (it does have a QIC tape drive, but I've never been successful getting it to "work" -- one of my "projects" once I upgraded the system to a newer version was to get the tape drive "working"...).

On the plus side, the only thing that was actually "on" this partition that I need is the website proper -- but as I develop the "pages" on other machines and copy to the server, I have the (semi) originals -- what I don't have is them "all in the same place" [nor "up to date" in the other locations -- live and learn :) ]

-- Yet another Blog: http://osnut.homelinux.net [down at the moment :( ]
[see the archives for the story so far...] On Thursday 11 December 2003 5:57 pm, Tom Emerson wrote:
On Thursday 11 December 2003 12:51 pm, James Oakley wrote:
ALWAYS TAKE THE ADVICE OF THE REISERFS TOOLS, UNLESS YOU ARE A REISERFS DEVELOPER!
[that is generally what I do, but read on]
Well, I have a bit of GOOD NEWS to report -- it turns out that tonight is/was the night of our monthly Linux user group meeting, so I took the box down to the meeting to get some "expert advice" -- after describing things, I got some suggestions and ACTUAL COMMANDS to use that have helped out tremendously.

You had asked if I made a backup before I started, and my comment was basically "I didn't have a free PARTITION to copy it to" -- it turns out I was still thinking "inside the box" -- as suggested at the meeting, it turns out that it WAS possible:

    dd if=/dev/md0 of=/snapshot/oldroot

In other words, rather than copying "the partition" to "another partition", I'm copying a "partition" TO a "file". After much copying, "df" reported 30mb remaining on the "snapshot" device [JUST made it!]

Then, to "work with" the newly copied file, this command turned out to be invaluable:

    losetup /dev/loop0 /snapshot/oldroot

which, in effect, made the "copy of /dev/md0" available as "/dev/loop0" -- I could even run reiserfsck against /dev/loop0!

Now, the next problem [which I didn't really elaborate on earlier]: now that I HAD created a superblock indicating that the blocksize was 1024, any attempt to run "rebuild-sb" again came up with a display showing what it detected for the superblock and asking if it was "ok" -- there didn't seem to be any way to "change" the 1024 value back to 4096! There is some semi-obscure wording in the messages from the reiserfsck output about "zeroing out" the data at offset 64k, but no indication as to "how much" data needed to be zeroed.

While I had the rough idea that "dd" would be the tool to use for "zeroing out" the data "at 64k", I wasn't sure of the specifics -- a few reads through the output of "dd --help", however, pointed me in this direction:

    dd if=/dev/zero of=/dev/loop0 bs=1024 seek=64 count=1

[note that I'm working with the copy of the data...] Once this was done, "rebuild-sb" did indeed let me specify a 4096 byte blocksize.
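The dd invocation that zeroes the block at 64k can be rehearsed on a scratch file before it goes anywhere near real data. What follows is a minimal sketch and not part of the original exchange: the temporary file stands in for the image copy, and the 0xFF fill pattern and the variable names are assumptions made purely for illustration. One difference from writing to /dev/loop0 is worth knowing: when the target is a regular file, dd cuts the file off at the seek offset before writing unless conv=notrunc is given, so everything after the written block is lost.

```shell
#!/bin/sh
# Sketch only: rehearse "zero 1 KiB at offset 64 KiB" on a throwaway file.
# The scratch file is a hypothetical stand-in for the image copy.

img=$(mktemp)

# Fill 128 KiB with 0xFF bytes so the zeroed region is easy to spot.
dd if=/dev/zero bs=1024 count=128 2>/dev/null | LC_ALL=C tr '\000' '\377' > "$img"

# Same geometry as in the message: bs=1024 with seek=64 starts the write
# at byte 65536. conv=notrunc keeps the rest of a regular file intact.
dd if=/dev/zero of="$img" bs=1024 seek=64 count=1 conv=notrunc 2>/dev/null

# Count the nonzero bytes left in the target block and in its neighbour.
target_nonzero=$(dd if="$img" bs=1024 skip=64 count=1 2>/dev/null | LC_ALL=C tr -d '\000' | wc -c)
next_nonzero=$(dd if="$img" bs=1024 skip=65 count=1 2>/dev/null | LC_ALL=C tr -d '\000' | wc -c)
img_size=$(wc -c < "$img")

echo "target block: $target_nonzero nonzero bytes, next block: $next_nonzero, file size: $img_size"
rm -f "$img"
```

If the target block reports no nonzero bytes while its neighbour and the file size are unchanged, the geometry is right and only the intended 1024 bytes were touched.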
This time running "--rebuild-tree" produced a much more "sane" amount of output -- yes, I still lost some files, but only 140-ish out of who knows/remembers how many there were to begin with. I also gained some 340+ entries in the "lost+found" directory -- a spot-check of these files revealed them to be everything from "rc.config" to entire directories of .jpg photos.

So, to recap, the GOOD NEWS is that all my partitions are "readable" again [and linux reboots, mostly...]; the NOT SO good news is that I'll have to figure out what these files "used to be" by hand and move/rename them as appropriate.

-- Yet another Blog: http://osnut.homelinux.net
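The copy-to-a-file step that made this recovery possible is easy to get wrong when space is tight, as the 30mb margin shows. A minimal sketch of the same idea follows, with a free-space check in front and a byte-for-byte comparison afterwards. It is an illustration only: size_bytes and image_to_file are hypothetical helper names, a scratch file stands in for /dev/md0, and blockdev --getsize64 is assumed to be available from a reasonably recent util-linux.

```shell
#!/bin/sh
# Sketch: image a source to a file only if it fits, then verify the copy.
# size_bytes and image_to_file are hypothetical helpers, not standard tools.

size_bytes() {
    # Block device: ask the kernel (util-linux blockdev). Regular file: count bytes.
    if [ -b "$1" ]; then
        blockdev --getsize64 "$1"
    else
        wc -c < "$1"
    fi
}

image_to_file() {
    src=$1
    out=$2
    bytes=$(size_bytes "$src")
    need_kib=$(( (bytes + 1023) / 1024 ))
    # POSIX df -Pk: field 4 of the second line is the available space in KiB.
    have_kib=$(df -Pk "$(dirname "$out")" | awk 'NR == 2 { print $4 }')
    if [ "$need_kib" -gt "$have_kib" ]; then
        echo "not enough room: need $need_kib KiB, have $have_kib KiB" >&2
        return 1
    fi
    # Copy, then compare source and image byte for byte.
    dd if="$src" of="$out" bs=1024k 2>/dev/null && cmp -s "$src" "$out"
}

# Demo on scratch data; on the real system src would be /dev/md0
# and out would be /snapshot/oldroot.
work=$(mktemp -d)
dd if=/dev/urandom of="$work/fake-md0" bs=1024 count=300 2>/dev/null
if image_to_file "$work/fake-md0" "$work/oldroot"; then
    copy_status=verified
else
    copy_status=failed
fi
copied_bytes=$(wc -c < "$work/oldroot")
echo "copy $copy_status, $copied_bytes bytes"
rm -rf "$work"
```

On the real system the verified image would then be attached with losetup, as described in the message, and reiserfsck pointed at the loop device rather than at /dev/md0.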
Tom Emerson
dd if=/dev/md0 of=/snapshot/oldroot
In other words, rather than copying "the partition" to "another partition", I'm copying a "partition" TO a "file". After much copying, "df" reported 30mb remaining on the "snapshot" device [JUST made it!]
30mb = 0.00375 B is really a small amount of disk space. Anyway, it is also possible to compress the file:

    dd if=/dev/md0 | bzip2 > /snapshot/oldroot.bz2

My experience is that the compression ratio is not very good for this kind of data, so "gzip -1", which is much faster, may be a better option. But then you have to work with the original partition.

-- A.M.
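A compressed image is only useful if it can be restored intact, so a round-trip check is worth running once. The sketch below uses the "gzip -1" variant on scratch data; the file names are hypothetical stand-ins for /dev/md0 and /snapshot/oldroot.gz, and random data is used deliberately as a worst case for compression.

```shell
#!/bin/sh
# Sketch: stream an image through gzip -1, then prove it restores intact.
# All paths are hypothetical stand-ins created in a temporary directory.

work=$(mktemp -d)
dd if=/dev/urandom of="$work/fake-md0" bs=1024 count=200 2>/dev/null

# Compress on the fly, as in the bzip2 pipeline but with the faster gzip -1.
dd if="$work/fake-md0" bs=1024k 2>/dev/null | gzip -1 > "$work/oldroot.gz"

# gzip -t checks the integrity of the archive without writing any output.
if gzip -t "$work/oldroot.gz"; then archive_check=ok; else archive_check=bad; fi

# Decompress to stdout and compare with the source byte for byte.
if gzip -dc "$work/oldroot.gz" | cmp -s - "$work/fake-md0"; then
    roundtrip=ok
else
    roundtrip=bad
fi

echo "archive: $archive_check, round trip: $roundtrip"
rm -rf "$work"
```

A compressed image cannot be attached with losetup as it stands, so it serves as a backup only; any repair attempt then has to run on the original partition, which is the trade-off noted in the message.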
participants (3)
- Alexandr Malusek
- James Oakley
- Tom Emerson