RE: [SLE] Hangs on 2.4.21-169, LVM, reiser, megaraid2 (update)
True to form, 4 days after the last reboot we had another file system I/O hang, this time under heavy I/O. Same symptoms as usual: eventually any NEW process that tries to access the affected file system hangs and cannot be killed, even with "-9". We've directed the kernel messages to the serial port, and there were no messages in our logs or on our serial port capture. Seriously, nothing.

Prior to this file system hang we usually experience some sort of pipe hang as well. We've got lots of scripts that move files, pipe stdio, grep logs, etc. When one of these processes hangs we can go to /proc/<pid>/fd and cat the files (links) listed. Usually they are pipes, sometimes files. In some cases, simply cat-ing the files will cause whatever is blocking to free up, and the processes finish normally after that. No, I am not kidding. (There is a short sketch of this poking around after the ps listing below.) This kind of "pipe hang" is almost always followed within 24 hours by a file system hang.

It would seem that the file system hangs hit new processes, not necessarily those already running. Our Oracle database, with files on the affected file system, shut down cleanly with a "shutdown immediate". Obviously, if the "broken" file system happens to be root, then we are toast.

Here is some ps output that shows the processes that are hung accessing our dead file system. I tried to get as much info as possible. Notice the STAT and WCHAN columns on the dead processes: STAT is usually D, i.e. uninterruptible sleep (duh), and WCHAN is usually down or rwsem_down_fail.

# ps -eo user,stat,pid,ppid,cpu,vsz,rss,%mem,tty,start,time,c,pri,ni,addr,sz,stime,tty,wchan=WCHANLONGOUTPUT -o cmd
USER STAT PID PPID CPU VSZ RSS %MEM TT STARTED TIME C PRI NI ADDR SZ STIME TT WCHANLONGOUTPUT CMD
sis S 709 31138 - 2532 1264 0.0 ? 23:33:33 00:00:00 0 23 0 - 633 23:33 ? wait4 /bin/sh /u01/home/sis/bin/loadi umkc_sis sndb 20040560800
root S 1254 1 - 4092 956 0.0 ? Feb 20 00:00:03 0 24 0 - 1023 Feb20 ? schedule_timeou /usr/sbin/sshd
sis D 1352 709 - 2532 1280 0.0 ? 23:33:36 00:00:00 0 24 0 - 633 23:33 ? down /bin/sh /u01/home/sis/bin/loadi umkc_sis sndb 20040560800
sis S 1378 1352 - 2540 1284 0.0 ? 23:33:36 00:00:00 0 24 0 - 635 23:33 ? wait4 /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_general_info
sis D 4878 1352 - 2540 1284 0.0 ? 23:34:16 00:00:00 0 24 0 - 635 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_instructor
sis D 5013 1378 - 1580 584 0.0 ? 23:34:18 00:00:08 0 16 0 - 395 23:34 ? rwsem_down_fail mv /u01/home/sis/data/umkc_sis.sec_general_info.dat /u01/home/sis/save
sis S 6148 31138 - 2532 1272 0.0 ? 23:34:34 00:00:00 0 24 0 - 633 23:34 ? wait4 /bin/sh /u01/home/sis/bin/loadi umsl_sis hsdb 20040560800
sis D 6453 1 - 2540 1284 0.0 ? 23:34:35 00:00:00 0 24 0 - 635 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umsl_sis high_school
sis D 6537 1352 - 2540 1284 0.0 ? 23:34:36 00:00:00 0 24 0 - 635 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_joint_course
sis S 7254 1 - 2568 1312 0.0 ? 23:34:39 00:00:00 0 23 0 - 642 23:34 ? wait4 /bin/sh /u01/home/sis/bin/prepsh umsl_sis slif
sis D 7395 1 - 2540 1284 0.0 ? 23:34:45 00:00:00 0 24 0 - 635 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umsl_sis high_school_subject
sis D 7476 7254 - 1568 568 0.0 ? 23:34:46 00:00:00 0 21 0 - 392 23:34 ? rwsem_down_fail mv /u01/home/sis/prep/umsl_sis.stu_slif.dat /u01/home/sis/data
sis D 7504 1352 - 2540 1284 0.0 ? 23:34:46 00:00:00 0 24 0 - 635 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_section_comments
umdba S 7581 1 - 2512 1248 0.0 pts/4 23:34:48 00:00:00 0 24 0 - 628 23:34 pts/4 wait4 sh -c sisxc
umdba D 7582 7581 - 1740 652 0.0 pts/4 23:34:48 00:00:00 0 22 0 - 435 23:34 pts/4 down ls -l data/
sis S 8277 31138 - 2524 1268 0.0 ? 23:34:54 00:00:00 0 21 0 - 631 23:34 ? wait4 /bin/sh /u01/home/sis/bin/loadtime 20040560800 20040560500
sis D 8278 8277 - 2524 1276 0.0 ? 23:34:54 00:00:00 0 24 0 - 631 23:34 ? down /bin/sh /u01/home/sis/bin/loadtime 20040560800 20040560500
sis D 8285 6148 - 1568 568 0.0 ? 23:34:55 00:00:00 0 24 0 - 392 23:34 ? rwsem_down_fail mv -f /u01/home/sis/data/umsl_sis.hsdb.info /u01/home/sis/save
sis D 8313 1352 - 2528 1260 0.0 ? 23:34:56 00:00:00 0 21 0 - 632 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_room_request
sis D 9147 1352 - 2528 1260 0.0 ? 23:35:07 00:00:00 0 21 0 - 632 23:35 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_location_pref
sis D 10197 1352 - 2528 1260 0.0 ? 23:35:17 00:00:00 0 21 0 - 632 23:35 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_sched_comments
sis S 11681 1 - 2568 1312 0.0 ? 23:35:45 00:00:00 0 21 0 - 642 23:35 ? wait4 /bin/sh /u01/home/sis/bin/prepsh umkc_sis cedb
sis D 11716 11681 - 1972 548 0.0 ? 23:35:45 00:00:00 0 20 0 - 493 23:35 ? down /u01/home/sis/bin/sis -d umkc_sis.cedb.raw
sis S 11759 1 - 2568 1312 0.0 ? 23:35:46 00:00:00 0 21 0 - 642 23:35 ? wait4 /bin/sh /u01/home/sis/bin/prepsh umkc_sis cmdb
sis D 11794 11759 - 1984 556 0.0 ? 23:35:46 00:00:00 0 19 0 - 496 23:35 ? down /u01/home/sis/bin/sis -d umkc_sis.cmdb.raw
sis S 12302 1 - 2568 1312 0.0 ? 23:35:53 00:00:00 0 20 0 - 642 23:35 ? wait4 /bin/sh /u01/home/sis/bin/prepsh umkc_sis crdb
sis D 12335 12302 - 1968 544 0.0 ? 23:35:53 00:00:00 0 19 0 - 492 23:35 ? down /u01/home/sis/bin/sis -d umkc_sis.crdb.raw
root S 14290 1254 - 4848 1948 0.0 ? 21:20:38 00:00:00 0 24 0 - 1212 21:20 ? schedule_timeou /usr/sbin/sshd
umdba S 14295 14290 - 2828 1696 0.0 pts/1 21:20:41 00:00:00 0 23 0 - 707 21:20 pts/1 wait4 -bash
umdba S 20075 1 - 2508 1232 0.0 pts/4 23:42:42 00:00:00 0 24 0 - 627 23:42 pts/4 wait4 sh -c sisxc
umdba D 20076 20075 - 1740 652 0.0 pts/4 23:42:42 00:00:00 0 20 0 - 435 23:42 pts/4 rwsem_down_fail ls -l data/
sis D 22623 32242 - 1752 672 0.0 pts/1 23:50:07 00:00:00 0 24 0 - 438 23:50 pts/1 down ls -N --color=tty -T 0
root S 23256 1254 - 4764 1900 0.0 ? 23:51:10 00:00:00 0 24 0 - 1191 23:51 ? schedule_timeou /usr/sbin/sshd
umdba S 23284 23256 - 2820 1692 0.0 pts/6 23:51:12 00:00:00 0 24 0 - 705 23:51 pts/6 wait4 -bash
umdba D 23372 23284 - 1744 668 0.0 pts/6 23:51:27 00:00:00 0 24 0 - 436 23:51 pts/6 rwsem_down_fail ls -N --color=tty -T 0 -l /u02
oracle D 23678 1 - 1768 680 0.0 ? 23:53:34 00:00:00 0 24 0 - 442 23:53 ? rwsem_down_fail ls -N --color=tty -T 0 -l
root S 31060 2204 - 1460 624 0.0 ? 18:00:00 00:00:00 0 24 0 - 365 18:00 ? pipe_wait /USR/SBIN/CRON
sis S 31066 31060 - 2232 1064 0.0 ? 18:00:00 00:00:00 0 20 0 - 558 18:00 ? wait4 /bin/sh -c /u01/home/sis/bin/loadsh 0800 0600 0500 2>&1 > $HOME/log/loadsh.log.`date +\%w`
sis S 31070 31066 - 2564 1360 0.0 ? 18:00:00 00:00:00 0 19 0 - 641 18:00 ? wait4 /bin/sh /u01/home/sis/bin/loadsh 0800 0600 0500
sis S 31138 31070 - 2524 1268 0.0 ? 18:00:01 00:00:13 0 24 0 - 631 18:00 ? pipe_wait /bin/sh /u01/home/sis/bin/loadtime 20040560800 20040560500
sis S 32242 14295 - 2816 1692 0.0 pts/1 21:52:10 00:00:00 0 24 0 - 704 21:52 pts/1 wait4 /bin/bash

Any help is highly appreciated!
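For what it's worth, here is roughly what we do when one of these pipe hangs shows up. This is only a sketch: the PID is one of the stuck mv processes from the listing above, but the fd number is made up for illustration.

    PID=5013

    # State and wait channel; D plus down/rwsem_down_fail is the usual pattern
    ps -o pid,stat,wchan -o cmd -p $PID

    # List its open descriptors; each entry is a link to a file or to a pipe
    # such as "pipe:[12345]"
    ls -l /proc/$PID/fd

    # Cat-ing one of those links by hand is what sometimes frees things up,
    # e.g. for descriptor 3:
    cat /proc/$PID/fd/3 > /dev/null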
Andy

________________________________
From: McAllister, Andrew [mailto:McAllisterA@umsystem.edu]
Sent: Fri 2/20/2004 5:28 PM
To: suse-linux-e@suse.com
Subject: RE: [SLE] Hangs on 2.4.21-169, LVM, reiser, megaraid2
Hardware:

2 PERC 4 DC controllers (LSI); we are running the megaraid2 driver v2.00.8 (Release Date: Wed Aug 27 18:50:49 EDT 2003).

Software:

> It would be really helpful if you can provide kernel messages. You can
> hook up a null modem cable to the serial port and collect all messages.
Doing that tonight. We'll have to reboot to enable the output to ttyS0, so that will probably preclude any sort of hang for another 3-5 days. (A rough sketch of the serial console setup is appended at the end of this message.)

snip

The disk I/O hangs have been of two kinds. In one, an entire mounted file system hangs completely: if you do an ls, your process hangs and cannot be killed. In the other, some processes still work on the file system, but simple commands like cp and mv hang and cannot be killed.

Our file systems are 146 GB reiserfs, made up of 3 logical volumes controlled by LVM. The three logical volumes are actually RAID arrays, each one third of a mirrored pair of 146 GB drives. It sounds strange to split up a drive into three pieces and then put it back together, but the Dell techs said this would let us get enough I/O queues per drive to overcome the Linux hardware RAID bottleneck. (A sketch of roughly that layout is also appended below.)

I'll post again on the next hang.

Thanks
Andy

snip
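For reference, the serial console setup we're talking about amounts to roughly the following. The bootloader (LILO here), the 9600 baud rate, and the capture command are assumptions for the sketch, not necessarily what ends up on our boxes.

    # In /etc/lilo.conf (a GRUB kernel line works the same way): send kernel
    # messages to ttyS0 and keep the VGA console as well:
    #   append = "console=ttyS0,9600n8 console=tty0"
    # then rerun /sbin/lilo and reboot.

    # On the machine at the other end of the null modem cable, capture with
    # something like (screen logs to screenlog.0), or use minicom:
    screen -L /dev/ttyS0 9600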
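And for anyone trying to picture the volume layout described above, it was built along these lines. Device, volume group, and logical volume names are hypothetical, and the sizes are approximate; each physical device here is one PERC logical drive, i.e. one third of a mirrored pair of 146 GB disks.

    # Three controller-exported logical drives become LVM physical volumes
    pvcreate /dev/sdb /dev/sdc /dev/sdd
    vgcreate datavg /dev/sdb /dev/sdc /dev/sdd

    # One large logical volume across the three pieces, with reiserfs on top
    lvcreate -L 140G -n lv_u01 datavg
    mkreiserfs /dev/datavg/lv_u01
    mount /dev/datavg/lv_u01 /u01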