RE: [SLE] Hangs on 2.4.21-169, LVM, reiser, megaraid2 (update)
True to form, 4 days after the last reboot we had another file system I/O hang, this time under heavy I/O. Same symptoms as usual: eventually any NEW process that tries to access the affected file system hangs and cannot be killed, even with "-9". We've directed the kernel messages to the serial port, and there were no messages in our logs or on our serial port capture. Seriously, nothing.

Prior to this file system hang we usually experience some sort of pipe hang as well. We've got lots of scripts that move files, pipe stdio, grep logs, etc. When one of these processes hangs we can go to /proc/<pid>/fd and cat the files (links) listed. Usually they are pipes, sometimes files. In some cases, simply cat-ing the files will cause whatever is blocking to free up, and the processes finish normally after that. No, I am not kidding. (There is a short sketch of this poking around after the ps listing below.) This kind of "pipe hang" is almost always followed within 24 hours by a file system hang.

It would seem that the file system hangs hit new processes, not necessarily those already running. Our Oracle database, with files on the affected file system, shut down cleanly with a "shutdown immediate". Obviously, if the "broken" file system happens to be root, then we are toast.

Here is some ps output that shows the processes that are hung accessing our dead file system. I tried to get as much info as possible. Notice the STAT and WCHAN columns on the dead processes: STAT is usually D, i.e. uninterruptible sleep (duh), and WCHAN is usually down or rwsem_down_fail.

# ps -eo user,stat,pid,ppid,cpu,vsz,rss,%mem,tty,start,time,c,pri,ni,addr,sz,stime,tty,wchan=WCHANLONGOUTPUT -o cmd
USER STAT PID PPID CPU VSZ RSS %MEM TT STARTED TIME C PRI NI ADDR SZ STIME TT WCHANLONGOUTPUT CMD
sis S 709 31138 - 2532 1264 0.0 ? 23:33:33 00:00:00 0 23 0 - 633 23:33 ? wait4 /bin/sh /u01/home/sis/bin/loadi umkc_sis sndb 20040560800
root S 1254 1 - 4092 956 0.0 ? Feb 20 00:00:03 0 24 0 - 1023 Feb20 ? schedule_timeou /usr/sbin/sshd
sis D 1352 709 - 2532 1280 0.0 ? 23:33:36 00:00:00 0 24 0 - 633 23:33 ? down /bin/sh /u01/home/sis/bin/loadi umkc_sis sndb 20040560800
sis S 1378 1352 - 2540 1284 0.0 ? 23:33:36 00:00:00 0 24 0 - 635 23:33 ? wait4 /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_general_info
sis D 4878 1352 - 2540 1284 0.0 ? 23:34:16 00:00:00 0 24 0 - 635 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_instructor
sis D 5013 1378 - 1580 584 0.0 ? 23:34:18 00:00:08 0 16 0 - 395 23:34 ? rwsem_down_fail mv /u01/home/sis/data/umkc_sis.sec_general_info.dat /u01/home/sis/save
sis S 6148 31138 - 2532 1272 0.0 ? 23:34:34 00:00:00 0 24 0 - 633 23:34 ? wait4 /bin/sh /u01/home/sis/bin/loadi umsl_sis hsdb 20040560800
sis D 6453 1 - 2540 1284 0.0 ? 23:34:35 00:00:00 0 24 0 - 635 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umsl_sis high_school
sis D 6537 1352 - 2540 1284 0.0 ? 23:34:36 00:00:00 0 24 0 - 635 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_joint_course
sis S 7254 1 - 2568 1312 0.0 ? 23:34:39 00:00:00 0 23 0 - 642 23:34 ? wait4 /bin/sh /u01/home/sis/bin/prepsh umsl_sis slif
sis D 7395 1 - 2540 1284 0.0 ? 23:34:45 00:00:00 0 24 0 - 635 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umsl_sis high_school_subject
sis D 7476 7254 - 1568 568 0.0 ? 23:34:46 00:00:00 0 21 0 - 392 23:34 ? rwsem_down_fail mv /u01/home/sis/prep/umsl_sis.stu_slif.dat /u01/home/sis/data
sis D 7504 1352 - 2540 1284 0.0 ? 23:34:46 00:00:00 0 24 0 - 635 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_section_comments
umdba S 7581 1 - 2512 1248 0.0 pts/4 23:34:48 00:00:00 0 24 0 - 628 23:34 pts/4 wait4 sh -c sisxc
umdba D 7582 7581 - 1740 652 0.0 pts/4 23:34:48 00:00:00 0 22 0 - 435 23:34 pts/4 down ls -l data/
sis S 8277 31138 - 2524 1268 0.0 ? 23:34:54 00:00:00 0 21 0 - 631 23:34 ? wait4 /bin/sh /u01/home/sis/bin/loadtime 20040560800 20040560500
sis D 8278 8277 - 2524 1276 0.0 ? 23:34:54 00:00:00 0 24 0 - 631 23:34 ? down /bin/sh /u01/home/sis/bin/loadtime 20040560800 20040560500
sis D 8285 6148 - 1568 568 0.0 ? 23:34:55 00:00:00 0 24 0 - 392 23:34 ? rwsem_down_fail mv -f /u01/home/sis/data/umsl_sis.hsdb.info /u01/home/sis/save
sis D 8313 1352 - 2528 1260 0.0 ? 23:34:56 00:00:00 0 21 0 - 632 23:34 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_room_request
sis D 9147 1352 - 2528 1260 0.0 ? 23:35:07 00:00:00 0 21 0 - 632 23:35 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_location_pref
sis D 10197 1352 - 2528 1260 0.0 ? 23:35:17 00:00:00 0 21 0 - 632 23:35 ? down /bin/sh /u01/home/sis/bin/loaddbst umkc_sis sec_sched_comments
sis S 11681 1 - 2568 1312 0.0 ? 23:35:45 00:00:00 0 21 0 - 642 23:35 ? wait4 /bin/sh /u01/home/sis/bin/prepsh umkc_sis cedb
sis D 11716 11681 - 1972 548 0.0 ? 23:35:45 00:00:00 0 20 0 - 493 23:35 ? down /u01/home/sis/bin/sis -d umkc_sis.cedb.raw
sis S 11759 1 - 2568 1312 0.0 ? 23:35:46 00:00:00 0 21 0 - 642 23:35 ? wait4 /bin/sh /u01/home/sis/bin/prepsh umkc_sis cmdb
sis D 11794 11759 - 1984 556 0.0 ? 23:35:46 00:00:00 0 19 0 - 496 23:35 ? down /u01/home/sis/bin/sis -d umkc_sis.cmdb.raw
sis S 12302 1 - 2568 1312 0.0 ? 23:35:53 00:00:00 0 20 0 - 642 23:35 ? wait4 /bin/sh /u01/home/sis/bin/prepsh umkc_sis crdb
sis D 12335 12302 - 1968 544 0.0 ? 23:35:53 00:00:00 0 19 0 - 492 23:35 ? down /u01/home/sis/bin/sis -d umkc_sis.crdb.raw
root S 14290 1254 - 4848 1948 0.0 ? 21:20:38 00:00:00 0 24 0 - 1212 21:20 ? schedule_timeou /usr/sbin/sshd
umdba S 14295 14290 - 2828 1696 0.0 pts/1 21:20:41 00:00:00 0 23 0 - 707 21:20 pts/1 wait4 -bash
umdba S 20075 1 - 2508 1232 0.0 pts/4 23:42:42 00:00:00 0 24 0 - 627 23:42 pts/4 wait4 sh -c sisxc
umdba D 20076 20075 - 1740 652 0.0 pts/4 23:42:42 00:00:00 0 20 0 - 435 23:42 pts/4 rwsem_down_fail ls -l data/
sis D 22623 32242 - 1752 672 0.0 pts/1 23:50:07 00:00:00 0 24 0 - 438 23:50 pts/1 down ls -N --color=tty -T 0
root S 23256 1254 - 4764 1900 0.0 ? 23:51:10 00:00:00 0 24 0 - 1191 23:51 ? schedule_timeou /usr/sbin/sshd
umdba S 23284 23256 - 2820 1692 0.0 pts/6 23:51:12 00:00:00 0 24 0 - 705 23:51 pts/6 wait4 -bash
umdba D 23372 23284 - 1744 668 0.0 pts/6 23:51:27 00:00:00 0 24 0 - 436 23:51 pts/6 rwsem_down_fail ls -N --color=tty -T 0 -l /u02
oracle D 23678 1 - 1768 680 0.0 ? 23:53:34 00:00:00 0 24 0 - 442 23:53 ? rwsem_down_fail ls -N --color=tty -T 0 -l
root S 31060 2204 - 1460 624 0.0 ? 18:00:00 00:00:00 0 24 0 - 365 18:00 ? pipe_wait /USR/SBIN/CRON
sis S 31066 31060 - 2232 1064 0.0 ? 18:00:00 00:00:00 0 20 0 - 558 18:00 ? wait4 /bin/sh -c /u01/home/sis/bin/loadsh 0800 0600 0500 2>&1 > $HOME/log/loadsh.log.`date +\%w`
sis S 31070 31066 - 2564 1360 0.0 ? 18:00:00 00:00:00 0 19 0 - 641 18:00 ? wait4 /bin/sh /u01/home/sis/bin/loadsh 0800 0600 0500
sis S 31138 31070 - 2524 1268 0.0 ? 18:00:01 00:00:13 0 24 0 - 631 18:00 ? pipe_wait /bin/sh /u01/home/sis/bin/loadtime 20040560800 20040560500
sis S 32242 14295 - 2816 1692 0.0 pts/1 21:52:10 00:00:00 0 24 0 - 704 21:52 pts/1 wait4 /bin/bash

Any help is highly appreciated!
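For what it's worth, here is roughly what we do when one of these pipe hangs shows up. This is only a sketch: the PID is one of the stuck mv processes from the listing above, but the fd number is made up for illustration.

    PID=5013

    # State and wait channel; D plus down/rwsem_down_fail is the usual pattern
    ps -o pid,stat,wchan -o cmd -p $PID

    # List its open descriptors; each entry is a link to a file or to a pipe
    # such as "pipe:[12345]"
    ls -l /proc/$PID/fd

    # Cat-ing one of those links by hand is what sometimes frees things up,
    # e.g. for descriptor 3:
    cat /proc/$PID/fd/3 > /dev/null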
Andy

________________________________
From: McAllister, Andrew [mailto:McAllisterA@umsystem.edu]
Sent: Fri 2/20/2004 5:28 PM
To: suse-linux-e@suse.com
Subject: RE: [SLE] Hangs on 2.4.21-169, LVM, reiser, megaraid2
Hardware:

2 PERC 4 DC controllers (LSI); we are running the megaraid2 driver v2.00.8 (Release Date: Wed Aug 27 18:50:49 EDT 2003).

Software:

> It would be really helpful if you can provide kernel messages. You can
> hook up a null modem cable to the serial port and collect all messages.
Doing that tonight. We'll have to reboot to enable the output to ttyS0, so that will probably preclude any sort of hang for another 3-5 days. (A rough sketch of the serial console setup is appended at the end of this message.)

snip

The disk I/O hangs have been of two kinds. In one, an entire mounted file system hangs completely: if you do an ls, your process hangs and cannot be killed. In the other, some processes still work on the file system, but simple commands like cp and mv hang and cannot be killed.

Our file systems are 146 GB reiserfs, made up of 3 logical volumes controlled by LVM. The three logical volumes are actually RAID arrays, each one third of a mirrored pair of 146 GB drives. It sounds strange to split up a drive into three pieces and then put it back together, but the Dell techs said this would let us get enough I/O queues per drive to overcome the Linux hardware RAID bottleneck. (A sketch of roughly that layout is also appended below.)

I'll post again on the next hang.

Thanks
Andy

snip
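For reference, the serial console setup we're talking about amounts to roughly the following. The bootloader (LILO here), the 9600 baud rate, and the capture command are assumptions for the sketch, not necessarily what ends up on our boxes.

    # In /etc/lilo.conf (a GRUB kernel line works the same way): send kernel
    # messages to ttyS0 and keep the VGA console as well:
    #   append = "console=ttyS0,9600n8 console=tty0"
    # then rerun /sbin/lilo and reboot.

    # On the machine at the other end of the null modem cable, capture with
    # something like (screen logs to screenlog.0), or use minicom:
    screen -L /dev/ttyS0 9600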
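And for anyone trying to picture the volume layout described above, it was built along these lines. Device, volume group, and logical volume names are hypothetical, and the sizes are approximate; each physical device here is one PERC logical drive, i.e. one third of a mirrored pair of 146 GB disks.

    # Three controller-exported logical drives become LVM physical volumes
    pvcreate /dev/sdb /dev/sdc /dev/sdd
    vgcreate datavg /dev/sdb /dev/sdc /dev/sdd

    # One large logical volume across the three pieces, with reiserfs on top
    lvcreate -L 140G -n lv_u01 datavg
    mkreiserfs /dev/datavg/lv_u01
    mount /dev/datavg/lv_u01 /u01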