Mailinglist Archive: opensuse-factory (1233 mails)

Re: [opensuse-factory] Full system freeze - how can be investigated?
  • From: Dave Plater <dave.plater@xxxxxxxxxxx>
  • Date: Sat, 29 Nov 2008 10:01:55 +0200
  • Message-id: <4930F6F3.70305@xxxxxxxxxxx>
Carlos E. R. wrote:


On Friday, 2008-11-28 at 07:56 +0200, Dave Plater wrote:

There is a Yast module to set it up and it's supposed to trigger on an
oops.

I saw <http://en.opensuse.org/YaST/Modules/Kdump> where it says:

] Proposal of new YaST module for configuration kdump

so I understood that the module didn't exist yet. Somebody forgot to
update the wiki, then.

There is also an Alt+SysRq+C keyboard shortcut to activate it.

Which probably will not work, since the whole system freezes. But I'll
try. I'll have to write this down on paper :-}

What I would do first with a problem such as yours is to downgrade to a
previous kernel,

I'm not sure it will work, though. I think there was a previous freeze
a month ago, but I dismissed it as chance. It doesn't freeze
immediately, anyway. It ran for an hour or so.

reboot, save boot.msg and confirm that it is a kernel
bug. I tried to get a crash dump of an oops I was getting when
starting suspend, but the problem went away once kdump was activated.
So I haven't actually seen it in action, but it's simple to install
and set up with yast.

I'll have to try... provided it doesn't crash while I run yast.

Just out of interest, what is the last line in /var/log/messages at the
time your system freezes?

Nothing important. Let me see...

Nov 23 18:56:14 minas-morgul smartd[3727]: Device: /dev/hdd, SMART
Usage Attribute: 194 Temperature_Celsius changed from 33 to 36
Nov 23 18:56:14 minas-morgul smartd[3727]: Device: /dev/hdd, SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 58 to 59
Nov 23 19:22:27 minas-morgul su: (to root) cer on /dev/pts/3
Nov 23 19:22:55 minas-morgul kernel: kjournald starting. Commit
interval 5 seconds
Nov 23 19:22:55 minas-morgul kernel: EXT3 FS on hdd6, internal journal
Nov 23 19:22:55 minas-morgul kernel: EXT3-fs: mounted filesystem with
ordered data mode.

here I had mounted the "stable" root partition

Nov 23 19:22:57 minas-morgul kernel: SGI XFS with ACLs, security
attributes, realtime, large block numbers, dmapi support, no debug enabled
Nov 23 19:22:57 minas-morgul kernel: SGI XFS Quota Management subsystem
Nov 23 19:22:57 minas-morgul kernel: XFS mounting filesystem hda11
Nov 23 19:22:57 minas-morgul kernel: Ending clean XFS mount for
filesystem: hda11

and here the stable (i.e., main) home partition, to read some things there.

Nov 23 19:25:40 minas-morgul syslog-ng[2139]: Log statistics;
dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0',
processed='center(queued)=233', processed='center(received)=189',
processed='destination(newsnotice)=0',
processed='destination(acpid)=3',
processed='destination(firewall)=15', processed='destination(null)=3',
processed='destination(mail)=2', processed='destination(mailinfo)=2',
processed='destination(console)=11',
processed='destination(newserr)=0',
processed='destination(newscrit)=0',
processed='destination(messages)=166',
processed='destination(mailwarn)=0',
processed='destination(localmessages)=0',
processed='destination(netmgm)=0', processed='destination(mailerr)=0',
processed='destination(xconsole)=11',
processed='destination(warn)=20', processed='source(src)=189'

Nov 23 19:26:15 minas-morgul smartd[3727]: Device: /dev/hda, SMART
Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 107 to 104
Nov 23 19:26:15 minas-morgul smartd[3727]: Device: /dev/hda, SMART
Usage Attribute: 195 Hardware_ECC_Recovered changed from 62 to 55
Nov 23 19:26:15 minas-morgul smartd[3727]: Device: /dev/hdb, SMART
Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 57 to 59
Nov 23 19:26:15 minas-morgul smartd[3727]: Device: /dev/hdb, SMART
Usage Attribute: 194 Temperature_Celsius changed from 27 to 28
Nov 23 19:26:15 minas-morgul smartd[3727]: Device: /dev/hdd, SMART
Usage Attribute: 194 Temperature_Celsius changed from 36 to 35


But the firewall log contains much later entries at Nov 23 19:51:34,
so nothing above is related to the freeze.
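
The comparison above (which log has the latest entry before the freeze)
can be done by script instead of by eye. What follows is only a rough
sketch: the function names are made up for illustration, the year has to
be passed in because classic syslog timestamps do not carry one, and the
month abbreviation is assumed to be in English. Wrapped continuation
lines, like the ones in the paste above, are skipped because they do not
start with a timestamp.

```python
import re
from datetime import datetime

# Classic syslog prefix, e.g. "Nov 23 19:26:15 host prog[pid]: ..."
STAMP = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")


def last_entry(lines, year):
    """Return (datetime, line) for the newest timestamped line, or None."""
    latest = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # wrapped continuation line or unrelated text
        when = datetime.strptime(f"{year} {match.group(1)}",
                                 "%Y %b %d %H:%M:%S")
        if latest is None or when >= latest[0]:
            latest = (when, line.rstrip("\n"))
    return latest


def newest_log(logs, year):
    """Given {name: lines}, return (name, (datetime, line)) of the log
    whose last timestamped entry is the most recent, or None."""
    found = {}
    for name, lines in logs.items():
        entry = last_entry(lines, year)
        if entry is not None:
            found[name] = entry
    if not found:
        return None
    return max(found.items(), key=lambda item: item[1][0])
```

Feeding it the lines read from /var/log/messages and from the firewall
log should point at whichever file was written to last, which is the
first place to look for hints about the freeze.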

-- Cheers,
Carlos E. R.
There's a program called mcelog that logs machine check exceptions; you
may be able to get some more specific information from that. Read
/usr/src/linux/Documentation/kernel-parameters.txt to find the various
boot parameters for the kernel. There is one called "debug", which
apparently increases log verbosity to 10 and may help. I assume you have
run a memtest as well?

If the problem disappears after downgrading the kernel, you can try a
diff between the two kernel source .configs after make cloneconfig. You
may then find boot options that turn off whatever new kernel options
came in with the problem kernel.

Lastly, you may get better help on the opensuse-kernel list.
Regards
Dave P
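
As a footnote to the .config comparison suggested above: a plain diff
works, but it also shows reordered lines and comments. A minimal sketch
of a comparison that lists only the options whose value changed could
look like this. The function names are hypothetical, and it assumes the
usual .config layout ("CONFIG_FOO=value" for set options and
"# CONFIG_FOO is not set" for disabled ones).

```python
import re

SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option name -> value; explicitly disabled options map to 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return {option: (old_value, new_value)} for every option that
    differs; None means the option is absent from that config."""
    old = parse_config(old_text)
    new = parse_config(new_text)
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes
```

Read both .config files into strings and pass them in, older kernel
first. Options that show None on the old side are the ones that are new
in the problem kernel, so those are the first candidates to look up in
kernel-parameters.txt.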
