Dual Opteron crashes with Kernel Panic

Andreas Wahlert

25 Nov 2004

12:46

Hi List, I tried to install SuSE 9.2 on an FSC V810 Dual Opteron 250 workstation with 4 GB RAM. No way. If I boot the standard SMP 64-bit kernel, I get a kernel panic at boot. If I boot the standard 64-bit kernel without SMP, everything seems to work fine. As a result of these tests, I tried to install SuSE 9.1. This distribution boots with the standard SMP 64-bit kernel, but later it is unstable, i.e. X crashes without any comment or log entries. Does anyone have experience with this problem? Thanks, Andreas Wahlert


Andi Kleen

25 Nov

12:55

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic

On Thu, Nov 25, 2004 at 01:46:03PM +0100, Andreas Wahlert wrote:

...

Hi List,

i tried to install SuSE 9.2 on FSC V810 Dual Opteron 250 WS with 4 GB Ram. No way. If i boot the standard smp 64 Bit Kernel i get a kernel panic at the boot. If i boot the standard 64 Bit Kernel without smp, everything seems to work fine.

What is the full text of the kernel panic?

...

As a result from this tests, i have tried to install SuSE 9.1. This Distri boots with the standard smp 64 Bit kernel bit later it's unstable. I.e. X crashes without any comment or log entrys.

The X server crashes or the whole machine?

...

There are any experiences regarding this problem??

Other people are running this machine successfully, I believe, so it must be some local issue. I would update to the latest BIOS and check the memory with memtest86 for several hours. -Andi

Andreas Wahlert

13:02

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic


Hi, first: the whole machine crashes. The BIOS is the latest available. But I know that FSC has had a "V2" version of this machine with a Tyan mainboard for several weeks. The older version of the V810 works fine without any problems. In the meantime I have tried to install Red Hat Fedora Core 3. Same kernel panic: NMI watchdog detected Lockup on CPU0 comm: swapper not tainted Kernel 2.6.x. How can I get the whole panic text? Friendly regards, Andreas

Andi Kleen

13:08

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic

...

In the meantime i have tried to install Redhat Fedore 3. Same Kernel Panic:

NMI watchdog detected Lockup on CPU0

There should be more text after that, including a register dump and a backtrace (an "oops"). You can probably get better output by connecting a null modem cable and using a serial console (boot with console=ttyS0,baudrate). -Andi
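The serial console suggestion above can be sketched as a small helper that builds the boot parameters. This is only an illustration: the function name is hypothetical, the default port and baud rate are assumptions, and the extra console=tty0 entry (which keeps messages on the local screen as well) is an addition that is not part of the advice in this thread.

```shell
#!/bin/sh
# Build the kernel command-line fragment for a serial console.
# serial_console_args is a hypothetical helper name; ttyS0 and 38400
# are assumed defaults, adjust them to the port and speed in use.
serial_console_args() {
    port="${1:-ttyS0}"
    baud="${2:-38400}"
    # console=tty0 keeps output on the local screen too; the serial
    # entry comes last so that it becomes the primary console.
    printf 'console=tty0 console=%s,%s\n' "$port" "$baud"
}

serial_console_args ttyS0 38400
```

The resulting string would be appended to the kernel line in the boot loader, and the terminal program on the other end of the null modem cable has to be set to the same speed.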

Andreas Wahlert

26 Nov

08:55

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic


Ahhhhh, I guess I have found my problem: http://www.x86-64.org/lists/discuss/msg05795.html What happened with this? This is exactly "my" kernel panic. But I have just 2 processors. Regards, Andreas

Andi Kleen

10:31

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic

...

i guess i have found my problem:

http://www.x86-64.org/lists/discuss/msg05795.html

What's happen with this??? This is extactly "my" kernel panic. But i have just 2 processors.

Can you post the full text? -Andi

Andreas Wahlert

10:48

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic


It's tricky. I think the panic comes too early. My minicom blinks "online" for a very short time, but there is no output in the minicom interface. I'm in a serious discussion with the FSC Celsius man in Augsburg. Perhaps there will be some results today or on Monday. Regards, Andreas

Andi Kleen

10:53

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic

On Fri, Nov 26, 2004 at 11:48:07AM +0100, Andreas Wahlert wrote:

...

Andi Kleen wrote:

...
...
i guess i have found my problem:

http://www.x86-64.org/lists/discuss/msg05795.html

What's happen with this??? This is extactly "my" kernel panic. But i have just 2 processors.

Can you post the full text?

-Andi

It's tricky.

i think, the panik is to early. My minicom is blinking "online" a very short time. But there are no output in the minicom interface.

First I would test if the cable works from a working system (baud rate matches etc.). Boot a working kernel and do

    stty speed <baudrate> < /dev/ttyS0
    echo hello > /dev/ttyS0

and check if hello appears in the minicom.

If you think the panic is too early, you can additionally boot with

    earlyprintk=serial,ttyS0,baudrate

This will print the earlier kernel messages too. -Andi
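The cable check and the early-printk option described above could be wrapped roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the defaults are guesses, and it uses GNU stty's -F option to select the device instead of the input redirection shown in the message.

```shell
#!/bin/sh
# Send a test line over a serial port from a known-good system.
# serial_hello is a hypothetical helper; device and baud defaults are
# assumptions. Returns non-zero if the device does not exist.
serial_hello() {
    dev="${1:-/dev/ttyS0}"
    baud="${2:-38400}"
    if [ ! -c "$dev" ]; then
        echo "no such character device: $dev" >&2
        return 1
    fi
    stty -F "$dev" "$baud" || return 1   # set the line speed (GNU stty)
    echo hello > "$dev"                  # should show up in minicom on the peer
}

# Build boot parameters that also send the earliest kernel messages
# to the serial port. Arguments: port name, baud rate.
early_serial_args() {
    printf 'earlyprintk=serial,%s,%s console=%s,%s\n' "$1" "$2" "$1" "$2"
}
```

Running `serial_hello /dev/ttyS0 38400` on the sending side while minicom listens at the same speed on the other machine verifies the cable before any time is spent on the failing kernel.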

Andreas Wahlert

14:13

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic / Dump of the panic


Yeeehaaaa, I got it!!! The cable was wrong! Shit. OK, here it is:

Bootdata ok (command line is root=/dev/hda1 vga=normal selinux=0 console=ttyS0,38400 resume=/dev/hda2 desktop elevator=as)
Linux version 2.6.8-24-smp (geeko@buildhost) (gcc version 3.3.4 (pre 3.3.5 20040809)) #1 SMP Wed Oct 6 09:16:23 UTC 2004
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009b800 (usable)
BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000d2000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000dff70000 (usable)
BIOS-e820: 00000000dff70000 - 00000000dff7f000 (ACPI data)
BIOS-e820: 00000000dff7f000 - 00000000dff80000 (ACPI NVS)
BIOS-e820: 00000000dff80000 - 00000000e0000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec00400 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
Scanning NUMA topology in Northbridge 24
Number of nodes 2 (10010)
Node 0 using interleaving mode 1/0
No NUMA configuration found
Faking a node at 0000000000000000-00000000dff70000
Bootmem setup node 0 0000000000000000-00000000dff70000
No mptable found.
ACPI: RSDP (v002 PTLTD ) @ 0x00000000000f6f80
ACPI: XSDT (v001 PTLTD XSDT 0x06040000 LTP 0x00000000) @ 0x00000000dff7b7f5
ACPI: FADT (v003 AMD HAMMER 0x06040000 PTEC 0x000f4240) @ 0x00000000dff7ee23
ACPI: ASF! (v032 TYAN TYANASF 0x06040000 PTL 0x00000001) @ 0x00000000dff7ef17
ACPI: MADT (v001 PTLTD APIC 0x06040000 LTP 0x00000000) @ 0x00000000dff7ef8a
ACPI: DSDT (v001 AMD-K8 AMDACPI 0x06040000 MSFT 0x0100000e) @ 0x0000000000000000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:5 APIC version 16
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:5 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x03] address[0xe0000000] gsi_base[24])
IOAPIC[1]: apic_id 3, version 17, address 0xe0000000, GSI 24-27
ACPI: IOAPIC (id[0x04] address[0xe0001000] gsi_base[28])
IOAPIC[2]: apic_id 4, version 17, address 0xe0001000, GSI 28-31
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
Using ACPI (MADT) for SMP configuration information
Checking aperture...
CPU 0: aperture @ e8000000 size 128 MB
CPU 1: aperture @ e8000000 size 128 MB
Built 1 zonelists
Kernel command line: root=/dev/hda1 vga=normal selinux=0 console=ttyS0,38400 resume=/dev/hda2 desktop elevator=as showopts
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 1.193182 MHz PIT timer.
time.c: Detected 2390.097 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Memory: 3605604k/3669440k available (2452k kernel code, 0k reserved, 942k data, 220k init)
Security Scaffold v1.0.0 initialized
SELinux: Disabled at boot.
Mount-cache hash table entries: 256 (order: 0, 4096 bytes)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
Using local APIC NMI watchdog using perfctr0
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU0: AMD Opteron(tm) Processor 250 stepping 0a
per-CPU timeslice cutoff: 1024.00 usecs.
task migration cache decay timeout: 2 msecs.
Booting processor 1/1 rip 6000 rsp 100dff25f58
Initializing CPU#1
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
AMD Opteron(tm) Processor 250 stepping 0a
Total of 2 processors activated (9469.95 BogoMIPS).
Using local APIC timer interrupts.
Detected 12.448 MHz APIC timer.
checking TSC synchronization across 2 CPUs: passed.
time.c: Using PIT/TSC based timekeeping.
Brought up 2 CPUs
checking if image is initramfs...it isn't (no cpio magic); looks like an initrd
Looking for DSDT in initrd ...No customized DSDT found in initrd!
NET: Registered protocol family 16
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
general protection fault: 0000 [1] SMP
CPU 1
Modules linked in:
Pid: 0, comm: swapper Tainted: MG (2.6.8-24-smp 20041006091623)
RIP: 0010:[<ffffffff8011a8fe>] <ffffffff8011a8fe>{generic_set_all+318}
RSP: 0018:00000100dff3bf48 EFLAGS: 00010006
RAX: 000000001e1e1e1e RBX: 0000000000000000 RCX: 0000000000000250
RDX: 000000001e1e1e1e RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000008 R08: 0000000006060606 R09: ffffffff8045d608
R10: 0000000000000000 R11: 0000000006060606 R12: 0000000000000000
R13: 0000000000000008 R14: 00000100dfce97c0 R15: 0000000000000c00
FS: 0000000000000000(0000) GS:ffffffff804e2500(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 00000000c005003b
CR2: 0000000000000000 CR3: 00000000dff28000 CR4: 0000000000000060
Process swapper (pid: 0, threadinfo 00000100dff24000, task 00000100047b0030)
Stack: ffffffff804e3290 00000100dff21ed8 0000000000000001 0000000000000000
0000000000000000 0000000000000000 0000000000000000 ffffffff8011943b
0000000000000001 0000000000000006
Call Trace:<IRQ> <ffffffff8011943b>{ipi_handler+75} <ffffffff8011c900>{smp_call_function_interrupt+64}
<ffffffff8010f5c0>{default_idle+0} <ffffffff80110f2f>{call_function_interrupt+99}
<EOI> <ffffffff8010f5e0>{default_idle+32} <ffffffff8010f9ea>{cpu_idle+26}
Code: 0f 30 41 ba 01 00 00 00 31 ff 8d 8f 58 02 00 00 0f 32 41 89
RIP <ffffffff8011a8fe>{generic_set_all+318} RSP <00000100dff3bf48>
<0>Kernel panic - not syncing: Aiee, killing interrupt handler!
NMI Watchdog detected LOCKUP on CPU0, registers:
CPU 0
Modules linked in:
Pid: 1, comm: swapper Tainted: MG (2.6.8-24-smp 20041006091623)
RIP: 0010:[<ffffffff80119540>] <ffffffff80119540>{set_mtrr+208}
RSP: 0000:00000100dff21ec8 EFLAGS: 00000002
RAX: 0000000000000001 RBX: 00000000ffffffff RCX: 0000000000000002
RDX: 0000ffff0000ffff RSI: 0000000000000000 RDI: ffffffff8045d790
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000040000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff804e2480(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo 00000100dff20000, task 00000100047b07e0)
Stack: 0000000000000000 0000000000000246 0000000100000001 0000000000000000
0000000000000000 00000000ffffffff 00000100047edec0 0000000000000008
0000000000000000 0000000000000000
Call Trace:<ffffffff804f1360>{mtrr_init+352} <ffffffff8010c2f2>{init+514}
<ffffffff8011129f>{child_rip+8} <ffffffff8010c0f0>{init+0}
<ffffffff80111297>{child_rip+0}
Code: f3 90 8b 44 24 10 85 c0 75 f6 be 08 00 00 00 48 c7 c7 90 d7
console shuts up ...
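For comparing captures from several machines, the faulting functions can be pulled out of such a log with a short filter. The sketch below is not from the thread: the function name is hypothetical and it assumes the 2.6.8 x86-64 oops layout shown above, where each register dump has a line of the form RIP: segment:[<address>] <address>{symbol+offset}.

```shell
#!/bin/sh
# Print symbol+offset for every "RIP: ..." register line found in an
# oops log on stdin. oops_rip_symbols is a hypothetical helper name.
oops_rip_symbols() {
    # grep -o emits each match on its own line, so this also works on
    # logs whose line breaks were lost; sed keeps the text in braces.
    grep -o 'RIP: [^ ]* <[0-9a-f]*>{[^}]*}' | sed 's/.*{\(.*\)}/\1/'
}

# Demo with the two RIP lines from the capture in this thread.
printf '%s\n' \
    'RIP: 0010:[<ffffffff8011a8fe>] <ffffffff8011a8fe>{generic_set_all+318}' \
    'RIP: 0010:[<ffffffff80119540>] <ffffffff80119540>{set_mtrr+208}' |
    oops_rip_symbols
```

As a side note, the byte pair 0f 30 at the start of the first Code line is the x86 encoding of wrmsr, and 0f 32 further along is rdmsr. If the dump starts at the faulting instruction, as it appears to here, CPU 1 faulted while writing a model-specific register, with RCX (0x250) selecting the register.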

Rainer Koenig

16 Dec

10:27

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic

Hi Andreas, Andreas Wahlert <Andreas.Wahlert@gmx.de> writes:

...

I'm in a serious discussion with the FSC Celsius man in Augsburg. perhaps there are any results today or monday.

Can you give me more contact info about that Celsius man in Augsburg? By definition it should be me... but I only just read about this because your problem is also showing up at other customers of our CELSIUS V810, and one of those customers pointed us to this thread. Sorry that this problem didn't get my attention before. And yes, I can perfectly reproduce it here on my V810 as well. And since we donated a CELSIUS V810 to SuSE this year, I wonder if they can reproduce it on their machine as well. I will try to see what I can do or find out about this and post it to the suse-amd64 mailing list. Regards, Rainer

-- Dipl.-Inf. (FH) Rainer Koenig Project Manager Linux Fujitsu Siemens Computers VP BC E SW OS Phone: +49-821-804-3321 Fax: +49-821-804-2131

Andi Kleen

11:01

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic

On Thu, Dec 16, 2004 at 11:27:45AM +0100, Rainer Koenig wrote:

...

Hi Andreas,

Andreas Wahlert <Andreas.Wahlert@gmx.de> writes:

...
I'm in a serious discussion with the FSC Celsius man in Augsburg. perhaps there are any results today or monday.

Can you give me more contact info about that Celsius man in Augsburg? Per definition it should be me... but I just read about this because your problem is also showing up at other customers of our CELSIUS V810 and that other customer pointed us to this thread. Sorry that this problem didn't get my attention before.

And yes, I can perfectly reproduce it here on my V810 as well.

It probably depends on the amount of memory installed and the BIOS version. Basically the crash happens when the kernel tries to duplicate MTRR setup done by the BIOS to the other CPU, so likely there is some issue in the original MTRR setup. I haven't heard a report for it from the V810 at suse, perhaps it doesn't show it. There is a bug open for the issue, but I haven't had time to look at it in detail yet. -Andi
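One way to look at the MTRR setup the BIOS left behind is to boot the non-SMP kernel that works and read /proc/mtrr. The sketch below is only an illustration with hypothetical helper names. As far as I know /proc/mtrr lists the variable-range registers only, so it may not show the exact setting that triggers the fault; it is still a cheap way to see whether a BIOS update changed anything.

```shell
#!/bin/sh
# Print the MTRR table. The optional path argument allows inspecting a
# saved copy; the default is the live table. show_mtrr is a
# hypothetical helper name.
show_mtrr() {
    file="${1:-/proc/mtrr}"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "MTRR table not readable: $file" >&2
        return 1
    fi
}

# Return 0 if two saved tables differ (for example before and after a
# BIOS update), 1 if they are identical. An unreadable file also counts
# as a difference.
mtrr_differs() {
    ! cmp -s "$1" "$2"
}
```

A possible workflow is `show_mtrr > mtrr.before`, then the BIOS update, then `show_mtrr > mtrr.after` and `mtrr_differs mtrr.before mtrr.after && diff mtrr.before mtrr.after`. The file names are placeholders.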

Eric Whiting

16:04

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic

Here is something else to look at (perhaps not related; I'm new to this thread): Sun Document ID 57680 lists an issue with their dual Opteron box running PCI-X cards in 133 MHz slots. It also links to a BIOS update and an AMD errata page that lists other issues. http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/2631....

We have a cluster of dual Opterons. A few of these have had kernel panic issues, but this is usually fixed with a replacement CPU or board. Most of our problems appear to be hardware related and not kernel related.

eric

Shawn Faulkingham

17:49

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic

I would like to verify this as well... we had a v20z that would randomly shut down on us. We replaced a faulty fan, and no more shutdowns. I then got the newest kernel from kernel.org and was able to get IPMI working on it as well... SuSE 9.0 Pro with the 2.6.9 kernel...

-- Shawn Faulkingham Indoff Inc. http://www.indoff.com

Andreas Wahlert

20:31

New subject: AW: [suse-amd64] Dual Opteron crashes with Kernel Panic

Hi, which kernel version have you tested (uname -a)? It's not a hardware issue. We get this error on all the machines. We have this problem only with SuSE 9.2, with the "out-of-the-box" kernel and the subsequent patched kernels. Regards, Andreas

Shawn Faulkingham

21:42

New subject: AW: [suse-amd64] Dual Opteron crashes with Kernel Panic

Spoke too soon... we have the 2.6.8.1 kernel in production... IPMI is working.

uname -a
Linux cfl-squidward 2.6.8.1-smp #7 SMP Mon Oct 25 12:27:51 CDT 2004 x86_64 x86_64 x86_64 GNU/Linux

-- Shawn Faulkingham Indoff Inc. http://www.indoff.com

Thomas Renninger

17 Dec

18:05

New subject: [suse-amd64] Dual Opteron crashes with Kernel Panic

Andi Kleen wrote:

...

On Thu, Dec 16, 2004 at 11:27:45AM +0100, Rainer Koenig wrote:

...
Hi Andreas,

Andreas Wahlert <Andreas.Wahlert@gmx.de> writes:

...
I'm in a serious discussion with the FSC Celsius man in Augsburg. perhaps there are any results today or monday.

Can you give me more contact info about that Celsius man in Augsburg? Per definition it should be me... but I just read about this because your problem is also showing up at other customers of our CELSIUS V810 and that other customer pointed us to this thread. Sorry that this problem didn't get my attention before.

And yes, I can perfectly reproduce it here on my V810 as well.

I could reproduce it here, too: an oops at exactly the same function:

RIP: 0010:[<ffffffff8011a8fe>] <ffffffff8011a8fe>{generic_set_all+318}

The oops does not appear on a SLES9-SP1 installation. However, I could solve the problem by installing a current BIOS version. It's a Tyan S2885 board. The previous BIOS version was from 01.2004; the new one (v. 2885_202) is from 19.05.2004. For what the update fixes (some MTRR enhancements are mentioned as well), see: http://www.tyan.com/support/html/b_s2885.html

After the update I now get some SCSI controller problems on SLES9 (9.2 boots properly). I cannot say anything about this for now; I will investigate further on Monday. Thomas

...

It probably depends on the amount of memory installed and the BIOS version. Basically the crash happens when the kernel tries to duplicate MTRR setup done by the BIOS to the other CPU, so likely there is some issue in the original MTRR setup.

I haven't heard a report for it from the V810 at suse, perhaps it doesn't show it.

There is a bug open for the issue, but I haven't had time to look at it in detail yet.

-Andi

Andreas Wahlert

18 Dec

14:19

New subject: AW: [suse-amd64] Dual Opteron crashes with Kernel Panic

Wow. Did you patch the FSC V810 with an original Tyan BIOS? Is that the way we should go for our customers, Rainer? Or should we wait for an official patch for the 1.05 FSC BIOS revision? Friendly regards, Andreas
