http://bugzilla.novell.com/show_bug.cgi?id=576681

http://bugzilla.novell.com/show_bug.cgi?id=576681#c67

--- Comment #67 from Jiri Bohac <jbohac@novell.com> 2010-04-06 19:14:51 UTC ---

I can now reproduce the problem as well. After more debugging I see that the machine is stuck in an endless loop of page faults. The page fault is triggered by the memset at fec00000. The page fault handler considers the fault "spurious" (caused by a stale TLB entry), so the kernel does nothing, the STOS instruction of memset is restarted, and the page fault triggers again.

The reason code for the page fault is 3, that is, a protection fault during a write operation. Looking at the PMD entry and the PTE of the fec00000 page, the page is marked writeable, so I don't understand why this happens.

The i386 specification says that the TLB should be flushed automatically after a PF trap, and that is why the PF handler does nothing if it believes the PF was "spurious". So this could either be a VB bug (because it is VB that emulates the paging, traps, etc. in the guest), or there is some other reason why a page protection fault can happen besides the permission bits in the PTE/PMD entry.

(In reply to comment #65)
> [ 0.000000] ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
> ...
> [ 44.781407] * pcpu debug:going to memset: chunk=e8e71140, cpu=0, off=8832, size=64, addr=fec00000

Yes, I also thought this was the reason at first, but I think the IOAPIC address refers to a physical address, while the allocated memory that memset faults on is at virtual address fec00000, right?
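
For readers following along, the reason code of 3 reported for the fault at virtual address fec00000 can be decoded from the x86 page-fault error code bit layout. The snippet below is only an illustrative sketch: the constant and function names are chosen for this example and are not taken from the kernel sources or from the debugging session in this bug.

```python
# x86 page-fault error code bits, as pushed by the CPU on a #PF exception.
# Names are illustrative assumptions for this sketch.
PF_PROT = 1 << 0   # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access, 1: write access
PF_USER = 1 << 2   # 0: supervisor mode, 1: user mode
PF_RSVD = 1 << 3   # reserved bit set in a paging-structure entry
PF_INSTR = 1 << 4  # fault caused by an instruction fetch


def decode_pf_error_code(code):
    """Return a human-readable description of an x86 #PF error code."""
    parts = [
        "protection violation" if code & PF_PROT else "page not present",
        "write" if code & PF_WRITE else "read",
        "user mode" if code & PF_USER else "supervisor mode",
    ]
    if code & PF_RSVD:
        parts.append("reserved bit set")
    if code & PF_INSTR:
        parts.append("instruction fetch")
    return ", ".join(parts)


# The reason code seen in this bug:
print(decode_pf_error_code(3))  # protection violation, write, supervisor mode
```

So a code of 3 means the page was present and the write was rejected on permission grounds while running in kernel mode, which is why a PTE that looks writeable is surprising here.
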

> 2.6.34-rcX has a random memory corruption bug which is showing up as various boot failures. Yinghai has a patch.
>
> http://thread.gmane.org/gmane.linux.kernel/963616/focus=964914

This looks pretty deterministic; it fails at exactly the same place for several people.

> -rc3 has the fix which got committed to suse kernel repo a couple of days ago. It should soon appear on Factory.

Also, this bug is probably going to stop appearing with the new kernel in Factory, because I recently switched IPv6 to be compiled-in. Most likely, this bug is not related to IPv6 at all, and it is just a coincidence that the order in which the install CD image loads kernel modules makes IPv6 the first one to need a new allocation of pcpu data and trigger this bug. With IPv6 compiled in, this order is going to change and the bug will either be triggered by something else or will not show up at all.

But even if this bug disappears, I think it is worth finding out what the cause was, before it causes other headaches in a different situation. More debugging soon; I currently have some more urgent bugs to deal with.