[Bug 228344] New: Acer Ferrari 1000 does not suspend - expect PCI problems
https://bugzilla.novell.com/show_bug.cgi?id=228344 Summary: Acer Ferrari 1000 does not suspend - expect PCI problems Product: openSUSE 10.2 Version: Final Platform: Other OS/Version: Other Status: NEW Severity: Normal Priority: P5 - None Component: Kernel AssignedTo: gregkh@novell.com ReportedBy: trenn@novell.com QAContact: qa@suse.de CC: stefan.fent@novell.com When trying to suspend the machine it stops quite early, last message on console: suspend: Snapshotting the system dmesg shows some PCI problems and I think they could be related (machine totally freezes, so I expect this happening when suspend tries to free the problematic PCI ports?). -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #1 from trenn@novell.com 2006-12-13 11:30 MST ------- Created an attachment (id=109606) --> (https://bugzilla.novell.com/attachment.cgi?id=109606&action=view) Dmesg with PCI Debug enabled Debug output did not show much more... There are two IMO potentially dangerous messages: PCI: Bus #09 (-#0c) is hidden behind transparent bridge #08 (-#09) (try 'pci=assign-busses') Please report the result to linux-kernel to fix this permanently Harmless? This one seems to be more problematic, should be the PCIe ports?: PCI: Using ACPI for IRQ routing PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report PCI: Cannot allocate resource region 7 of bridge 0000:00:04.0 PCI: Cannot allocate resource region 8 of bridge 0000:00:04.0 PCI: Cannot allocate resource region 7 of bridge 0000:00:05.0 PCI: Cannot allocate resource region 8 of bridge 0000:00:05.0 PCI: Cannot allocate resource region 9 of bridge 0000:00:05.0 PCI: Cannot allocate resource region 7 of bridge 0000:00:06.0 PCI: Cannot allocate resource region 8 of bridge 0000:00:06.0 PCI-DMA: Disabling IOMMU. These three get disabled then: PCI: Bridge: 0000:00:04.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Bridge: 0000:00:05.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Bridge: 0000:00:06.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 trenn@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |pavel@novell.com ------- Comment #2 from trenn@novell.com 2006-12-13 11:32 MST ------- Rest of the machine works fine, could it be that the machine freezes when suspend code tries to touch the broken devices? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 pavel@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |seife@novell.com Summary|Acer Ferrari 1000 does not |Acer Ferrari 1000 does not suspend to disk - PCI |suspend - expect PCI |problems? |problems | ------- Comment #3 from pavel@novell.com 2006-12-13 13:38 MST ------- Try usual debugging: boot with init=/bin/bash, try suspend from there. If it helps, binary search modules to find the broken one. We are talking about suspend-to-disk, right? (What about suspend-to-ram, btw? Has it ever worked before?) -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 stefan.fent@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |bodo.bauer@novell.com ------- Comment #4 from stefan.fent@novell.com 2006-12-14 02:07 MST ------- Yes, suspend-to-disk Suspend-to-ram doesn't work either (AFAIR, I tried that, too) -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #5 from trenn@novell.com 2006-12-14 02:54 MST ------- Did that. Could it be that fb must be off? With vga= removed I now see oopses with init=/bin/bash. First I got an oops and calltrace to thermal resume (I don't see what could go wrong there, but I couldn't scroll up to the beginning...). Then I tried again with thermal module also removed before and it oopsed again. Now I am quite sure it has to do with above PCI warnings at boot time: Call Trace (written by hand): pci_restore_msi_state pci_restore_state pci_set_master pcie_portdrv_resume resume_device dpm_resume device_resume pm_suspend_disk .. RIP: kfree+0x4d/0x1bb Again I couldn't scroll up to the very beginning (but call trace is complete), I can try to setup a firewire debug console if really needed... -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #6 from pavel@novell.com 2006-12-14 12:30 MST ------- It would be nice to find out which device does this... Can you try passing nomsi parameter to see if it helps? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 gregkh@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |NEEDINFO Info Provider| |trenn@novell.com -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 stefan.fent@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- AssignedTo|gregkh@novell.com |bk@novell.com Status|NEEDINFO |NEW Info Provider|trenn@novell.com | -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 bk@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |ASSIGNED ------- Comment #7 from bk@novell.com 2007-01-10 09:59 MST ------- pci=nomsi changes the oops from the kfree() at the end of pci_restore_msi_state() to a oops occuring at some random locations, I have seen about 3 locations, so for me, not only the oops in the kfree in pci_restore_msi_state, but also the other random oopses look like to be only possible because of corrupted memory. The first place where something went wrong during boot were these messages, from Comment #1 of Thomas: PCI: Cannot allocate resource region 7 of bridge 0000:00:04.0 .. and so on also for 00.05 and 00.06. They are followed by: PCI: Bridge: 0000:00:04.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Bridge: 0000:00:05.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. PCI: Bridge: 0000:00:06.0 IO window: disabled. MEM window: disabled. PREFETCH window: disabled. and by: PCI: Setting latency timer of device 0000:00:04.0 to 64 pcie_portdrv_probe->Dev[5a36:1002] has invalid IRQ. Check vendor BIOS assign_interrupt_mode Found MSI capability Allocate Port Service[0000:00:04.0:pcie00] Allocate Port Service[0000:00:04.0:pcie01] Allocate Port Service[0000:00:04.0:pcie03] PCI: Setting latency timer of device 0000:00:05.0 to 64 pcie_portdrv_probe->Dev[5a37:1002] has invalid IRQ. Check vendor BIOS assign_interrupt_mode Found MSI capability Allocate Port Service[0000:00:05.0:pcie00] Allocate Port Service[0000:00:05.0:pcie01] Allocate Port Service[0000:00:05.0:pcie03] PCI: Setting latency timer of device 0000:00:06.0 to 64 pcie_portdrv_probe->Dev[5a38:1002] has invalid IRQ. Check vendor BIOS assign_interrupt_mode Found MSI capability Allocate Port Service[0000:00:06.0:pcie00] Allocate Port Service[0000:00:06.0:pcie01] Allocate Port Service[0000:00:06.0:pcie03] PCI: Setting latency timer of device 0000:00:07.0 to 64 pcie_portdrv_probe->Dev[5a39:1002] has invalid IRQ. Check vendor BIOS assign_interrupt_mode Found MSI capability Allocate Port Service[0000:00:07.0:pcie00] Allocate Port Service[0000:00:07.0:pcie01] Allocate Port Service[0000:00:07.0:pcie02] Allocate Port Service[0000:00:07.0:pcie03] The bridges in question are shown in lspci as: 00:00.0 Host bridge: ATI Technologies Inc RS480 Host Bridge (rev 10) 00:01.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge 00:04.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge 00:05.0 PCI bridge: ATI Technologies Inc Unknown device 5a37 00:06.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge 00:07.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge And lspci -t shows -[0000:00]-+-00.0 +-01.0-[0000:01]----05.0 +-04.0-[0000:02]-- +-05.0-[0000:03-05]-- +-06.0-[0000:06]-- I have looked at pcie_portdrv_probe(), which spills the most noticeable warning messages: static int __devinit pcie_portdrv_probe (struct pci_dev *dev, const struct pci_device_id *id ) { int status; status = pcie_port_device_probe(dev); if (status) return status; if (pci_enable_device(dev) < 0) return -ENODEV; pci_set_master(dev); if (!dev->irq) { printk(KERN_WARNING "%s->Dev[%04x:%04x] has invalid IRQ. Check vendor BIOS\n", __FUNCTION__, dev->device, dev->vendor); } if (pcie_port_device_register(dev)) { pci_disable_device(dev); return -ENOMEM; } return 0; } I have built 2.6.20-rc4-git2 now to test if it helps to abort in this function when the warning is triggered, but with init=/bin/bash, it but it panics suspend2disk after printing Suspending console(s) without further messages, even with vanilla code, Caps-Lock blinking and does the automatic reboot on panic which I have set up through sysctl. pci=noacpi does not even help, only acpi=off makes suspend2disk succeed. The boot warnings which I quoted also appear on other PCIE machines which Thomas has got, and we want the messages fixed there too. Pavel, if you have ideas, they would be welcome. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #8 from bk@novell.com 2007-01-10 11:10 MST ------- I have changed pcie_portdrv_probe() to do nothing (verified in the dmesg) on the 3 strange PCI bridges, but it didn't help. I have also compared the dmesg from Thomas with the one from 2.6.20-rc4-git2 but didn't get new ideas from that either as 2.6.20-rc4-git2 (with the same config settings) didn't help anyways. I'm going to rebuild 2.6.20-rc4-git2 with debugging options on and consoles not disabled during suspend and see if I get more info then. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #10 from bk@novell.com 2007-01-11 08:30 MST ------- I've got suspend on 2.6.20-rc4-git2 to succeed by aborting the data copy loop in kernel/power/snapshot.c:copy_data_pages() before it hits the page (src_pfn= 0x4000) which triggered the oops. With this hack which consists of if (pfn==0x4000) break the kernel suspends, but on resume it hangs after reading the pages from swap. Once, resume displayed vertical white bars on the display. Looks like the memory corruption also bites on 2.6.20-rc4-git2 (one could assume), or the kernel data pages used by suspend do no look like the suspend code expects them to be. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #11 from bk@novell.com 2007-01-11 09:48 MST ------- Created an attachment (id=112500) --> (https://bugzilla.novell.com/attachment.cgi?id=112500&action=view) minium kernel configuration used for testing suspend on Ferrari 1000 with 2.6.20-rc4-git2 It looks like that we can pretty much exclude the PCIE code from the list of suspects because in my current test kernel config, I disabled everything I could, including PCI Express support, MSI, MMCONFIG and SMP. All code which accesses the three PCI bridges which caused the warnings can also be removed from the list of subjects because I patched drivers/pci/probe.c to skip them during the probing and are treated as if they do not exist, so that they do not become known to these functions and the kernel (and are also not shown in lspci). -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #12 from bk@novell.com 2007-01-11 10:03 MST ------- But what surprises me is that there must be (at least also) an error in the data which suspend uses to verify if it can actually save the page it wants to copy. To exclude pages which should not be copied, copy_data_pages calls saveable_page(pfn) before on every page before it starts copying, and double-check, I even added anther check directly into the copy loop, but not even then the check triggers and the kernel still panics in do_copy_page with this check added: memory_bm_position_reset(orig_bm); memory_bm_position_reset(copy_bm); do { pfn = memory_bm_next_pfn(orig_bm); + if (!saveable_page(pfn)) { + printk("copy_data_pages: not saveable: %p\n", pfn); + break; + } if (likely(pfn != BM_END_OF_MAP)) copy_data_page(memory_bm_next_pfn(copy_bm), pfn); } while (pfn != BM_END_OF_MAP); Very strange, Pavel, any idea? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #13 from bk@novell.com 2007-01-12 05:51 MST ------- I installed 10.2-i386 into a new partition and it was also hanging. Compiled 2.6.20-rc4-mm1 on it and suspend2disk and -resume works flawlessy there. Fix for SLED10 SP1 wanted... -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 bk@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |NEEDINFO Info Provider| |pavel@novell.com ------- Comment #14 from bk@novell.com 2007-01-12 07:31 MST ------- Hm, 10.2-i386 does not hang on supend2disk if done from init=/bin/bash, and suspend2disk succeeds, but resume turn off power after reading the swap image. Trying to suspend 10.2-i386 from runlevel 1 (sulogin) showed quite me some errors and timeouts USB (ehci-hcd) and messages from pnp for some devices not supporting activation before the code triggers a reboot. The reboot also happens if I rmmod ehci-hcd and ohci1394 before the echo disk
/sys/power/state. Suspend from X shows no messages at all (black screen for a while) before it triggers the reboot.
Pavel, can't we show the kernel messages also when we suspend from higher runlevels than 1? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #15 from bk@novell.com 2007-01-12 10:16 MST ------- Created an attachment (id=112752) --> (https://bugzilla.novell.com/attachment.cgi?id=112752&action=view) kernel config for 2.6.20-rc4-mm1/i386 which gives working suspend from X This is the only kernel config with which I was able to suspend/resume the machine at all so far. It's i386, uniprocessor, but ACPI is on. Turn e.g. SMP on based on this config, and suspending from X will stop at saving ~23% and suspending from init 3 will stop at ~84% saved. Even with this kind of config, 2.6.20-rc4-mm1 still panics on the data page copy on x86_64 as ?I described in comment #10, so there are 2 different symtoms with this machine, maybe with the same cause?: - panic on kernel data page copying on x86_64 - problems with all but a extremely minimal config on i386 -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #16 from bk@novell.com 2007-01-12 10:41 MST ------- The i386 SMP suspend hang was limited to the -mm1 kernel, plain i386 2.4.20-rc4 also suspends with SMP correctly now. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 pavel@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |ASSIGNED Info Provider|pavel@novell.com | ------- Comment #17 from pavel@novell.com 2007-01-12 16:46 MST ------- If you want to display messages from init 5... it should be enough to set console loglevel correctly, no? (killall klogd?) Yes, it looks like there's driver problem _plus_ some core problem, too... but 2.6.20-rc4 works in both respects, right? Is CONFIG_PAE set it any of those kernels? That's known broken and pretty much unfixable. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #18 from bk@novell.com 2007-01-17 11:14 MST ------- Regarding messages: Sure, I can turn up the loglevel, but in general I noticed that during some phases during suspend which can last for seconds, there is no visual feedback (except for the HD LED) that things are still being worked on and the system is indeed working on suspending or resuming and not hung. But that's a candidate for a feature request or for a separate bugzilla entry. 2.6.20-rc4 (I switched to -rc5 meanwhile) is apparently better than any earlyer kernels which I tested, but still has two issues on this laptop: -When the kernel is compiled with CONFIG_MTRR, at the end of resume, things break: Depending on the other config options/loaded modules/test case, it either triggers a reboot of the system after it prints the message for Booting the 2nd CPU (similar place for non SMP kernel) or it just hangs some seconds after reading the image back from disk. - When the tg3 module is loaded on suspend, it hangs on suspend, after writing the image. I have got some MTRR setup info from booting with mtrr.show=1 and will post it tomorrow. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #19 from bk@novell.com 2007-01-24 11:13 MST ------- Created an attachment (id=114778) --> (https://bugzilla.novell.com/attachment.cgi?id=114778&action=view) "dmesg|grep -B5 MTRR" from 2.6.20-rc5-20070122124535-default (kernel CVS from this monday) x86-64 kernel, booted with mtrr.show=1 Regarding comment #17: This bug is for x86_64, so I mostly debugged using x86_64 and CONFIG_PAE doesn' exist there. I didn't make this clear in the bug's text and since the hardware information of this bug to x86_64, I'll set it. I'm attaching the debug info info from mtrr.show=1 I'll cc Andi, Bodo told me that he did something with MTRR's and should be able to help with MTRR's. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 bk@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |ak@novell.com Status|ASSIGNED |NEEDINFO Info Provider| |ak@novell.com ------- Comment #20 from bk@novell.com 2007-01-25 05:37 MST ------- Hi Andi, as Bodo mentioned that you have experiance with MTRR's, could you have a look into the attachment 114778 of the previous comment? It's an AMD task statement machine (Ferrari 1000) where resume from disk with CONFIG_MTRR causes a reboot with power-cylce at the point where the second CPU is booted (that's the last message which I can see before the display is reset), but it also happens on a Uniprocessor when CONFIG_MTRR is on. Processor is (from kernel logs): AMD Turion(tm) 64 X2 Mobile Technology TL-56 stepping 02 -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 bodo.bauer@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC|bodo.bauer@novell.com | ------- Comment #21 from bodo.bauer@novell.com 2007-01-25 08:39 MST ------- I'm leaving Novell. If TPM assistance is needed, please ask Joachim Plack (AMD related issues) or Oliver Ries (general x86_64/i386) for assitance. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 bk@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|Acer Ferrari 1000 does not |Acer Ferrari 1000 resets on resume with |suspend to disk - PCI |CONFIG_MTRR |problems? | ------- Comment #22 from bk@novell.com 2007-01-30 09:45 MST ------- Status summary (for Andi or later Andrea or AMD techs): On resume, if the kernel is compiled with CONFIG_MTRR (x86_64 and i386), the last message which I can read before the display is reset, is: Booting processor 1/2 APIC 0x1 A kernel without CONFIG_MTRR gives these messages afterwards: Initializing CPU#1 Calibrating delay using timer specific routine.. 1600.16 BogoMIPS (lpj=3200321) CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) CPU: L2 Cache: 512K (64 bytes/line) CPU 1/1 -> Node 0 CPU: Physical Processor ID: 0 CPU: Processor Core ID: 1 AMD Turion(tm) 64 X2 Mobile Technology TL-56 stepping 02 CPU 1: Syncing TSC to CPU 0. CPU 1: synchronized TSC with CPU 0 (last diff 0 cycles, maxerr 452 cycles) This also happens on Uniprocessor kernels and other kernels with quite small configuration, with init=/bin/sh, and with vga=0 on plan vanilla kernels, so it does not appear to me like an interaction with some non-required driver. Andi, any ideas? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 bk@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |ASSIGNED Info Provider|ak@novell.com | ------- Comment #23 from bk@novell.com 2007-02-07 09:36 MST ------- Sorry that this comment is a bit long, but I really wanted to give a complete update: Andi suggested debugging the MTRR resume function, but while trying that on the CVS kernel (which otherwise has working suspend/resume when CONTIG_MTRR is off because it has no bad interaction between tg3 and sata_sil), I found that MTRR breaks suspending with the sata_sil driver, but with the CVS kernel only so far. While sata_sil should in theory be robust to suspend independently of MTRR support in the kernel, this itself is a bug, but this actually helps me a but to hopefully find out what is wrong with MTRRs on the Ferrari 1000. At least, they should normally not have such impact, I guess... While adopting Andi's development patches to switch from MTRR to PAT would be too bleeding-edge for general adoption just now, starting to try out his patches for PAT and disabling MTRR on the brand new Ferrari 1000 (controlled by DMI) would be an option which I'd be interested in. During the last release candidates of 2.6.19, a big fix for the MTRR code was adopted, so I could also try to revert it an see: http://lkml.org/lkml/2006/11/15/144 While suspending to disk (but not resuming from disk) worked with CONFIG_MTRR on on plain vanilla 2.6.20-rc5, I do not think that this is a bug in sata_sil, because sata_sil otherwise suspends and resumes fine with CONFIG_MTRR is off and it even works with CONFIG_MTRR enabled on a similar machine (HP nx6325) with the same CPU stepping (TL-56 stepping 02) and same ATI RS480 chipset. The only issue which I still saw with sata_sil in the CVS kernel build was that if using vesafb and with console log level was 8 and PCI_- and PM_DEBUG on, it still lost it's disk, probably still because the intense logging on vesafb takes too long to make the driver recover. So the sata_sil driver in CVS is apparently improved, but still not useful for suspend debugging with many messages logged over a slow output like vesafb. Pure VGA Text mode console logging was fine. Since the machine does not reset itself but hang, that new symptom is better because at that point, I may still be able to read the Ferrari's RAM chips over DMA using the thankfully on-board OHCI1394 controller over Firewire using Andi's x86_64 port of firescope. That would be harder to do after a complete machine reset and so I could get debugging information which the CPU does not even have to output somewhere, it just has to be written to the RAM chips, from where I can read it over firewire. Thanks to Andi also for some tips on how to set that up. If reverting the last MTRR fix does not help and Andi's PAT code is not yet well enough (a small fix would be also preferred if possible), I'd have to look at the MTRR initialisation routines, reduce MTRR initialisation to a minimum, see if that works and start setting more MTRR registers. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #24 from ak@novell.com 2007-02-07 10:52 MST ------- I don't think PAT will help because SIL shouldn't write combining and for non cached mapping the in kernel code already DTRT. Have you checked if it's suspend or resume yet as I suggested? Then disabling the individual mtrrs and finding which one causes it should give some clues. This can be all done without a working console -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #25 from bk@novell.com 2007-02-09 09:58 MST ------- Just a quick update: I have found that the execution of set_fixed_ranges() which sets up the fixed-range MTRRs cause the sata_sil driver to fail by misreading the size of sda in the resume phase while suspending to disk by returning from it without setting up the fixed-range MTRRs. The reset while resuming from disk while initializing the CPU (I think 2nd core, have to check to logs closer) which happened with the vanilla 2.6.20-rc5 is also gone when set_fixed_ranges() is changed to not set the fixed-range MTRRs. As Andi suggested, I'll try to nail down which of the MTRRs causes the problem. Another idea which comes to my mind: As the MTRR are AFAICS are initially set-up by the BIOS, I guess a look for an BIOS upgrade could also be useful, but in case the BIOS does not allow downgrading, I will not be able to further debug the issue then, so I am a bit hestitant to that. However, it would be good to know if the BIOS changelog has something, I guess. I do not yet really understand what the kernel does with the fixed-range MTRRs: I have not seen any possibility in the kernel to change the setup of the fixed-range MTRRs, so it looks to me that set_fixed_ranges() should normally do nothing to the MTRRs, as nobody ever asks to change something, so if everything is correct, it should not try to write to them, it should normally only verify and restore them in case the initial setup from the BIOS is lost sometime. I hope that we can sched some light on this.... For a start, I have added some debug printks to the function to see more what it's doing and where it's causing the reset and the sata_sil trouble. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #26 from pavel@novell.com 2007-02-13 03:53 MST ------- I'd do BIOS upgrade if I were you. If it is BIOS problem, and it is fixed in newer BIOS, you do not want to waste your time trying to debug it. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #27 from bk@novell.com 2007-02-14 09:40 MST ------- There is indeed a newer BIOS available from Acer, six-teen versions to be exact, but only the last version is available, without changelog. At least the Windows-only Flasher seems to backup the old BIOS image to a backup file, so I guess I could restore the current BIOS in case. There is no support for upgrades. If something goes wrong -> not Acer's fault. I could at least understand if somebody would not like to mess with the BIOS. Back to teh technical analysis: In the way I understand the x86-64 manual, the initial setup of Extended Fixed-Range MTRRs is wrong. On early boot, after identifiying the boot CPU, I dumped MTRR and SYSCFG MSR contents and found the following Extended Fixed-Range MTRR setup in the boot CPU: 1) Fixed-Range MTRRs are enabled (MTRR_deftype bits 10 and 11 are set) 2) Extended Fixed-Range MTRRs are enabled and RdMem and Wrmem bits are configured to be rw-accessible (The MtrrFixDramEn and MtrrFixModEn bits in the SYSCFG MSR are set) 2) All WrMem and RdMem bits are 0, but the MTRRs for 00000-9FFFF (0-640k) are set to write-back (WB). To repeat, the entire register's value is 6, and Fixed-Range MTRRs are enabled as well. This combination, according to the page 235 of AMD publication 24593 from http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_739_7044,00... is not an "allowable combination" and results in "undefined" and "unpredictable" behaviour. But when I retrieve the MTRRs after boot, I get this register dump: CPU 0 MTRR 00000-7FFFF RdMem+Wrmem+write-back CPU 0 MTRR 80000-9FFFF RdMem+Wrmem+write-back CPU 0 MTRR C0000-C7FFF RdMem+write-protect CPU 0 MTRR C8000-CFFFF RdMem+write-protect CPU 0 MTRR E4000-E7FFF RdMem+write-protect CPU 0 MTRR E8000-EFFFF RdMem+write-protect CPU 0 MTRR F0000-F7FFF RdMem+write-protect CPU 0 MTRR F8000-FFFFF RdMem+write-protect CPU 1 MTRR 00000-7FFFF write-back CPU 1 MTRR 80000-9FFFF write-back CPU 1 MTRR C0000-C7FFF write-protect CPU 1 MTRR C8000-CFFFF write-protect CPU 1 MTRR E4000-E7FFF write-protect CPU 1 MTRR E8000-EFFFF write-protect CPU 1 MTRR F0000-F7FFF write-protect CPU 1 MTRR F8000-FFFFF write-protect While CPU Core 1 still had the same setup as in early boot, the access types of the Boot CPU Core have drastically changed. A number of memory ranges which are often used for ROM are changed to a shadowed ROM configuration, as if something shadowed ROM respective memory-mapped I/O to system memory for shadowed ROM while the 1st 640k of memory are now properly configured for write-back. Now, when the inital MTRR setup of the two CPU core is restored during suspend after saving+restoring the CPU context in {save,restore}_processor_state, the bug manifests itself in the form of sata_sil from the kernel CVS failing to access sda when it's time to write the suspend image or a system reset later on resume with older kernels. I have not found Linux kernel code which would set the RdMem and Wrmem bits, ACPI seems to be not able to do wrmsr but an SMI (system management interrupt) may do it: http://en.wikipedia.org/wiki/System_Management_Mode A good fix may however be to save the MTRR state in save_processor_state, including the extended fixed-range MTRRs (currently nothing regarding MTRRs is done there) and restore it in restore_processor_state, and ensuring that all running CPUs have the save MTRR setup. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 bk@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |NEEDINFO Info Provider| |mark.langsdorf@amd.com ------- Comment #28 from bk@novell.com 2007-02-19 06:46 MST ------- I tested saving the Fixed-range MTRR state in save_processor_state, but on restore_processor_state for the second core, an exception is thrown by WRMSR when set_fixed_ranges tries to set the RdMem and/or WrMem attributes in the Extended MTRR Type-Field Format for Fixed-Range MTRRs which are described in section 7.8.1 of AMD publication 24593, Rev. 3.12 ( http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/2459... ) which are already set on the boot core. That kind exception is handled by the MTRR code already, it logs that the MSR for which WRMSR failed and continues and the system resumes otherwise fine. The code to handle an exception from WRMSR is in mainline because it was a work-around for an BIOS bug regarding MTRRs, found and fixed by SuSE/Andi: http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.12-rc1/2.6.12... I looked up the reasons why WRMSR may throw an exception on page 359 of http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/2459... and AFACS, none of them is present, so at least from that documentation, the exception shouldn't be thrown. I also verified that the MtrrFixDramModEn and MtrrFixDramEn bits are set in the SYSCFG MSR of the second core immediately before trying to set the bits, but both are set, so nothing should prevent to set these bits, but the system throws the exception nonetheless. I tried to search for AMD erratas and Turion 64 X2 docs which would document this, but didn't find documentation on that. Mark, may I be right with the following: May it be that these bits may not be set on the second TL-56 core despite having two a separate L2 cache since this kind of memory configuration (shadowed ROM and write-back memory below 1M) and is more usual for the system BIOS (which will only use the boot core) than by modern operating systems which use both cores? As a workaround, the fixed-range MTRRs could be saved&restored using separate backup values per logical CPU. The resulting suspend/resume setup would not change, except that the exceptions would not be thrown and logged and the MTRRs would be restored as it was before resume. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #29 from ak@novell.com 2007-02-19 07:17 MST ------- The MTRR save/restore procedure doesn't 100% follow the intel recommendations. Maybe it's related to that. It was on my todo list to fix that up.
A good fix may however be to save the MTRR state in save_processor_state, including the extended fixed-range MTRRs (currently nothing regarding MTRRs is done there) and restore it in restore_processor_state, and ensuring that all running CPUs have the save MTRR setup.
This sounds like a good idea. Can you try that? -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #30 from bk@novell.com 2007-02-19 11:10 MST ------- Yes, looks good. A quick proof-of-concept, tailored to the dual-core Ferrari 1000 works on it. Just for additional information, but not of importance, do you have (or has anybody here) a pointer to the recommendations for MTRR save/restore? In the first step which I described in comment #28, I initialized the MTRRs of the AP with the values which I got from the BP in disable_nonboot_cpus() (because I saw that it runs during suspend) - That was an easy quick test. Because that caused exceptions on the second core of this Ferrari 1000 (as described), I used a separate MTRR backup for the AP (the AP does not accept the same MTRR setup on this machine) - and that works fine and really restores what was there before suspend (which should be safe). I'll extract the changes and post them for review here. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 bk@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |ASSIGNED Info Provider|mark.langsdorf@amd.com | ------- Comment #31 from bk@novell.com 2007-02-20 12:49 MST ------- I had a simple off-by-one error in checking the MtrrFixDramModEn bit and while trying to set RdMem and WrMem, the bit was not set, so this was the reason why the exception which occurred in wrmsr so everything works as documented and I can set the RdMem and WrMem and bits successfully on the second CPU core. I now worked on extending my changes which save and restore the extended fixed-range MTRRs which are a special feature of the AMD64 architecture during suspend/resume to also syncronize them between CPUs as described as a "must" in section 7.6.5 "MTRRs in Multi-Processing Environments" of the AMD x86-64 programmer's manual, vol. 2, despite the BIOS not conforming to that, so that part is actually a BIOS workaround, but the MTRR code already has a tradition in doing so, e.g. in comment #28, I mentioned a MTRR BIOS bug workaround by Andi and there are more, e.g for more easyer cases than the Ferrari 1000, the Linux MTRR code already did the right thing, but with the extended fixed-range MTRRs of the AMD64 architecture, and the Ferrari 1000 setting up shadow RAM when enabling ACPI mode made it bit more tricky. But on the other hand, I think fixing this nearly worst-case scenario was a good test for this code to save/restore and snyc the AMD64 extended fixed-range MTRRs should be working once and for all. I'll dig out the combined changes and post them for review. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 bk@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #109606|0 |1 is obsolete| | ------- Comment #32 from bk@novell.com 2007-02-21 11:09 MST ------- Created an attachment (id=120361) --> (https://bugzilla.novell.com/attachment.cgi?id=120361&action=view) roof of concept for fixing save/restore/sync of extended fixed-range MTRRs on AMD64 This is the extracted patch, but it is only a proof of concept. It works on my the Ferrari 1000 and should to "the right thing"(tm), IMHO. It's ready for review, but needs more changes, e.g. it would not be accepted into mainline in this state because it accesses the possibly K8-specific SYSCFG MSR directly, but I don't know if this MSR is even there on the Intel 64 CPUs. Maybe AMD knows wether it is, but I have not found it described in the Intel 64 and IA-32 programmer's manual. In any case these accesses would need to be encapsulated to only be done on x86_64 or even only on AMD64 (K8) CPUs. Otherwise it could be used now and works fine. I also verifified the MTRR state to be sure that not only suspend2disk works but that the MTRRs are indeed in SYNC. Preferably set_fixed_ranges() should be rewritten to use a new function called update_fixed_range which handles the reoccurring code of reading an MSR, checking if it needs update and updating it if needed, with status returned. That would reduce the code size while adding the support for AMD64 extended fixed range MTRRs. I could do so as well, but I first wanted to show what I did in essence, and when we know how to draft it finally, I can do it of course. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #33 from bk@novell.com 2007-02-22 10:16 MST ------- Created an attachment (id=120608) --> (https://bugzilla.novell.com/attachment.cgi?id=120608&action=view) updated fix with detection of Hammer, includes rewrite of set_fixed_ranges, tested on Ferrari 1000 in 64-bit mode - for review I could not access the K8 SYSCFG MSR on Intel64 (tried it in legacy mode only) and http://en.wikipedia.org/wiki/EM64T#Differences_between_AMD64_and_Intel_64 also says that it does not exist. So this updated patch accesses the K8 SYSCFG MSR only on Hammer CPUs, also in legacy mode (I assume that there is nothing different there, but I will test it), but not on Intel 64, Athlon or other CPUs. It contains a rewrite of set_fixed_ranges() which was a spaghetti-implementation of what could have been written better (smaller in code and source and easy to read) using an inlined function, but without the code for handling WrMem and RdMem, my rewrite has the exact same object size, so there was no big deal in rewriting it, other than readability. However, by adding support for WrMem/WrMem, the code is smaller when not inlined (speed is not a concern in this code path) so a rewrite to use a function was even better for code size. The patch could be optimised even more for object size, but that is not the main topic now. Anyway, with this version, result is that the object code with the rewrite is more 100 bytes smaller than that of the previous patch, even with the K8 check which isn't even included in the previous patch, but also the source code is more readable. This patch works on the Ferrari 1000 in long mode (I verified the MTRRs right after boot and after many resumes to be in sync and not messed up), and will also test what happens in legacy mode. What remains is a test in 32-bit mode and the patch could be further polished to be ready for submission. Comments welcome. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #34 from bk@novell.com 2007-02-23 08:28 MST ------- Created an attachment (id=120787) --> (https://bugzilla.novell.com/attachment.cgi?id=120787&action=view) polished patch, reduces binary size by one for loop for all calls to set_fixed_range Attached updated patch, Description is: "reduces binary size by one for loop for all calls to set_fixed_range." Contains only small changes. I found that my test program to verify the MTTR contents had a bug so I didn't notice that the patch doesn't syncronize the MTRRs after the SMI for the transition to ACPI mode. Initially I thought that I could call mtrr_save_fixed_ranges() in the window between acpi_enable() and the init of the APs, but it would be better if do a smp_call_function_single(0, mtrr_save_fixed_ranges, NULL, 1, 1); in mtrr_ap_init, which works, but it triggers a WARN_ON smp_call_function_single() is called with interrupts disabled because it could deadlock then, so I will probably better lookup the places where the APs are signaled to boot and call mtrr_save_fixed_ranges() there. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #35 from joachim.deguara@amd.com 2007-02-25 14:58 MST ------- the same problem, premature shutdown when suspending to disk, happens on the Ferrari 5000 also, but the patch from comment #34 fixes it. Tested on openSUSE 10.2 with 2.6.21-rc1-git1 kernel. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #36 from bk@novell.com 2007-02-26 10:48 MST ------- Created an attachment (id=121120) --> (https://bugzilla.novell.com/attachment.cgi?id=121120&action=view) updated fix which works on arch/i386 as well, should be ready for submission, just one minor beautification missing Attached updated fix which works on arch/i386 as well. It is ready for submission except for one detail which is only beauty(last change to be synced to x86_64) in detail and needs to be tested to work in uniprocessor kernel builds as well and does not break builds of the kernel without MTRR support. With this patch, I verified that the same issue is also there in legacy mode, and verified that it is fixed with the same change. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #37 from bk@novell.com 2007-02-27 10:12 MST ------- Created an attachment (id=121351) --> (https://bugzilla.novell.com/attachment.cgi?id=121351&action=view) updated patch for review, possibly mtrr_save_state() could be moved to an x86-generic location, but that's the only thing left for now and this can be discussed on linux-kernel on submission I tested this patch now on x86_64, works fine and fixes also the PCIE issue which we initially saw on this bug. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 ------- Comment #38 from bk@novell.com 2007-02-28 10:41 MST ------- Created an attachment (id=121632) --> (https://bugzilla.novell.com/attachment.cgi?id=121632&action=view) Program for reading the fixed-range MTRR values man Test program for reading current fixed-range MTRR values, enable CONFIG_X86_MSR and load msr.ko. Assumes that msr.ko device files are created as /dev/cpu/<#CPU>/msr by udev (default) Example output from broken system (each 64-bit column is from one CPU): MSR 0x250: 1e1e1e1e1e1e1e1e | 0606060606060606 MSR 0x258: 1e1e1e1e1e1e1e1e | 0606060606060606 MSR 0x259: 0000000000000000 | 0000000000000000 MSR 0x268: 1515151515151515 | 0505050505050505 MSR 0x269: 1515151515151515 | 0505050505050505 MSR 0x26a: 0000000000000000 | 0000000000000000 MSR 0x26b: 0000000000000000 | 0000000000000000 MSR 0x26c: 1515151500000000 | 0505050500000000 MSR 0x26d: 1515151515151515 | 0505050505050505 MSR 0x26e: 1515151515151515 | 0505050505050505 MSR 0x26f: 1515151515151515 | 0505050505050505 In fixed setup, the register dump blocks from all CPUs are identical and do not change e.g. after resume from disk. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 bk@novell.com changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #120787|text/x-patch |text/plain mime type| | Attachment #120787|0 |1 is patch| | -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=228344 User trenn@novell.com added comment https://bugzilla.novell.com/show_bug.cgi?id=228344#c40 Thomas Renninger <trenn@novell.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |RESOLVED Resolution| |FIXED --- Comment #40 from Thomas Renninger <trenn@novell.com> 2008-08-29 03:39:03 MDT --- AFAIK Bernd got this resolved mainline and did forget to close the bug. -- Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug.
participants (1)
-
bugzilla_noreply@novell.com