[Bug 1187701] New: Ten64 fails to boot (kernel panic in initramfs) from nvme after upgrade to kernel 5.3.18-59.5-default
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701

Bug ID: 1187701
Summary: Ten64 fails to boot (kernel panic in initramfs) from nvme after upgrade to kernel 5.3.18-59.5-default
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.3
Hardware: aarch64
OS: openSUSE Leap 15.3
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: matt@traverse.com.au
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 850544
  --> http://bugzilla.opensuse.org/attachment.cgi?id=850544&action=edit
Failed boot on kernel 5.3.18-59.5-default

Hello,

After Leap 15.3 appliance images were updated recently, I have found the newer versions no longer boot on Ten64.

Working image build:
openSUSE-Leap-15.3-ARM-JeOS-efi.aarch64-2021.05.21-Build9.94.raw.xz / kernel 5.3.18-57-default

Not working:
openSUSE-Leap-15.3-ARM-JeOS-efi.aarch64-2021.05.21-Build9.106.raw.xz / kernel 5.3.18-59.5-default

If I upgrade the kernel inside Build9.94 with 'zypper up', the same issue occurs. If I then choose the previous snapshot from GRUB, the system can once again be booted with 5.3.18-57.
Boot fails due to this, during initramfs/dracut:

[ 6.227357] Internal error: synchronous external abort: 96000210 [#1] SMP
[ 6.232443] mmc0: SDHCI controller on 2140000.esdhc [2140000.esdhc] using ADMA 64-bit
[ 6.234152] Modules linked in: nvme nvme_core dwc3 sdhci_of_esdhc(+) sdhci_pltfm sdhci t10_pi mmc_core ulpi udc_core rtc_fsl_ftm_alarm i2c_imx gpio_keys sg scsi_mod
[ 6.256675] Supported: Yes
[ 6.259380] CPU: 6 PID: 7 Comm: kworker/u16:0 Not tainted 5.3.18-59.5-default #1 SLE15-SP3
[ 6.267643] Hardware name: traverse ten64/ten64, BIOS 2020.07-rc1-gb47b96d4 06/25/2021
[ 6.275581] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 6.275587] pstate: a0000005 (NzCv daif -PAN -UAO)
[ 6.275594] pc : nvme_reset_work+0x16c/0x12f8 [nvme]
[ 6.275600] lr : nvme_reset_work+0x164/0x12f8 [nvme]
[ OK 6.275601] sp : ffff8000100a3c80 m] Reached targe
[ 6.275603] x29: ffff8000100a3c80 x28: ffff0732958af0c0 t Basic
[ 6.275606] x27: ffff0732958af0c0 x26: ffff0732954412c0 System.
[ 6.275609] x25: ffff073295441300 x24: ffff073295441710
[ 6.275611] x23: ffff0732958af000 x22: ffff073295441000
[ 6.275614] x21: ffffb9bd5bd89000 x20: ffff073295440f10
[ 6.275617] x19: ffff073295441000 x18: ffffffffffffffff
[ 6.275620] x17: 0000000000000000 x16: ffffb9bd5a7a35a0
[ 6.275622] x15: ffffb9bd5bd89908 x14: 0000000000000040
[ 6.275625] x13: 0000000000000228 x12: 0000000000000000
[ 6.275628] x11: 0000000000000000 x10: 0000000000001a50
[ 6.275630] x9 : ffff8000100a3d10 x8 : 000000000000007d
[ 6.275633] x7 : 0000000000000006 x6 : 0000000000010000
[ 6.275635] x5 : 0000000000000000 x4 : 0000000000000000
[ 6.275638] x3 : 0000000080000000 x2 : 0000000000000000
[ 6.275641] x1 : ffffffffffffffff x0 : ffff8000105b201c
[ 6.275644] Call trace:
[ 6.275651] nvme_reset_work+0x16c/0x12f8 [nvme]
[ 6.275659] process_one_work+0x200/0x458
[ 6.275662] worker_thread+0x144/0x4f0
[ 6.275666] kthread+0x130/0x138
[ 6.275670] ret_from_fork+0x10/0x18
[ 6.275675] Code: aa1c03e0 940005af f9414ac0 91007000 (b9400000)
[ 6.275678] ---[ end trace c33296e2e9bf08c4 ]---

I have confirmed this issue with multiple Ten64 units and different SSD models, so it does not appear to be hardware related.

--
You are receiving this mail because:
You are the assignee for the bug.
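The bracketed word at the end of the `Code:` line is the instruction at the faulting `pc`, and the word before it is the instruction executed just prior. A minimal sketch of decoding those two words, assuming the standard A64 encodings for ADD (immediate) and LDR (immediate, unsigned offset, 32-bit); the function names are hypothetical and this is not a general disassembler:

```c
#include <stdint.h>
#include <stdio.h>

/* A64 "LDR Wt, [Xn, #imm]" (32-bit load, unsigned offset):
 * the top ten bits of the encoding are 1011100101. */
int is_ldr_w_uoff(uint32_t insn)
{
    return (insn & 0xFFC00000u) == 0xB9400000u;
}

/* A64 "ADD Xd, Xn, #imm" (64-bit, immediate not shifted):
 * the top ten bits of the encoding are 1001000100. */
int is_add_x_imm(uint32_t insn)
{
    return (insn & 0xFFC00000u) == 0x91000000u;
}

/* Common field positions shared by both encodings. */
unsigned insn_imm12(uint32_t insn) { return (insn >> 10) & 0xFFFu; }
unsigned insn_rn(uint32_t insn)    { return (insn >> 5) & 0x1Fu; }
unsigned insn_rd(uint32_t insn)    { return insn & 0x1Fu; }

/* Print a rough disassembly for the two forms handled above.
 * Register number 31 (sp) is not special-cased in this sketch. */
void describe(uint32_t insn)
{
    if (is_ldr_w_uoff(insn))
        /* The 32-bit LDR scales its immediate by 4. */
        printf("%08x: ldr w%u, [x%u, #%u]\n", (unsigned)insn,
               insn_rd(insn), insn_rn(insn), insn_imm12(insn) * 4);
    else if (is_add_x_imm(insn))
        printf("%08x: add x%u, x%u, #0x%x\n", (unsigned)insn,
               insn_rd(insn), insn_rn(insn), insn_imm12(insn));
    else
        printf("%08x: (not handled by this sketch)\n", (unsigned)insn);
}
```

Fed the last two words of the trace, `describe(0x91007000)` reports an add of 0x1c to x0 and `describe(0xb9400000)` a 32-bit load through x0, so the abort was raised by a 32-bit read at offset 0x1c from some base address. In the NVMe controller register layout, offset 0x1c is CSTS, which would fit an early status read in the reset path; that reading is an inference from the trace, not something stated in the report.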
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c1
--- Comment #1 from Mathew McBride
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c2
Mathew McBride
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c3
Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c4
--- Comment #4 from Mathew McBride
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c5
Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c6
--- Comment #6 from Daniel Wagner
> The NVMe crashdump has been uploaded to:
> ftp://support-ftp.us.suse.com/incoming/bug-1187701-ten64-nvme-crash.tar

I get no such file. Could you upload it to ziu?
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c7
--- Comment #7 from Mathew McBride
> > The NVMe crashdump has been uploaded to:
> > ftp://support-ftp.us.suse.com/incoming/bug-1187701-ten64-nvme-crash.tar
>
> I get no such file. Could you upload it to ziu?

I am not sure how to go about that; I was following the directions from
https://www.suse.com/support/kb/doc/?id=000017820

This link should work:
https://user.fm/files/v2-8d2c494fea7727dbc1efe18169586476/bug-1187701-ten64-... (146MB)
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c8
--- Comment #8 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c9
--- Comment #9 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c10
--- Comment #10 from Daniel Wagner
[ 78.869666] nvme 0002:01:00.0: Adding to iommu group 3
[ 78.875156] nvme nvme0: pci function 0002:01:00.0
[ 78.879934] Internal error: synchronous external abort: 96000210 [#2] SMP
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c13
--- Comment #13 from Mathew McBride
(In reply to Jiri Slaby from comment #11)
> (In reply to Mathew McBride from comment #4)
> > I also have an ath10k_pci wireless card (QCA6174) in this system and
> > 'modprobe ath10k_pci' simply stalls with no error messages. So a general
> > PCIe issue is quite likely here.
>
> The above crash explains this.
>
> Anyway, if you blacklist ath10k_pci, does the system boot or the nvme issue
> remains?
The nvme issue remains; it occurs in systems where the NVMe device is the only PCIe device.

It turns out I did not have kdump started for the other two cases, so I now have crashdumps for:

a) Crash on ahci module load
https://user.fm/files/v2-b292f37d90102413f7eb4af2b47dc3be/bug-1187701-ahci-c...

b) Crash on ath10k_pci module load
https://user.fm/files/v2-8a9b2e2fe9203736ec0f9bea779294c3/bug-1187701-ath10k...

In both of the above, the relevant PCIe device is the only card installed.

It is interesting to note that both of the above occur during PCI probing:

[ 124.230047] Call trace:
[ 124.232492] ahci_enable_ahci+0x20/0x90 [libahci]
[ 124.237196] ahci_save_initial_config+0x34/0x390 [libahci]
[ 124.242689] ahci_init_one+0x360/0xd7c [ahci]
[ 124.247044] local_pci_probe+0x44/0x98
[ 124.250790] pci_device_probe+0x130/0x1c0

[ 167.941476] Call trace:
[ 167.943923] ath10k_pci_wake_wait+0x44/0xf0 [ath10k_pci]
[ 167.949236] ath10k_pci_wake.part.6+0xf4/0x138 [ath10k_pci]
[ 167.954810] ath10k_bus_pci_write32+0x88/0xd8 [ath10k_pci]
[ 167.960322] ath10k_ce_deinit_pipe+0x5c/0x218 [ath10k_core]
[ 167.965896] ath10k_pci_probe+0x44c/0x828 [ath10k_pci]
[ 167.971034] local_pci_probe+0x44/0x98
[ 167.974779] pci_device_probe+0x130/0x1c0
[ 167.978786] really_probe+0xdc/0x448
[ 167.982357] driver_probe_device+0x12c/0x148
[ 167.986623] device_driver_attach+0x74/0x98
[ 167.990801] __driver_attach+0x6c/0x168
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c14
--- Comment #14 from Daniel Wagner
static void ahci_enable_ahci(void __iomem *mmio)
{
	int i;
	u32 tmp;

	/* turn on AHCI_EN */
	tmp = readl(mmio + HOST_CTL);
	if (tmp & HOST_AHCI_EN)
		return;

	/* Some controllers need AHCI_EN to be written multiple times.
	 * Try a few times before giving up.
	 */
	for (i = 0; i < 5; i++) {
		tmp |= HOST_AHCI_EN;
		writel(tmp, mmio + HOST_CTL);
		tmp = readl(mmio + HOST_CTL);	/* flush && sanity check */
		if (tmp & HOST_AHCI_EN)
			return;
		msleep(10);
	}

	WARN_ON(1);
}
The ath10k crash happens while doing an ioread32 in:

static bool ath10k_pci_is_awake(struct ath10k *ar)
{
	struct ath10k_pci *ar_pci = ath10k_pci_priv(ar);
	u32 val = ioread32(ar_pci->mem + PCIE_LOCAL_BASE_ADDRESS +
			   RTC_STATE_ADDRESS);

	return RTC_STATE_V_GET(val) == RTC_STATE_V_ON;
}
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c15
--- Comment #15 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c16
--- Comment #16 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c17
Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c18
Jürgen Groß
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c19
--- Comment #19 from Joerg Roedel
> There were no real changes for arch/arm64 and drivers/ata between
> rpm-5.3.18-57..rpm-5.3.18-59.5.
>
> As it looks like a PCI regression, I checked the changes in drivers/pci and
> there is a notable change due to bsc#1174426. No idea if they could be the
> source of the regression.
>
> Adding Joerg, who might be able to say something about bsc#1174426 and this
> report.

It is possible that the patches from bsc#1174426 are related to this. But as I am no ARM expert, I first need to know what exactly this error means:

Internal error: synchronous external abort: 96000210 [#1] SMP

What is a synchronous external abort, and what does the number mean?
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c20
--- Comment #20 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c21
--- Comment #21 from Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c22
--- Comment #22 from Daniel Wagner
{ do_sea, SIGBUS, BUS_OBJERR, "synchronous external abort" },
TL;DR: the value says 'synchronous external abort' :)
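The number in the "Internal error" line can also be taken apart by hand. A minimal sketch, assuming it is the ESR value for the exception and that the standard ARMv8-A ESR_ELx layout applies (exception class in bits 31:26, instruction length in bit 25, write-not-read in bit 6, fault status code in bits 5:0); the helper names are hypothetical, not kernel APIs:

```c
#include <stdint.h>
#include <stdio.h>

/* Field extraction following the ARMv8-A ESR_ELx layout (an assumption
 * of this sketch; the thread itself only quotes the raw number). */
unsigned esr_ec(uint32_t esr)  { return (esr >> 26) & 0x3Fu; } /* exception class */
unsigned esr_il(uint32_t esr)  { return (esr >> 25) & 0x1u; }  /* 1 = 32-bit instruction */
unsigned esr_wnr(uint32_t esr) { return (esr >> 6) & 0x1u; }   /* 1 = write, 0 = read */
unsigned esr_fsc(uint32_t esr) { return esr & 0x3Fu; }         /* fault status code */

/* Print the decoded fields of one ESR value. */
void esr_print(uint32_t esr)
{
    printf("ESR 0x%08x: EC=0x%02x IL=%u WnR=%u FSC=0x%02x\n",
           (unsigned)esr, esr_ec(esr), esr_il(esr),
           esr_wnr(esr), esr_fsc(esr));
}
```

For 0x96000210 this gives EC 0x25 (a data abort taken at the current exception level), FSC 0x10 (synchronous external abort, not on a translation table walk) and WnR 0, i.e. a read. FSC 0x10 is decimal 16, which would correspond to the table entry quoted above if the table is indexed by the fault status code, and a read fits the readl()/ioread32() call sites in the traces.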
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c30
Daniel Wagner
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c31
Mathew McBride
> Might this be an issue with the DTB? Can you please try to boot the broken
> kernel with a DTB from the good kernel and see whether this makes a
> difference?

Note that I've grabbed the dmesg's/kernel logs from several different systems, depending on what was in front of me at the time.

On the Ten64 the DTB is built into the flash and passed to the kernel via EFI, so the DTB is from a 'good kernel' in that respect. I can try producing a DTB from the Leap/SLE kernel source if needed.