[Bug 1187701] New: Ten64 fails to boot (kernel panic in initramfs) from nvme after upgrade to kernel 5.3.18-59.5-default
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701

Bug ID: 1187701
Summary: Ten64 fails to boot (kernel panic in initramfs) from nvme after upgrade to kernel 5.3.18-59.5-default
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.3
Hardware: aarch64
OS: openSUSE Leap 15.3
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
Assignee: kernel-bugs@opensuse.org
Reporter: matt@traverse.com.au
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

Created attachment 850544
  --> http://bugzilla.opensuse.org/attachment.cgi?id=850544&action=edit
Failed boot on kernel 5.3.18-59.5-default

Hello,

After Leap 15.3 appliance images were updated recently, I have found the newer versions no longer boot on Ten64.

Working image build:
openSUSE-Leap-15.3-ARM-JeOS-efi.aarch64-2021.05.21-Build9.94.raw.xz / kernel 5.3.18-57-default

Not working:
openSUSE-Leap-15.3-ARM-JeOS-efi.aarch64-2021.05.21-Build9.106.raw.xz / kernel 5.3.18-59.5-default

If I upgrade the kernel inside Build9.94 with 'zypper up', the same issue occurs. If I then choose the previous snapshot from GRUB the system can once again be booted with 5.3.18-57.
Boot fails due to this, during initramfs/dracut:

[    6.227357] Internal error: synchronous external abort: 96000210 [#1] SMP
[    6.232443] mmc0: SDHCI controller on 2140000.esdhc [2140000.esdhc] using ADMA 64-bit
[    6.234152] Modules linked in: nvme nvme_core dwc3 sdhci_of_esdhc(+) sdhci_pltfm sdhci t10_pi mmc_core ulpi udc_core rtc_fsl_ftm_alarm i2c_imx gpio_keys sg scsi_mod
[    6.256675] Supported: Yes
[    6.259380] CPU: 6 PID: 7 Comm: kworker/u16:0 Not tainted 5.3.18-59.5-default #1 SLE15-SP3
[    6.267643] Hardware name: traverse ten64/ten64, BIOS 2020.07-rc1-gb47b96d4 06/25/2021
[    6.275581] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[    6.275587] pstate: a0000005 (NzCv daif -PAN -UAO)
[    6.275594] pc : nvme_reset_work+0x16c/0x12f8 [nvme]
[    6.275600] lr : nvme_reset_work+0x164/0x12f8 [nvme]
[  OK  ] Reached target Basic System.
[    6.275601] sp : ffff8000100a3c80
[    6.275603] x29: ffff8000100a3c80 x28: ffff0732958af0c0
[    6.275606] x27: ffff0732958af0c0 x26: ffff0732954412c0
[    6.275609] x25: ffff073295441300 x24: ffff073295441710
[    6.275611] x23: ffff0732958af000 x22: ffff073295441000
[    6.275614] x21: ffffb9bd5bd89000 x20: ffff073295440f10
[    6.275617] x19: ffff073295441000 x18: ffffffffffffffff
[    6.275620] x17: 0000000000000000 x16: ffffb9bd5a7a35a0
[    6.275622] x15: ffffb9bd5bd89908 x14: 0000000000000040
[    6.275625] x13: 0000000000000228 x12: 0000000000000000
[    6.275628] x11: 0000000000000000 x10: 0000000000001a50
[    6.275630] x9 : ffff8000100a3d10 x8 : 000000000000007d
[    6.275633] x7 : 0000000000000006 x6 : 0000000000010000
[    6.275635] x5 : 0000000000000000 x4 : 0000000000000000
[    6.275638] x3 : 0000000080000000 x2 : 0000000000000000
[    6.275641] x1 : ffffffffffffffff x0 : ffff8000105b201c
[    6.275644] Call trace:
[    6.275651]  nvme_reset_work+0x16c/0x12f8 [nvme]
[    6.275659]  process_one_work+0x200/0x458
[    6.275662]  worker_thread+0x144/0x4f0
[    6.275666]  kthread+0x130/0x138
[    6.275670]  ret_from_fork+0x10/0x18
[    6.275675] Code: aa1c03e0 940005af f9414ac0 91007000 (b9400000)
[    6.275678] ---[ end trace c33296e2e9bf08c4 ]---

I have confirmed this issue with multiple Ten64 units and different SSD models, so it does not appear to be hardware related.

--
You are receiving this mail because:
You are the assignee for the bug.
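The `Code:` line in the oops above lists the instruction words around the fault, with the faulting one in parentheses. As a minimal sketch, assuming the standard A64 encodings for `ADD (immediate)` and `LDR (immediate, unsigned offset)`, the last two words can be decoded by hand. The helper names below are hypothetical and are not taken from the kernel or from any tool mentioned in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded fields of an A64 "LDR Wt, [Xn, #offset]" instruction. */
struct a64_ldr_w {
    unsigned rt;      /* destination register number (Wt) */
    unsigned rn;      /* base register number (Xn) */
    unsigned offset;  /* byte offset: imm12 scaled by the 4-byte access size */
};

/*
 * Decode a 32-bit LDR (immediate, unsigned offset).
 * Returns 1 and fills *out on a match, 0 for any other instruction.
 */
static inline int a64_decode_ldr_w(uint32_t insn, struct a64_ldr_w *out)
{
    assert(out != NULL);
    /* Bits [31:22] are fixed to 0b1011100101 for this form. */
    if ((insn & 0xFFC00000u) != 0xB9400000u)
        return 0;
    out->rt = insn & 0x1Fu;
    out->rn = (insn >> 5) & 0x1Fu;
    out->offset = ((insn >> 10) & 0xFFFu) * 4u;
    return 1;
}

/*
 * Return the immediate of a 64-bit "ADD Xd, Xn, #imm" without the
 * optional 12-bit shift, or -1 for any other instruction.
 */
static inline int a64_add_x_imm(uint32_t insn)
{
    /* Bits [31:22] are fixed to 0b1001000100 (sf=1, ADD, sh=0). */
    if ((insn & 0xFFC00000u) != 0x91000000u)
        return -1;
    return (int)((insn >> 10) & 0xFFFu);
}
```

Under these assumptions, `a64_add_x_imm(0x91007000)` gives 0x1c and `a64_decode_ldr_w(0xb9400000, &d)` yields `ldr w0, [x0]`, which matches x0 ending in `201c` in the register dump. This would be consistent with a 32-bit MMIO read at offset 0x1c of the controller's register window (the CSTS register in the NVMe register map), although nothing in the report confirms which source line is involved.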
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c1

--- Comment #1 from Mathew McBride <matt@traverse.com.au> ---
Created attachment 850545
  --> http://bugzilla.opensuse.org/attachment.cgi?id=850545&action=edit
dmesg from working kernel (5.3.18-57)
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c2

Mathew McBride <matt@traverse.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |afaerber@suse.com

--- Comment #2 from Mathew McBride <matt@traverse.com.au> ---
CC Andreas: Are you able to reproduce on your Ten64?
I assume SLES 15 SP3 is also affected.
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c3

Daniel Wagner <daniel.wagner@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |daniel.wagner@suse.com

--- Comment #3 from Daniel Wagner <daniel.wagner@suse.com> ---
Between rpm-5.3.18-57..rpm-5.3.18-59.5 we have 69 patches added in the drivers/nvme/host/ directory. I don't think bisecting is the best way forward.

From the logs I can't really see why nvme_reset_work() is triggered. Any chance to get a core dump of it?
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c4

--- Comment #4 from Mathew McBride <matt@traverse.com.au> ---
Created attachment 850592
  --> http://bugzilla.opensuse.org/attachment.cgi?id=850592&action=edit
error on loading ahci - possibly related PCIe issue

I suspect this might be a general PCIe issue instead of just nvme. I set up a system with a SATA SSD via a PCIe controller card (JMicron JMB545 based) and it failed to boot as well.

Booting from USB, I was able to blacklist nvme and ahci, and then obtain a crashdump when nvme was modprobe'd after boot.

The NVMe crashdump has been uploaded to:
ftp://support-ftp.us.suse.com/incoming/bug-1187701-ten64-nvme-crash.tar

modprobe of ahci did not cause a kernel crash, but the call trace has been attached.

I also have an ath10k_pci wireless card (QCA6174) in this system and 'modprobe ath10k_pci' simply stalls with no error messages. So a general PCIe issue is quite likely here.
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c5

Daniel Wagner <daniel.wagner@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jslaby@suse.com
              Flags|                            |needinfo?(jslaby@suse.com)

--- Comment #5 from Daniel Wagner <daniel.wagner@suse.com> ---
Indeed, it's not the storage subsystem which crashes. Adding Jiri, as he might know what's going on here.
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c6

--- Comment #6 from Daniel Wagner <daniel.wagner@suse.com> ---
> The NVMe crashdump has been uploaded to:
> ftp://support-ftp.us.suse.com/incoming/bug-1187701-ten64-nvme-crash.tar

I get no such file. Could you upload it to ziu?
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c7

--- Comment #7 from Mathew McBride <matt@traverse.com.au> ---
(In reply to Daniel Wagner from comment #6)
> > The NVMe crashdump has been uploaded to:
> > ftp://support-ftp.us.suse.com/incoming/bug-1187701-ten64-nvme-crash.tar
>
> I get no such file. Could you upload it to ziu?

Not sure how I can go about that; I was following the directions from
https://www.suse.com/support/kb/doc/?id=000017820

This link should work:
https://user.fm/files/v2-8d2c494fea7727dbc1efe18169586476/bug-1187701-ten64-...
(146MB)
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c8

--- Comment #8 from Daniel Wagner <daniel.wagner@suse.com> ---
I am sorry Mathew. I have to ask our L3 folks (they have access to the ftp servers) to help out with uploading it to ziu.

Anyway, your download link works for me so we don't have to bother L3.
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c9

--- Comment #9 from Daniel Wagner <daniel.wagner@suse.com> ---
Created attachment 850637
  --> http://bugzilla.opensuse.org/attachment.cgi?id=850637&action=edit
dmesg 5.3.18-59.5-default
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c10

--- Comment #10 from Daniel Wagner <daniel.wagner@suse.com> ---
This is logged right before the nvme core crashes:

[   78.869666] nvme 0002:01:00.0: Adding to iommu group 3
[   78.875156] nvme nvme0: pci function 0002:01:00.0
[   78.879934] Internal error: synchronous external abort: 96000210 [#2] SMP
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c13

--- Comment #13 from Mathew McBride <matt@traverse.com.au> ---
(In reply to Jiri Slaby from comment #12)
> (In reply to Jiri Slaby from comment #11)
> > (In reply to Mathew McBride from comment #4)
> > > I also have an ath10k_pci wireless card (QCA6174) in this system and
> > > 'modprobe ath10k_pci' simply stalls with no error messages. So a
> > > general PCIe issue is quite likely here.
> >
> > The above crash explains this.
>
> Anyway, if you blacklist ath10k_pci, does the system boot, or does the nvme
> issue remain?

The nvme issue remains; it occurs in systems where the NVMe device is the only PCIe device.

It turns out I did not have kdump started for the other two cases, so I now have crashdumps for:

a) Crash on ahci module load
https://user.fm/files/v2-b292f37d90102413f7eb4af2b47dc3be/bug-1187701-ahci-c...

b) Crash on ath10k_pci module load
https://user.fm/files/v2-8a9b2e2fe9203736ec0f9bea779294c3/bug-1187701-ath10k...

In both of the above, the relevant PCIe device is the only card installed.

It is interesting to note that both of the above occur through PCI probing:

[  124.230047] Call trace:
[  124.232492]  ahci_enable_ahci+0x20/0x90 [libahci]
[  124.237196]  ahci_save_initial_config+0x34/0x390 [libahci]
[  124.242689]  ahci_init_one+0x360/0xd7c [ahci]
[  124.247044]  local_pci_probe+0x44/0x98
[  124.250790]  pci_device_probe+0x130/0x1c0

[  167.941476] Call trace:
[  167.943923]  ath10k_pci_wake_wait+0x44/0xf0 [ath10k_pci]
[  167.949236]  ath10k_pci_wake.part.6+0xf4/0x138 [ath10k_pci]
[  167.954810]  ath10k_bus_pci_write32+0x88/0xd8 [ath10k_pci]
[  167.960322]  ath10k_ce_deinit_pipe+0x5c/0x218 [ath10k_core]
[  167.965896]  ath10k_pci_probe+0x44c/0x828 [ath10k_pci]
[  167.971034]  local_pci_probe+0x44/0x98
[  167.974779]  pci_device_probe+0x130/0x1c0
[  167.978786]  really_probe+0xdc/0x448
[  167.982357]  driver_probe_device+0x12c/0x148
[  167.986623]  device_driver_attach+0x74/0x98
[  167.990801]  __driver_attach+0x6c/0x168
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c14

--- Comment #14 from Daniel Wagner <daniel.wagner@suse.com> ---
The ahci crash happens doing the readl in:

static void ahci_enable_ahci(void __iomem *mmio)
{
        int i;
        u32 tmp;

        /* turn on AHCI_EN */
        tmp = readl(mmio + HOST_CTL);
        if (tmp & HOST_AHCI_EN)
                return;

        /* Some controllers need AHCI_EN to be written multiple times.
         * Try a few times before giving up.
         */
        for (i = 0; i < 5; i++) {
                tmp |= HOST_AHCI_EN;
                writel(tmp, mmio + HOST_CTL);
                tmp = readl(mmio + HOST_CTL);   /* flush && sanity check */
                if (tmp & HOST_AHCI_EN)
                        return;
                msleep(10);
        }

        WARN_ON(1);
}
The ath10k crash happens doing an ioread32 in:

static bool ath10k_pci_is_awake(struct ath10k *ar)
{
        struct ath10k_pci *ar_pci = ath10k_pci_priv(ar);
        u32 val = ioread32(ar_pci->mem + PCIE_LOCAL_BASE_ADDRESS +
                           RTC_STATE_ADDRESS);

        return RTC_STATE_V_GET(val) == RTC_STATE_V_ON;
}
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c15

--- Comment #15 from Daniel Wagner <daniel.wagner@suse.com> ---
Created attachment 850709
  --> http://bugzilla.opensuse.org/attachment.cgi?id=850709&action=edit
dmesg-ath10k.txt
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c16

--- Comment #16 from Daniel Wagner <daniel.wagner@suse.com> ---
Created attachment 850710
  --> http://bugzilla.opensuse.org/attachment.cgi?id=850710&action=edit
dmesg-ahci.txt
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c17

Daniel Wagner <daniel.wagner@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jgross@suse.com
              Flags|                            |needinfo?(jgross@suse.com)

--- Comment #17 from Daniel Wagner <daniel.wagner@suse.com> ---
There were no real changes for arch/arm64 and drivers/ata between rpm-5.3.18-57..rpm-5.3.18-59.5.

As it looks like a PCI regression, I checked the changes in drivers/pci and there is a notable change due to bsc#1174426. No idea if they could be the source of the regression.

Adding Joerg, who might be able to say something about bsc#1174426 and this report.
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c18

Jürgen Groß <jgross@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|jgross@suse.com             |jroedel@suse.com
              Flags|needinfo?(jgross@suse.com)  |needinfo?(jroedel@suse.com)

--- Comment #18 from Jürgen Groß <jgross@suse.com> ---
I'm not Joerg. Adding him.
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c19

--- Comment #19 from Joerg Roedel <jroedel@suse.com> ---
(In reply to Daniel Wagner from comment #17)
> There were no real changes for arch/arm64 and drivers/ata between
> rpm-5.3.18-57..rpm-5.3.18-59.5.
>
> As it looks like a PCI regression, I checked the changes in drivers/pci and
> there is a notable change due to bsc#1174426. No idea if they could be the
> source of the regression.
>
> Adding Joerg, who might be able to say something about bsc#1174426 and this
> report.

It is possible that the patches from bsc#1174426 are related to this. But as I am no ARM expert, I first need to know what exactly this error means:

Internal error: synchronous external abort: 96000210 [#1] SMP

What is a synchronous external abort and what does the number mean?
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c20

--- Comment #20 from Daniel Wagner <daniel.wagner@suse.com> ---
No ARM expert either.

Internal error: synchronous external abort: 96000210 [#1] SMP

Stack Overflow [1] says:

"""
The ARMv7 ARM section "VMSA Memory aborts" covers this as thoroughly as one would expect (given that it's the authoritative definition of the architecture), but to summarise in slightly less than 14 pages;

An abort means the CPU tried to make a memory access, which for whatever reason, couldn't be completed so raises an exception.

An external abort is one from, well, externally to the processor, i.e. something on the bus. In other words, the access didn't fault in the MMU, went out onto the bus, and either some device or the interconnect itself came back and said "hey, I can't deal with this".

A synchronous external abort means you're rather fortunate, in that it's not going to be utterly hideous to debug - in the case of a prefetch abort, it means the IFAR is going to contain a valid VA for the faulting instruction, so you know exactly what caused it. The unpleasant alternative is an asynchronous external abort, which is little more than an interrupt to say "hey, something you did a while ago didn't actually work. No I don't know what it was either."
"""

The number is the content of the ESR register. It seems to be a very common register value; I found a lot of bug reports with it. There is a bit of documentation on it in [2] and [3]. Not sure if this helps at all.

[1] https://stackoverflow.com/questions/27507013/synchronous-external-abort-on-a...
[2] https://developer.arm.com/documentation/ddi0595/2020-12/AArch64-Registers/ES...
[3] https://developer.arm.com/documentation/den0024/a/AArch64-Exception-Handling
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c21

--- Comment #21 from Daniel Wagner <daniel.wagner@suse.com> ---
D13.2.37 ESR_EL1, Exception Syndrome Register (EL1)

EC, bits [31:26]

  EC == 0b100101
    Data Abort taken without a change in Exception level. Used for MMU faults generated by data accesses, alignment faults other than those caused by Stack Pointer misalignment, and synchronous External aborts, including synchronous parity or ECC errors. Not used for debug-related exceptions.

IL, bit [25]

  0b1
    32-bit instruction trapped. This value is also used when the exception is one of the following:
      An SError interrupt.
      [...]

ISS, bits [24:0]

  Instruction Specific Syndrome. Architecturally, this field can be defined independently for each defined Exception class. However, in practice, some ISS encodings are used for more than one Exception class.

ISS encoding for an SError interrupt

  EA, bit [9]
    External abort type

  DFSC, bits [5:0]
    0b000000  Uncategorized error.
    0b010001  Asynchronous SError interrupt.

I couldn't find a 0b010000 value, but I don't expect there is much more to figure out.

crash> dis ahci_enable_ahci+32
0xffffb4715d00ae88 <ahci_enable_ahci+32>:       ldr     w19, [x20]
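The field layout quoted above can be applied directly to the reported value 0x96000210. The following is a minimal sketch, assuming only the bit positions given in the excerpt (EC in bits [31:26], IL in bit [25], EA in bit [9], DFSC in bits [5:0]). The struct and function names are hypothetical and do not come from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The ESR_EL1 fields discussed in this comment (names are illustrative). */
struct esr_fields {
    unsigned ec;    /* bits [31:26]: exception class */
    unsigned il;    /* bit  [25]:    1 for a 32-bit instruction */
    unsigned ea;    /* bit  [9]:     external abort type (part of the ISS) */
    unsigned dfsc;  /* bits [5:0]:   data fault status code (part of the ISS) */
};

/* Split a raw 32-bit ESR value into the fields above. */
static inline void esr_decode(uint32_t esr, struct esr_fields *f)
{
    assert(f != NULL);
    f->ec   = (esr >> 26) & 0x3Fu;
    f->il   = (esr >> 25) & 0x1u;
    f->ea   = (esr >> 9)  & 0x1u;
    f->dfsc = esr & 0x3Fu;
}
```

For 0x96000210 this gives EC = 0b100101 (Data Abort without a change in Exception level), IL = 1, EA = 1 and DFSC = 0b010000. In the architecture manual, that DFSC value is listed under the ISS encoding for a Data Abort rather than for an SError interrupt, where it denotes a synchronous External abort that is not on a translation table walk.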
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c22

--- Comment #22 from Daniel Wagner <daniel.wagner@suse.com> ---
#define ESR_ELx_FSC             (0x3F)

static inline const struct fault_info *esr_to_fault_info(unsigned int esr)
{
        return fault_info + (esr & ESR_ELx_FSC);
}

Ok, it's just the index and the fault_info table contains

        { do_sea,               SIGBUS,  BUS_OBJERR,    "synchronous external abort"    },

TL;DR: the value says 'synchronous external abort' :)
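The lookup described above can be reproduced outside the kernel. This is a minimal sketch, assuming a cut-down stand-in for the kernel's `fault_info` table that holds only a few names; the table contents other than index 0x10, and the names `fault_names` and `esr_fault_name`, are illustrative and not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ESR_ELx_FSC      0x3Fu  /* same mask as in the kernel snippet */
#define FAULT_TABLE_SIZE 64u    /* one slot per possible 6-bit status code */

/* Hypothetical, partial name table indexed by the fault status code. */
static const char *const fault_names[FAULT_TABLE_SIZE] = {
    [0x04] = "level 0 translation fault",
    [0x05] = "level 1 translation fault",
    [0x06] = "level 2 translation fault",
    [0x07] = "level 3 translation fault",
    [0x10] = "synchronous external abort",
    [0x21] = "alignment fault",
};

/* Map a raw ESR value to a name, mirroring the "esr & ESR_ELx_FSC" index. */
static inline const char *esr_fault_name(uint32_t esr)
{
    uint32_t idx = esr & ESR_ELx_FSC;

    assert(idx < FAULT_TABLE_SIZE);
    return fault_names[idx] != NULL ? fault_names[idx] : "unknown";
}
```

With this table, `esr_fault_name(0x96000210)` selects index 0x10 (decimal 16), the "synchronous external abort" entry, which agrees with the message printed in the oops.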
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c30

Daniel Wagner <daniel.wagner@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |NORESPONSE

--- Comment #30 from Daniel Wagner <daniel.wagner@suse.com> ---
No feedback for a while. Please reopen if this is still a problem.
http://bugzilla.opensuse.org/show_bug.cgi?id=1187701#c31

Mathew McBride <matt@traverse.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|NORESPONSE                  |---

--- Comment #31 from Mathew McBride <matt@traverse.com.au> ---
Apologies, I got distracted by other projects.

This is still a problem; thankfully, it does not exist with the newer kernel on Leap 15.4.

I have even attempted to bisect this but can't find any obvious cause. I think it came with this merge:

commit 691336b369202a4fd8d75a0a255a88ef592f00d9
Merge: 30270eafb6d5 2c269e3edad3
Author: Denis Kirjanov <dkirjanov@suse.com>
Date:   Fri May 21 15:57:15 2021 +0300

    Merge branch 'users/dkirjanov/SLE15-SP3-UPDATE/for-next' into SLE15-SP3-UPDATE

    Merge SLE15-SP2 into SLE15-SP3-UPDATE with a kabi fix from Michal Suchanek

    suse-commit: d970db8f5beecbc55b223922ed9f3d6dbb885814

A kernel compiled on the commit immediately prior (according to GitHub) - d4cb7424a2afeb740fd05840619e26de413aed45 "lpfc: Decouple port_template and vport_template (bsc#185032)" works.

> Might this be an issue with the DTB? Can you please try to boot the broken
> kernel with a DTB from the good kernel and see whether this makes a difference?

Note that I've grabbed the dmesg output/kernel logs from several different systems, depending on what was in front of me at the time.

On the Ten64 the DTB is built into the flash and passed to the kernel via EFI, so the DTB is from a 'good kernel' in that respect. I can try producing a DTB from the Leap/SLE kernel source if needed.