[kernel-bugs] [Bug 1177800] New: [Ten64] getsysinfo caused kernel error (synchronous external abort)
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800

            Bug ID: 1177800
           Summary: [Ten64] getsysinfo caused kernel error (synchronous external abort)
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: aarch64
                OS: openSUSE Tumbleweed
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: afaerber@suse.com
        QA Contact: qa-bugs@suse.de
                CC: matt@traverse.com.au, snwint@suse.com, yousaf.kaukab@suse.com
          Found By: ---
           Blocker: ---

Running `getsysinfo` (via ssh) on Tumbleweed 20201011 with kernel-default 5.9.0 from Kernel:HEAD repository caused a kernel error, with ssh getting stuck and reconnections failing. Serial login still worked.

zehn:~ # getsysinfo
/proc/bus/input
/proc/cpuinfo
/proc/device-tree
/proc/devices
/proc/fb
/proc/filesystems
/proc/interrupts
/proc/iomem
/proc/ioports
/proc/meminfo
/proc/modules
/proc/net/dev
/proc/partitions
/proc/scsi
/proc/tty
/proc/version
/sys
/usr/sbin/getsysinfo: line 23: 2761 Segmentation fault cp -x -a --parents "$i" "$dir/$host" 2> /dev/null
/var/lib/hardware/udi
/proc/mounts

System data written to: /tmp/zehn.tar.gz
zehn:~ #
[ 544.956478] Internal error: synchronous external abort: 96000210 [#1] SMP
[ 544.963295] Modules linked in: af_packet ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_tcpudp ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security iscsi_ibft iscsi_boot_sysfs ip_set nfnetlink ebtable_filter ebtables rfkill ip6table_filter ip6_tables iptable_filter ip_tables x_tables fsl_dpaa2_ptp ptp_qoriq fsl_dpaa2_eth phylink xgmac_mdio hid_generic fsl_mc_dpio usbhid cdc_acm i2c_mux_pca954x i2c_mux tpm_i2c_atmel spi_fsl_qspi qoriq_thermal leds_gpio optee tee uio_pdrv_genirq uio qoriq_cpufreq nls_iso8859_1 nls_cp437 vfat fat drm xhci_plat_hcd xhci_hcd usbcore caam_jr mmc_block libdes authenc caamhash_desc caamalg_desc crypto_engine rtc_ds1307 mp886x aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce dpaa2_console nvme nvme_core dwc3 sdhci_of_esdhc caam sdhci_pltfm ulpi
[ 544.963547] error sdhci udc_core roles mmc_core i2c_imx btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[ 545.066303] CPU: 0 PID: 2761 Comm: cp Not tainted 5.9.0-1.g11733e1-default #1 openSUSE Tumbleweed (unreleased)
[ 545.076303] Hardware name: traverse ten64/ten64, BIOS 2020.07-rc1-g1d4a3d9d5c 07/28/2020
[ 545.084393] pstate: 60000085 (nZCv daIf -PAN -UAO BTYPE=--)
[ 545.089967] pc : dw_pcie_read+0x48/0xc0
[ 545.093799] lr : dw_pcie_access_other_conf.isra.0+0xc4/0x120
[ 545.099452] sp : ffff800012203bf0
[ 545.102759] x29: ffff800012203bf0 x28: 0000000000000400
[ 545.108067] x27: ffff00833e683400 x26: ffff008348240000
[ 545.113375] x25: ffff800010085000 x24: 0000000000000000
[ 545.118683] x23: ffff00834b239880 x22: ffff800012203cc4
[ 545.123991] x21: 0000000000000004 x20: 0000000000000400
[ 545.129299] x19: ffff00834b2398a8 x18: 0000000000000000
[ 545.134607] x17: 0000000000000000 x16: ffffafadaf1be170
[ 545.139915] x15: 0000000000000000 x14: 0000000000000000
[ 545.145222] x13: 0000000000000000 x12: 0000000000000040
[ 545.150530] x11: ffff00834bd8c920 x10: 0000000000000000
[ 545.155837] x9 : ffffafadaf12e8f4 x8 : 0000000002080000
[ 545.161145] x7 : 0000000000000000 x6 : ffff800010f00000
[ 545.166452] x5 : 0000000000000000 x4 : 0000000000000908
[ 545.171759] x3 : 0000000000000003 x2 : ffff800012203cc4
[ 545.177066] x1 : 0000000000000004 x0 : ffff800010085400
[ 545.182374] Call trace:
[ 545.184816] dw_pcie_read+0x48/0xc0
[ 545.188298] dw_pcie_rd_conf+0x11c/0x150
[ 545.192217] pci_user_read_config_dword+0xa8/0x190
[ 545.197004] pci_read_config+0x1f8/0x264
[ 545.200923] sysfs_kf_bin_read+0x78/0xa0
[ 545.204840] kernfs_file_direct_read+0x90/0x220
[ 545.209365] kernfs_fop_read+0x44/0x50
[ 545.213111] vfs_read+0xb8/0x1e4
[ 545.216334] ksys_read+0x78/0x110
[ 545.219643] __arm64_sys_read+0x28/0x34
[ 545.223476] el0_svc_common.constprop.0+0x84/0x230
[ 545.228261] do_el0_svc+0x30/0xa0
[ 545.231572] el0_svc+0x18/0x50
[ 545.234621] el0_sync_handler+0x90/0x254
[ 545.238538] el0_sync+0x158/0x180
[ 545.241849] Code: 528010e0 d50323bf b900005f d65f03c0 (b9400001)
[ 545.247941] ---[ end trace 68383e7eecaae870 ]---

--
You are receiving this mail because:
You are the assignee for the bug.
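The number printed after "synchronous external abort:" can be unpacked to see what kind of access faulted. The following is a minimal sketch, assuming that the value (96000210 here) is the raw ESR_EL1 syndrome as arm64 kernels print it, and using the ESR_ELx field layout from the Arm architecture documentation. The function name `decode_esr` is made up for this illustration.

```shell
#!/bin/sh
# Sketch only: split an arm64 exception syndrome value into a few fields.
# Assumed field layout (ESR_ELx for data aborts):
#   EC   = bits [31:26]  exception class
#   WnR  = bit  [6]      0 = read access, 1 = write access
#   DFSC = bits [5:0]    data fault status code
# decode_esr is a hypothetical helper name, not an existing tool.
decode_esr() {
    esr=$(( 0x${1#0x} ))            # accept the value with or without 0x
    ec=$(( (esr >> 26) & 0x3f ))
    wnr=$(( (esr >> 6) & 0x1 ))
    dfsc=$(( esr & 0x3f ))
    printf 'EC=0x%02x WnR=%d DFSC=0x%02x\n' "$ec" "$wnr" "$dfsc"
}

decode_esr 96000210
```

With that layout, EC 0x25 is a data abort taken without a change in exception level (that is, in kernel mode), WnR=0 marks a read, and DFSC 0x10 is a synchronous external abort that did not occur on a translation table walk. This agrees with the faulting load in `dw_pcie_read` shown in the trace.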
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800#c1

--- Comment #1 from Andreas Färber <afaerber@suse.com> ---
(In reply to Andreas Färber from comment #0)
> Running `getsysinfo` (via ssh) on Tumbleweed 20201011 with kernel-default 5.9.0 from Kernel:HEAD repository caused a kernel error, with ssh getting stuck and reconnections failing. Serial login still worked.

Correction: pressing Enter on the serial console brought up the login prompt, but the login timed out. After reset and reboot the /tmp tarball was gone. Sadly, getsysinfo does not support any command line argument for the output location.

On retrying, I was able to log in via serial and copy the file elsewhere, but the reboot got stuck in RCU errors and I had to reset again.
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800#c2

--- Comment #2 from Andreas Färber <afaerber@suse.com> ---
Similar issue with Tumbleweed's 5.8.14 - ssh still working (to exit) but serial running into watchdog BUGs afterwards.

[ 171.622422] Internal error: synchronous external abort: 96000210 [#1] SMP
[ 171.629242] Modules linked in: af_packet ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_tcpudp ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security iscsi_ibft iscsi_boot_sysfs ip_set nfnetlink ebtable_filter ebtables rfkill ip6table_filter ip6_tables iptable_filter ip_tables x_tables fsl_dpaa2_ptp fsl_dpaa2_eth ptp_qoriq phylink hid_generic fsl_mc_dpio xgmac_mdio usbhid cdc_acm i2c_mux_pca954x i2c_mux tpm_i2c_atmel qoriq_thermal spi_fsl_qspi optee tee uio_pdrv_genirq uio leds_gpio qoriq_cpufreq nls_iso8859_1 nls_cp437 vfat fat drm xhci_plat_hcd xhci_hcd usbcore mmc_block caam_jr rtc_ds1307 libdes authenc caamhash_desc caamalg_desc crypto_engine mp886x aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce dpaa2_console nvme sdhci_of_esdhc sdhci_pltfm nvme_core dwc3 sdhci caam
[ 171.629496] mmc_core error ulpi udc_core roles i2c_imx btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[ 171.732254] CPU: 0 PID: 2539 Comm: cp Not tainted 5.8.14-1-default #1 openSUSE Tumbleweed
[ 171.740429] Hardware name: traverse ten64/ten64, BIOS 2020.07-rc1-g1d4a3d9d5c 07/28/2020
[ 171.748519] pstate: 60000085 (nZCv daIf -PAN -UAO BTYPE=--)
[ 171.754092] pc : dw_pcie_read+0x48/0xc0
[ 171.757925] lr : dw_pcie_access_other_conf.isra.0+0xcc/0x124
[ 171.763578] sp : ffff800012153be0
[ 171.766886] x29: ffff800012153be0 x28: 0000000000000400
[ 171.772194] x27: ffff00833f992400 x26: ffff0083481c2000
[ 171.777501] x25: ffff80001008d000 x24: 0000000000000000
[ 171.782809] x23: ffff800012153cc4 x22: 0000000000000004
[ 171.788117] x21: ffff00834b214080 x20: 0000000000000400
[ 171.793424] x19: ffff00834b2140a8 x18: 0000000000000000
[ 171.798732] x17: 0000000000000000 x16: ffffc50cf0986f10
[ 171.804039] x15: 0000000000000000 x14: 0000000000000000
[ 171.809347] x13: 0000000000000000 x12: 0000000000000040
[ 171.814654] x11: ffff00834bd8d488 x10: 0000000000000001
[ 171.819962] x9 : ffffc50cf1047ecc x8 : 0000000002080000
[ 171.825270] x7 : 0000000000000000 x6 : ffff800010f00000
[ 171.830578] x5 : 0000000000000000 x4 : 0000000000000908
[ 171.835885] x3 : 0000000000000003 x2 : ffff800012153cc4
[ 171.841193] x1 : 0000000000000004 x0 : ffff80001008d400
[ 171.846502] Call trace:
[ 171.848944] dw_pcie_read+0x48/0xc0
[ 171.852428] dw_pcie_rd_conf+0x148/0x180
[ 171.856347] pci_user_read_config_dword+0xa8/0x190
[ 171.861135] pci_read_config+0x1f8/0x264
[ 171.865054] sysfs_kf_bin_read+0x78/0xa0
[ 171.868971] kernfs_file_direct_read+0x90/0x220
[ 171.873497] kernfs_fop_read+0x44/0x50
[ 171.877241] vfs_read+0xb8/0x1d0
[ 171.880464] ksys_read+0x78/0x10c
[ 171.883773] __arm64_sys_read+0x28/0x34
[ 171.887604] el0_svc_common.constprop.0+0x84/0x230
[ 171.892390] do_el0_svc+0x30/0xa0
[ 171.895700] el0_svc+0x18/0x50
[ 171.898749] el0_sync_handler+0x90/0x254
[ 171.902666] el0_sync+0x158/0x180
[ 171.905979] Code: 528010e0 d50323bf b900005f d65f03c0 (b9400001)
[ 171.912069] ---[ end trace 54772d5159fa3103 ]---
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800#c3

Andreas Färber <afaerber@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jcheung@suse.com,
                   |                            |lpechacek@suse.com,
                   |                            |ptesarik@suse.com

--- Comment #3 from Andreas Färber <afaerber@suse.com> ---
For comparison, on a MacchiatoBin with kernel 5.8.14 `getsysinfo` also produces one kernel error on serial console, but the machine remains usable:

[ 1252.719947] BUG: Bad page state in process getsysinfo pfn:7fe40

mack:~ # getsysinfo
/proc/bus/input
/proc/cpuinfo
/proc/device-tree
/proc/devices
/proc/fb
/proc/filesystems
/proc/interrupts
/proc/iomem
/proc/ioports
/proc/meminfo
/proc/modules
/proc/net/dev
/proc/partitions
/proc/scsi
/proc/tty
/proc/version
/sys
/var/lib/hardware/udi
/proc/mounts

System data written to: /tmp/mack.tar.gz
mack:~ #
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800#c4

--- Comment #4 from Andreas Färber <afaerber@suse.com> ---
Confirming that running `/usr/sbin/getsysinfo` as non-root user works okay.
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800
http://bugzilla.opensuse.org/show_bug.cgi?id=1177800#c6

--- Comment #6 from Mathew McBride <matt@traverse.com.au> ---
I had a fresh look into this today and managed to find the cause of the problem!

In summary, the Layerscape PCIe controller raises a synchronous external abort when PCI config data is read for the PCIe switch/bridge. This read does not happen in normal operation but is triggered by getsysinfo archiving/enumerating the /sys tree, where the PCI config registers can be read out as a file.

The synchronous abort problem exists in mainline kernels and on non-SUSE systems as well.

The Ten64 retail (1064-0201C) board has a Diodes/Pericom PI7C9X2G304SV PCIe switch that splits one PCIe lane into two PCIe 2.0 ports for the miniPCIe slots:

lspci -nn
0000:00:00.0 PCI bridge [0604]: Freescale Semiconductor Inc Device [1957:80c0] (rev 10)
0001:00:00.0 PCI bridge [0604]: Freescale Semiconductor Inc Device [1957:80c0] (rev 10)
0001:01:00.0 PCI bridge [0604]: Pericom Semiconductor Device [12d8:b304] (rev 01)
0001:02:01.0 PCI bridge [0604]: Pericom Semiconductor Device [12d8:b304] (rev 01)
0001:02:02.0 PCI bridge [0604]: Pericom Semiconductor Device [12d8:b304] (rev 01)
0001:03:00.0 Unclassified device [0002]: MEDIATEK Corp. MT7915E 802.11ax PCI Express Wireless Network Adapter [14c3:7915]
0001:04:00.0 Network controller [0280]: Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter [168c:003c]
0002:00:00.0 PCI bridge [0604]: Freescale Semiconductor Inc Device [1957:80c0] (rev 10)
0002:01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO [144d:a80a]

root@recovery000afa24295d:/tmp# lspci -tnn
-+-[0002:00]---00.0-[01-ff]----00.0
 +-[0001:00]---00.0-[01-ff]----00.0-[02-04]--+-01.0-[03]----00.0
 |                                           \-02.0-[04]----00.0

If the PCIe switch is hidden (by disabling its upstream PCIe controller in the FDT blob) or missing (it has been removed from some Ten64 board variants), the problem does not occur and getsysinfo will not cause a panic.

FreeBSD had a similar issue and the cause sounds very similar to what is happening here.

"pci: Don't try to read cfg registers of non-existing devices

Instead of returning 0xffs some controllers, such as Layerscape generate an external exception when someone attempts to read any register of config space of a non-existing device other than PCIR_VENDOR. This causes a kernel panic.
Fix it by bailing during device enumeration if a device vendor register returns invalid value. (0xffff)
Use this opportunity to replace some hardcoded values with a macro."

From https://cgit.freebsd.org/src/commit/?id=68cbe189fdd3c572476f8af9219a5d335f05...

I have been able to isolate it down to the 'config' sysfs file. Here is a reduced testcase:

for i in $(find /sys/devices/platform/soc/3500000.pcie -type f); do
    echo "Opening $i"
    echo "------------------------------------------"
    sleep 1 # allow time for console to flush
    cat "$i"
    echo "------------------------------------------"
done

....
------------------------------------------
Opening /sys/devices/platform/soc/3500000.pcie/pci0001:00/0001:00:00.0/0001:01:00.0/0001:02:02.0/config
------------------------------------------
[ 150.192901] Internal error: synchronous external abort: 96000210 [#1] SMP

I have verified the problem exists on non-SUSE systems so it's just a kernel bug (including 5.19.0-rc5) which getsysinfo triggers.

Here is the trace from the latest Tumbleweed snapshot:
openSUSE-Tumbleweed-ARM-JeOS-efi.aarch64-2022.07.01-Snapshot20220704.raw.xz
Linux localhost.localdomain 5.18.6-1-default #1 SMP PREEMPT_DYNAMIC Thu Jun 23 05:46:18 UTC 2022 (5aa0763) aarch64 aarch64 aarch64 GNU/Linux

[ 36.849750][ T2016] Internal error: synchronous external abort: 96000210 [#1] SMP
[ 36.857252][ T2016] Modules linked in: af_packet mt7915e ath10k_pci ath10k_core mt76_connac_lib mt76 ath mac80211 libarc4 fsl_dpaa2_eth pcs_lynx cfg80211 phylink rfkill i2c_mux_pca954x i2c_mux pci_endpoint_test tpm_i2c_atmel qoriq_thermal tee sfp uio_pdrv_genirq mdio_i2c leds_gpio uio qoriq_cpufreq nls_iso8859_1 nls_cp437 vfat fat fuse drm ip_tables x_tables xhci_plat_hcd xhci_hcd caam_jr crypto_engine usbcore dpaa2_caam caamhash_desc caamalg_desc aes_ce_blk aes_ce_cipher crct10dif_ce ghash_ce gf128mul sha2_ce sha256_arm64 sha1_ce sp805_wdt fsl_mc_dpio dpaa2_console authenc libdes caam nvme nvme_core error dwc3 sdhci_of_esdhc sdhci_pltfm sdhci udc_core rtc_fsl_ftm_alarm roles mmc_core ulpi i2c_imx usb_common gpio_keys btrfs blake2b_generic xor xor_neon raid6_pq libcrc32c dm_mirror dm_region_hash dm_log dm_mod sg
[ 36.929404][ T2016] CPU: 0 PID: 2016 Comm: cp Not tainted 5.18.6-1-default #1 openSUSE Tumbleweed a3ce01492e87efb4fa7f3baf169c992c0c69c4b7
[ 36.941846][ T2016] Hardware name: traverse ten64/ten64, BIOS 2020.07-rc1-ga94e0d21 03/15/2022
[ 36.950460][ T2016] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 36.958119][ T2016] pc : pci_generic_config_read+0x44/0xcc
[ 36.963613][ T2016] lr : pci_generic_config_read+0x30/0xcc
[ 36.969099][ T2016] sp : ffff80000a31b9f0
[ 36.973105][ T2016] x29: ffff80000a31b9f0 x28: ffff08be45472400 x27: 0000000000000400
[ 36.980941][ T2016] x26: 00000000000003ff x25: ffff08be45472000 x24: 0000000000001000
[ 36.988779][ T2016] x23: 0000000000001000 x22: ffff80000a31bae4 x21: ffffbac41ea22fa0
[ 36.996616][ T2016] x20: ffff80000a31ba64 x19: 0000000000000004 x18: 0000000000000000
[ 37.004453][ T2016] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 37.012289][ T2016] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 37.020132][ T2016] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffbac41ca785dc
[ 37.027975][ T2016] x8 : 0000000000000004 x7 : ffff800008e00000 x6 : ffff800008e00000
[ 37.035816][ T2016] x5 : ffff08be41a93c80 x4 : 0000000000000908 x3 : 0000000000000000
[ 37.043656][ T2016] x2 : 0000000000000000 x1 : ffff08be4a0de000 x0 : ffff800008202400
[ 37.051494][ T2016] Call trace:
[ 37.054631][ T2016] pci_generic_config_read+0x44/0xcc
[ 37.059774][ T2016] dw_pcie_rd_other_conf+0x24/0x7c
[ 37.064741][ T2016] pci_user_read_config_dword+0x84/0x124
[ 37.070229][ T2016] pci_read_config+0xf0/0x2a0
[ 37.074760][ T2016] sysfs_kf_bin_read+0x78/0xa0
[ 37.079378][ T2016] kernfs_fop_read_iter+0xac/0x1d4
[ 37.084344][ T2016] new_sync_read+0xd8/0x160
[ 37.088700][ T2016] vfs_read+0x19c/0x1e4
[ 37.092710][ T2016] ksys_read+0x78/0x10c
[ 37.096718][ T2016] __arm64_sys_read+0x28/0x34
[ 37.101248][ T2016] invoke_syscall+0x78/0x100
[ 37.105693][ T2016] el0_svc_common.constprop.0+0x58/0x190
[ 37.111181][ T2016] do_el0_svc+0x30/0x90
[ 37.115191][ T2016] el0_svc+0x34/0x130
[ 37.119029][ T2016] el0t_64_sync_handler+0x10c/0x140
[ 37.124080][ T2016] el0t_64_sync+0x1a0/0x1a4
[ 37.128439][ T2016] Code: 7100067f 540001c0 71000a7f 54000280 (b9400001)
[ 37.135228][ T2016] ---[ end trace 0000000000000000 ]---
[ 37.140539][ T2016] note: cp[2016] exited with preempt_count 1

And from Leap 15.4:
Linux localhost 5.14.21-150400.22-default #1 SMP PREEMPT_DYNAMIC Wed May 11 06:57:18 UTC 2022 (49db222) aarch64 aarch64 aarch64 GNU/Linux

[ 445.922445][ T2950] Call trace:
[ 445.925582][ T2950] pci_generic_config_read+0x40/0x100
[ 445.930810][ T2950] dw_pcie_rd_other_conf+0x20/0x80
[ 445.935777][ T2950] pci_user_read_config_dword+0x88/0x140
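Since comment #6 isolates the abort to the `config` sysfs file, a sysfs walk that leaves those files out should avoid the config-space read that faults. The sketch below is a hedged illustration of that idea and not a fix for the kernel bug. The function name `list_non_config_files` is hypothetical, and the default path is simply the controller directory from the reduced test case, which only exists on the affected board. It has not been established in this report that every other attribute under that tree is safe to read.

```shell
#!/bin/sh
# Sketch only: list regular files under a sysfs subtree while leaving out
# the PCI "config" binary attributes, so that this listing itself never
# opens or reads PCI config space.
# list_non_config_files is a hypothetical helper name; the default root is
# the path used in the reduced test case from comment #6.
list_non_config_files() {
    root=${1:-/sys/devices/platform/soc/3500000.pcie}
    # "! -name config" drops every file named exactly "config";
    # errors (for example a missing root) are silenced.
    find "$root" -type f ! -name config 2>/dev/null | sort
}
```

The output of `list_non_config_files` could then be fed to the same `cat` loop as in the reduced test case to check whether any attribute other than `config` also triggers the abort on a given board.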