[Bug 1220541] New: kexec does a full reboot with kernel 6.7.6-1.1
https://bugzilla.suse.com/show_bug.cgi?id=1220541

            Bug ID: 1220541
           Summary: kexec does a full reboot with kernel 6.7.6-1.1
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: x86-64
                OS: openSUSE Tumbleweed
            Status: NEW
          Severity: Major
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-bugs@opensuse.org
          Reporter: me@pavinjoseph.com
        QA Contact: qa-bugs@suse.de
  Target Milestone: ---
          Found By: ---
           Blocker: ---

Tested on two identical systems, both with kexec-tools 2.0.27-3.2: kernel 6.7.5-1.1 on the working system and kernel 6.7.6-1.1 on the faulty system.

The faulty system (fully updated as of Feb 28 2024) is on Tumbleweed release 20240226. The working system is on release 20240222.

I reproduced the issue on the working system by running zypper ref and zypper in kernel-default to upgrade only the kernel to the latest version. The issue appears after the next cold boot and persists.

-- 
You are receiving this mail because:
You are the assignee for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1220541

Pavin Joseph <me@pavinjoseph.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P5 - None                   |P2 - High
https://bugzilla.suse.com/show_bug.cgi?id=1220541#c3

--- Comment #3 from Pavin Joseph <me@pavinjoseph.com> ---
I did more testing today.

Kexec reboot works normally on kernels:
  6.7.4
  6.4.0 ALP kernel (https://download.opensuse.org/repositories/Kernel:/ALP-current/standard/x86_...)

Kexec reboot does a firmware reboot on kernels:
  6.7.6
  6.6.18 longterm kernel (https://download.opensuse.org/update/slowroll/repo/oss/x86_64/kernel-longter...)

Let me know if there's anything else I can provide for troubleshooting.
https://bugzilla.suse.com/show_bug.cgi?id=1220541#c5

--- Comment #5 from Pavin Joseph <me@pavinjoseph.com> ---
@Takashi Thanks for the references. I'm quite out of my depth here with building kernels and reporting bugs straight to kernel.org. Guess I better get learning 🤓
https://bugzilla.suse.com/show_bug.cgi?id=1220541#c9

--- Comment #9 from Pavin Joseph <me@pavinjoseph.com> ---
Hi there,

I did the full bisection and found the culprit. I didn't quite understand the whole procedure until reading this [0] guide.

The issue is reproduced on mainline and on current stable 6.7.7. I submitted a response upstream detailing all this.

I hope it's fixed soon. Is there anything I can do to improve testing for kexec bugs using openQA or OBS? This bug found its way into kernel-longterm as well. Since kexec is a feature I use almost every day (my personal machine's firmware is quite slow), it's quite concerning that no one caught this in testing.

Bisection logs:

git bisect start
# status: waiting for both good and bad commits
# good: [004dcea13dc10acaf1486d9939be4c793834c13c] Linux 6.7.5
git bisect good 004dcea13dc10acaf1486d9939be4c793834c13c
# status: waiting for bad commit, 1 good commit known
# bad: [b631f5b445dc3379f67ff63a2e4c58f22d4975dc] Linux 6.7.6
git bisect bad b631f5b445dc3379f67ff63a2e4c58f22d4975dc
# good: [00c48bfbd6b29b8ebf64edd059dbf9e95cedd5b1] misc: fastrpc: Mark all sessions as invalid in cb_remove
git bisect good 00c48bfbd6b29b8ebf64edd059dbf9e95cedd5b1
# bad: [6e85c91e7d63e46de1b4a0cb90212356da8a41cb] io_uring/net: fix multishot accept overflow handling
git bisect bad 6e85c91e7d63e46de1b4a0cb90212356da8a41cb
# good: [fe32ecf2e66f069230628e8917d26911c5fb2482] eventfs: Restructure eventfs_inode structure to be more condensed
git bisect good fe32ecf2e66f069230628e8917d26911c5fb2482
# good: [f385565bd76b581a83b62a5b6f88ea6f149f8b83] ring-buffer: Clean ring_buffer_poll_wait() error return
git bisect good f385565bd76b581a83b62a5b6f88ea6f149f8b83
# good: [992c8a5f10f81af32c3272c200fc003fb7450401] powerpc/64: Set task pt_regs->link to the LR value on scv entry
git bisect good 992c8a5f10f81af32c3272c200fc003fb7450401
# good: [d79adbe1cd67bc76608e036ee2f98b71c083d9ce] x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6
git bisect good d79adbe1cd67bc76608e036ee2f98b71c083d9ce
# good: [fa2b524a73545d25ae15e3d2930b9bfa83b40827] KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu
git bisect good fa2b524a73545d25ae15e3d2930b9bfa83b40827
# bad: [7143c5f4cf2073193eb27c9cdb84fd4655d1802d] x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
git bisect bad 7143c5f4cf2073193eb27c9cdb84fd4655d1802d
# good: [6d10c8c5abd1437dcbc209e307d930da60b86e91] KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
git bisect good 6d10c8c5abd1437dcbc209e307d930da60b86e91
# first bad commit: [7143c5f4cf2073193eb27c9cdb84fd4655d1802d] x86/mm/ident_map: Use gbpages only where full GB page should be mapped.

Culprit:

7143c5f4cf2073193eb27c9cdb84fd4655d1802d is the first bad commit
commit 7143c5f4cf2073193eb27c9cdb84fd4655d1802d
Author: Steve Wahl <steve.wahl@hpe.com>
Date:   Fri Jan 26 10:48:41 2024 -0600

    x86/mm/ident_map: Use gbpages only where full GB page should be mapped.

    commit d794734c9bbfe22f86686dc2909c25f5ffe1a572 upstream.

    When ident_pud_init() uses only gbpages to create identity maps, large
    ranges of addresses not actually requested can be included in the
    resulting table; a 4K request will map a full GB. On UV systems, this
    ends up including regions that will cause hardware to halt the system
    if accessed (these are marked "reserved" by BIOS). Even processor
    speculation into these regions is enough to trigger the system halt.

    Only use gbpages when map creation requests include the full GB page
    of space. Fall back to using smaller 2M pages when only portions of a
    GB page are included in the request.

    No attempt is made to coalesce mapping requests. If a request requires
    a map entry at the 2M (pmd) level, subsequent mapping requests within
    the same 1G region will also be at the pmd level, even if adjacent or
    overlapping such requests could have been combined to map a full
    gbpage. Existing usage starts with larger regions and then adds
    smaller regions, so this should not have any great consequence.

    [ dhansen: fix up comment formatting, simplifty changelog ]

    Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 arch/x86/mm/ident_map.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

[0]: https://www.leemhuis.info/files/misc/How%20to%20bisect%20a%20Linux%20kernel%...
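The page-size rule described in the culprit commit's changelog can be illustrated with a small model. This is only a sketch in Python, not the kernel's C code from arch/x86/mm/ident_map.c: the function name plan_identity_map, its parameters, and the tuple layout are all hypothetical, and it assumes that a partial chunk is covered by 2M pages rounded outward to 2M boundaries. It also ignores the "no coalescing" behaviour and any existing mappings.

```python
GB = 1 << 30   # size of a 1 GiB page ("gbpage")
MB2 = 1 << 21  # size of a 2 MiB page


def plan_identity_map(addr, end, gbpages_supported=True):
    """Model which page sizes would cover the request [addr, end).

    Returns a list of (page_size, start_address, page_count) tuples.
    Hypothetical helper for illustration; not a kernel interface.
    """
    plan = []
    while addr < end:
        # End of the current 1 GiB-aligned chunk, clipped to the request.
        next_addr = min((addr & ~(GB - 1)) + GB, end)
        # A gbpage is used only if the request covers the whole GiB chunk,
        # i.e. both the start and the end of the chunk are GiB-aligned.
        covers_full_gb = addr % GB == 0 and next_addr % GB == 0
        if gbpages_supported and covers_full_gb:
            plan.append(("1G", addr, 1))
        else:
            # Fall back to 2M pages; assumption: round outward to 2M.
            start = addr & ~(MB2 - 1)
            stop = (next_addr + MB2 - 1) & ~(MB2 - 1)
            plan.append(("2M", start, (stop - start) // MB2))
        addr = next_addr
    return plan
```

Under this model, a 4K request such as plan_identity_map(0x1000, 0x2000) yields a single 2M page rather than a full GB, which is the change the changelog describes. A request that straddles GiB boundaries gets 2M pages at the partial ends and a gbpage only for the fully covered GiB in the middle.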
https://bugzilla.suse.com/show_bug.cgi?id=1220541#c16

--- Comment #16 from Pavin Joseph <me@pavinjoseph.com> ---
(In reply to Jiri Slaby from comment #15)
> As I understand the thread, you have not tried the latest mainline. So is this reproducible with 6.8-rc*?

I have reproduced the issue on mainline and on current stable (6.7.7), and I did a full git bisection between the last known good version, 6.7.5, and the first known bad version, 6.7.6. Reverting the culprit commit on mainline fixed the issue.

https://lore.kernel.org/regressions/fe72c912-f1a0-4a53-88ab-b85e8c3f7bd9@pav...
https://bugzilla.suse.com/show_bug.cgi?id=1220541#c18

--- Comment #18 from Pavin Joseph <me@pavinjoseph.com> ---
(In reply to Jiri Slaby from comment #17)
> Note 6.7 is *not* mainline. That's why I asked for testing 6.8-rc*.

Jiri, yes, I understand 😉. I tested with 6.8-rc (the one Linus maintains) and the issue could be reproduced in it.

I followed the updated docs [0]. Its steps go through mainline (6.8-rc*), stable (6.7.7), and only then does it begin the bisection. The final step for validation is to revert the identified culprit commit on mainline.

[0]: https://www.leemhuis.info/files/misc/How%20to%20bisect%20a%20Linux%20kernel%...
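The bisection step itself is a binary search between a known good and a known bad commit. The following is a minimal sketch in Python of that idea, assuming a strictly linear history and a single point where the behaviour flips; real git bisect works on the commit graph and is driven by the git bisect good/bad commands. The function name find_first_bad and the is_bad callback are hypothetical; in practice "is_bad" means building the kernel at that commit, booting it, and checking whether kexec falls back to a firmware reboot.

```python
def find_first_bad(commits, is_bad):
    """Return the first bad commit in a linear history.

    commits: list ordered oldest to newest; commits[0] is assumed to be
             known good and commits[-1] known bad (neither is re-tested).
    is_bad:  callable that tests one commit and returns True if it fails.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    good, bad = 0, len(commits) - 1
    # Invariant: commits[good] is good, commits[bad] is bad.
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad]
```

Each round halves the remaining range, so the number of kernels to build and test grows only with the logarithm of the number of commits between the good and bad versions. Reverting the commit this search returns and retesting, as the guide's final step does, confirms that it really is the culprit.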
https://bugzilla.suse.com/show_bug.cgi?id=1220541#c19

Pavin Joseph <me@pavinjoseph.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #19 from Pavin Joseph <me@pavinjoseph.com> ---
Kexec has been fixed in kernel 6.8.5 and LTS kernel 6.6.26. Thank you for everyone's help 😄