[Bug 1188896] "BUG: Bad rss-counter state mm:00000000e555b579 type:MM_FILEPAGES val:-256" during build of cross-epiphany-gcc11-bootstrap
https://bugzilla.suse.com/show_bug.cgi?id=1188896
https://bugzilla.suse.com/show_bug.cgi?id=1188896#c19
--- Comment #19 from LTC BugProxy
FYI, Vlastimil found an interesting potential race issue with s390 pagetable handling, see LTC bug#177879 / SUSE bug#1136513.
This could well explain some long-standing issues that we see from time to time, including the THP issues you previously found with openSUSE, and also this one here.
BTW, a kernel dump would probably not be of much help here. Again there is no panic, so instead of a dump taken directly at the point where the problem happens, we would only get a dump after the fact.
So, for now, proper kernel dmesg output, including all the messages from the start, would be enough. I strongly suspect some relation to the pagetable race, which can have all sorts of strange impacts.
Regarding NEEDINFO, I honestly have no clue where to start here. So far, we have not even seen a complete kernel message log. There is some output from a dump, but that appears to be from long after the fact, when the system seems to be hanging in an "rcu_sched detected stalls" loop, so the message buffer had already overflowed and no messages from the beginning of the problem remain.
I also see no immediate relation to s390 here, so it might be worth having a common code expert from SUSE take a look, who might also have easier access to a dump. However, since there is no panic involved, any dump taken after the fact might not give any useful information (apart from maybe the complete kernel message log, if it has not overflowed yet).
I can only grasp at straws here, which is my comment above regarding the potential race in the s390 pagetable handling code. In the referenced bugzilla, it eventually turned out to be a different issue, the missing commit fc8efd2ddfed3 ("mm/memory.c: do_fault: avoid usage of stale vm_area_struct"), and not the theoretical race that Vlastimil found. That commit is already included in your openSUSE kernel version.
However, that race should still exist, and it could explain all sorts of strange issues. But it has been present since like forever and has never shown up anywhere so far, so I actually have very little hope that it is related here, yet it is a straw to grasp.
We are currently implementing / reviewing a patch for that, and it will have a stable tag, so it would end up in openSUSE eventually. I will also update here when it is upstream, but that might take some time because we'd like to give it thorough testing first, as it touches delicate code. I could also attach it here before it is upstream, if adding it to your openSUSE kernel builds for testing would be an option.