Comment # 9 on bug 1180917 from LTC BugProxy

------- Comment From geraldsc@de.ibm.com 2021-01-19 09:45 EDT-------
(In reply to comment #13)
[...]
> > Is there any other means of kernel source access for openSUSE Tumbleweed,
> > ideally a git repo like for SLES? Seems hard to believe that "open"SUSE
> > kernel source is harder to find / access than SLES code...
>
> It is not hard to find at all. All you need to know is, that the different
> flavors of kernels all depend on a central package called kernel-source,
> which has an own mechanics to integrate patches depending on a variety of
> conditions. The source can be found in the package
> http://download.opensuse.org/ports/zsystems/tumbleweed/repo/oss/noarch/
> kernel-source-5.10.5-1.1.noarch.rpm

Hmm, almost right, you also need to know that the kernel-source.noarch.rpm does
not contain the full source, and that you need the kernel-devel.noarch.rpm on
top (e.g. for arch code). This is the same for SLES, so fortunately I knew it,
but of course for SLES it is much less annoying because you also have a proper
public git for both kernel source tree and also kernel-source src.rpm content,
so you don't really need to bother about knowing which rpms contain what...

Anyway, back to the bug, from the kernel source for 5.10.5-1-default I see that
it happens on the BUG_ON(!pte_none(*pte)) in __split_huge_pmd_locked(). This is
very strange / interesting, because those are the ptes from the pre-allocated
and deposited pagetable, which was withdrawn just shortly before that BUG_ON,
with pgtable_trans_huge_withdraw().

The pre-allocated pagetables are initialized with empty (invalid) ptes before
deposit, so they should of course all (still) be pte_none() after withdrawal.
If a pte is !pte_none, then this means that either the pre-allocated pagetable
got corrupted while it was deposited, or maybe that
pgtable_trans_huge_withdraw() returns something that is not really a pagetable
at all.

E.g. in theory it could return NULL if there were more withdrawals than
deposits, if I understand the list handling code there correctly. Of course,
such a thing should never happen (i.e. it would be a bug), but I am a bit
confused why the common code does not also guard against it with a BUG_ON.
Having a system dump could help to see more of what was going on. Any chance
that kdump generated a dump after the BUG_ON?
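
To make the two failure modes above easier to tell apart, here is a minimal
user-space sketch in C. It is not kernel code: the names inv_pte_t,
inv_first_used() and inv_check_withdrawn() are made up for illustration, and
an empty entry is modelled as 0, which is an assumption of this model and not
the real s390 encoding. It only shows the check one would want on a withdrawn
table: "no table at all" versus "a real table with a filled entry".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model only: 256 entries per table; 0 stands in for "pte_none" here. */
#define INV_PTRS_PER_PTE 256
#define INV_PTE_NONE     UINT64_C(0)

typedef uint64_t inv_pte_t;

enum inv_result {
    INV_OK,      /* real table, every entry still empty */
    INV_NULL,    /* withdraw handed back no table at all */
    INV_CORRUPT  /* real table, but some entry was filled */
};

/* Index of the first non-empty entry, or -1 if the table is all empty. */
int inv_first_used(const inv_pte_t *table)
{
    assert(table != NULL);
    for (int i = 0; i < INV_PTRS_PER_PTE; i++)
        if (table[i] != INV_PTE_NONE)
            return i;
    return -1;
}

/* Classify what a withdraw returned, before any entry is written. */
enum inv_result inv_check_withdrawn(const inv_pte_t *table)
{
    if (table == NULL)
        return INV_NULL;
    return inv_first_used(table) < 0 ? INV_OK : INV_CORRUPT;
}
```

In a dump, the equivalent of inv_first_used() would also tell how many
entries of the withdrawn table were filled, not just the first one that the
BUG_ON tripped over.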

From the backtrace and register output, and a kernel disassembly, one can at
least see that %r1 holds the pte value that failed the pte_none() check:
00000000b2b40215. This actually looks like a valid pte, with present /
young / read / write-protect set, so one could assume that this is not the
"NULL returned" case, but rather really a pre-allocated pagetable, which
somehow got corrupted by someone having it in active access and filling it with
valid ptes.
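
The decoding of that value can be reproduced with a few lines of user-space C.
The bit values below are an assumption, taken from the s390 definitions in
arch/s390/include/asm/pgtable.h as far as I know them, and should be checked
against the actual 5.10 tree; the DEC_ names and helper functions are
hypothetical. They are at least consistent with the present / young / read /
write-protect reading given above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed s390 pte bits (verify against the kernel tree before relying on them). */
#define DEC_PAGE_PRESENT UINT64_C(0x001) /* software: present */
#define DEC_PAGE_YOUNG   UINT64_C(0x004) /* software: young (referenced) */
#define DEC_PAGE_DIRTY   UINT64_C(0x008) /* software: dirty */
#define DEC_PAGE_READ    UINT64_C(0x010) /* software: read */
#define DEC_PAGE_WRITE   UINT64_C(0x020) /* software: write */
#define DEC_PAGE_PROTECT UINT64_C(0x200) /* hardware: write-protect */
#define DEC_PAGE_INVALID UINT64_C(0x400) /* hardware: invalid */
#define DEC_FLAG_MASK    UINT64_C(0xfff) /* low 12 bits carry the flags */

/* Flag bits of a pte value. */
uint64_t dec_pte_flags(uint64_t pte)
{
    return pte & DEC_FLAG_MASK;
}

/* Page frame address of a pte value (everything above the flag bits). */
uint64_t dec_pte_frame(uint64_t pte)
{
    return pte & ~DEC_FLAG_MASK;
}

/* Decode the value seen in %r1 and check it against the reading above. */
void dec_check_reported_pte(void)
{
    const uint64_t pte = UINT64_C(0x00000000b2b40215);

    assert(dec_pte_flags(pte) == (DEC_PAGE_PRESENT | DEC_PAGE_YOUNG |
                                  DEC_PAGE_READ | DEC_PAGE_PROTECT));
    assert(!(pte & DEC_PAGE_INVALID)); /* not an empty/invalid entry */
    assert(!(pte & DEC_PAGE_WRITE));   /* read-only mapping */
    assert(!(pte & DEC_PAGE_DIRTY));
    assert(dec_pte_frame(pte) == UINT64_C(0xb2b40000));
}
```

Under these assumptions the flag part is 0x215 and the frame address is
0xb2b40000, i.e. the entry looks like a read-only, referenced mapping of a real
page rather than leftover list pointers or random data.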

Of course, such a thing should also never happen: the pre-allocated and
deposited pagetables cannot be used until they are withdrawn. Very strange. We
do actually have our own implementation of the deposit/withdraw functions on
s390, because we cannot use the generic versions. On s390, we have 2K
pagetables, and two of them within one 4K page, so we cannot use the generic
logic that operates on struct pages for (4K) pagetable pages. The pgtable_t is
therefore also not a struct page on s390, but rather a direct pointer to the
pagetable. For maintaining the list of pre-allocated pagetables, we put a
list_head directly into the pagetables, at the beginning, instead of using
page->lru of the struct page associated with the pagetable like it is done in
the generic case. Then, on withdraw, and after list_del, the first two ptes
are cleared so that the list_head gets overwritten and the whole pagetable
should be empty again.
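
The scheme described above can be sketched in user-space C as follows. This is
a simplified model and not the code from arch/s390/mm: all sk_ names are
hypothetical, struct sk_mm stands in for the per-mm deposit pointer, an empty
pte is modelled as just the invalid bit 0x400 (an assumption), the model
assumes 64-bit pointers so that the list_head covers exactly two ptes, and the
NULL guard in sk_withdraw() is an addition of the model. Locking is left out
entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PTRS_PER_PTE 256              /* 256 * 8 bytes = one 2K table */
#define SK_PTE_INVALID  UINT64_C(0x400)  /* assumption: encoding of an empty pte */

typedef uint64_t sk_pte_t;

struct sk_list_head { struct sk_list_head *next, *prev; };

/* The list_head overlays pte[0] and pte[1] while the table is deposited. */
union sk_pgtable {
    sk_pte_t pte[SK_PTRS_PER_PTE];
    struct sk_list_head lh;
};

struct sk_mm { union sk_pgtable *huge_pte; };  /* most recently deposited table */

void sk_table_init(union sk_pgtable *t)
{
    for (int i = 0; i < SK_PTRS_PER_PTE; i++)
        t->pte[i] = SK_PTE_INVALID;
}

int sk_table_is_empty(const union sk_pgtable *t)
{
    for (int i = 0; i < SK_PTRS_PER_PTE; i++)
        if (t->pte[i] != SK_PTE_INVALID)
            return 0;
    return 1;
}

void sk_deposit(struct sk_mm *mm, union sk_pgtable *t)
{
    struct sk_list_head *lh = &t->lh;

    assert(sizeof(struct sk_list_head) == 2 * sizeof(sk_pte_t));
    if (!mm->huge_pte) {
        lh->next = lh->prev = lh;        /* first table: single-element ring */
    } else {
        struct sk_list_head *head = &mm->huge_pte->lh;
        lh->next = head->next;           /* insert right after the old head */
        lh->prev = head;
        head->next->prev = lh;
        head->next = lh;
    }
    mm->huge_pte = t;
}

union sk_pgtable *sk_withdraw(struct sk_mm *mm)
{
    union sk_pgtable *t = mm->huge_pte;
    struct sk_list_head *lh;

    if (!t)
        return NULL;                     /* more withdrawals than deposits */
    lh = &t->lh;
    if (lh->next == lh) {
        mm->huge_pte = NULL;             /* that was the last deposited table */
    } else {
        mm->huge_pte = (union sk_pgtable *)lh->next;
        lh->prev->next = lh->next;       /* unlink, like list_del */
        lh->next->prev = lh->prev;
    }
    t->pte[0] = SK_PTE_INVALID;          /* overwrite the list_head again */
    t->pte[1] = SK_PTE_INVALID;
    return t;
}
```

In this model a table is only non-empty in its first two entries while it is
deposited, and it is fully empty again after sk_withdraw(). A valid-looking
pte in any other slot would therefore have to come from outside the
deposit/withdraw logic.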

That is at least suspicious, and it could explain why you only see this
on s390 (do you really?). However, I do not really see how our implementation
would change anything that allows the deposited pagetables to be modified
before they are withdrawn. It is really the same logic as in generic code, only
that we put the list_head somewhere else. I still suspect some race in common
code, e.g. some concurrent withdrawals w/o proper locking, but I could not yet
find anything suspicious in the code...