Comment # 18 on bug 1187264 from LTC BugProxy

------- Comment From geraldsc@de.ibm.com 2021-06-16 10:33 EDT-------
(In reply to comment #19)
> the STATEFILE.gz is a qemu migrate-to-file so it's a full dump of the state
> of the VM including memory.
>
> actually looking at a few of these stuck VMs the first BUG message printed
> is not even a "Bad page state" but always seems to be "Bad rss-counter state"
>
> llvm11 build:
> [ 3049s] [ 3032.903705] BUG: Bad rss-counter state mm:0000000077185176
> type:MM_FILEPAGES val:-432
> [11742s] [11725.874403] BUG: Bad page state in process clang++  pfn:1e0a01
> [11742s] [11725.874961] BUG: Bad page state in process clang++  pfn:1e0a02
> [11742s] [11725.876569] BUG: Bad page state in process clang++  pfn:1e0a03
> [11742s] [11725.877561] BUG: Bad page state in process clang++  pfn:1e0a04
>
> mongodb build:
> [  229s] [  218.670108] BUG: Bad rss-counter state mm:000000004cfc260c
> type:MM_FILEPAGES val:-256
> [  315s] [  304.902152] BUG: Bad page state in process cc1plus  pfn:148101
> [  315s] [  304.903082] BUG: Bad page state in process cc1plus  pfn:148102
> [  315s] [  304.904555] BUG: Bad page state in process cc1plus  pfn:148103
> [  315s] [  304.905273] BUG: Bad page state in process cc1plus  pfn:148104
> [  315s] [  304.906991] BUG: Bad page state in process cc1plus  pfn:148105
>
> cross-arm-none-gcc11-bootstrap:
> [  320s] [  281.035581] BUG: Bad rss-counter state mm:00000000a12d99da
> type:MM_FILEPAGES val:-256
> [ 1113s] [ 1074.149241] BUG: Bad page state in process cc1plus  pfn:13ff01
> [ 1113s] [ 1074.150288] BUG: Bad page state in process cc1plus  pfn:13ff02
> [ 1113s] [ 1074.150718] BUG: Bad page state in process cc1plus  pfn:13ff03
> [ 1113s] [ 1074.150884] BUG: Bad page state in process cc1plus  pfn:13ff04
>

Again, those messages do not show the complete picture, probably because you
only collect certain loglevels. A plain dmesg output containing _all_ the
kernel messages would be the very least to provide for any kind of kernel
problem that you want to report.

A (proper) dump might also not give more information than this if it was not
collected at the exact time the problem occurred, which is of course not
possible if the kernel does not panic at that time.

> # A proper dump would contain a vmcore plus a vmlinux with debuginfo.
>
> that would be kdump, but there is no easy way to set this up for our on the
> fly created VMs. The dump that I created was done with qemu migrating the
> stuck VM to a file.

OK, could you please give some instructions on how I could open this with
crash for analysis? To my knowledge, I would always need at least the vmlinux
with debuginfo, which is missing (or is it?), but I have never had to look
into qemu dumps.

Or just forget about that dump, which would be after the fact anyway, and
simply provide the dmesg output?

BTW, is this easily reproducible, and can you verify whether it also occurs
with THP disabled?
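
For the THP check, a minimal sketch in Python that reports which THP mode is
currently active inside the guest, so the setting can be confirmed before a
test run. It assumes the standard sysfs file
/sys/kernel/mm/transparent_hugepage/enabled; the helper name active_thp_mode
is only illustrative and not part of any existing tool.

```python
from pathlib import Path

# Standard sysfs knob for transparent hugepages (assumed to be present in the guest).
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the active mode, i.e. the word shown in [brackets].

    The file content looks like "[always] madvise never".
    Returns None if no bracketed word is found.
    """
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


if __name__ == "__main__":
    if THP_ENABLED.exists():
        print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
    else:
        print("THP sysfs file not found, kernel may be built without THP")
```

If the reported mode is not "never", THP can be disabled at runtime by writing
"never" to that file as root, or at boot with transparent_hugepage=never on
the kernel command line.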