------- Comment From geraldsc@de.ibm.com 2021-06-16 10:33 EDT-------
(In reply to comment #19)
> the STATEFILE.gz is a qemu migrate-to-file so it's a full dump of the state
> of the VM including memory.
>
> actually looking at a few of these stuck VMs the first BUG message printed
> is not even a "Bad page state" but always seems to be "Bad rss-counter state"
>
> llvm11 build:
> [ 3049s] [ 3032.903705] BUG: Bad rss-counter state mm:0000000077185176
> type:MM_FILEPAGES val:-432
> [11742s] [11725.874403] BUG: Bad page state in process clang++ pfn:1e0a01
> [11742s] [11725.874961] BUG: Bad page state in process clang++ pfn:1e0a02
> [11742s] [11725.876569] BUG: Bad page state in process clang++ pfn:1e0a03
> [11742s] [11725.877561] BUG: Bad page state in process clang++ pfn:1e0a04
>
> mongodb build:
> [ 229s] [ 218.670108] BUG: Bad rss-counter state mm:000000004cfc260c
> type:MM_FILEPAGES val:-256
> [ 315s] [ 304.902152] BUG: Bad page state in process cc1plus pfn:148101
> [ 315s] [ 304.903082] BUG: Bad page state in process cc1plus pfn:148102
> [ 315s] [ 304.904555] BUG: Bad page state in process cc1plus pfn:148103
> [ 315s] [ 304.905273] BUG: Bad page state in process cc1plus pfn:148104
> [ 315s] [ 304.906991] BUG: Bad page state in process cc1plus pfn:148105
>
> cross-arm-none-gcc11-bootstrap:
> [ 320s] [ 281.035581] BUG: Bad rss-counter state mm:00000000a12d99da
> type:MM_FILEPAGES val:-256
> [ 1113s] [ 1074.149241] BUG: Bad page state in process cc1plus pfn:13ff01
> [ 1113s] [ 1074.150288] BUG: Bad page state in process cc1plus pfn:13ff02
> [ 1113s] [ 1074.150718] BUG: Bad page state in process cc1plus pfn:13ff03
> [ 1113s] [ 1074.150884] BUG: Bad page state in process cc1plus pfn:13ff04
>

Again, those messages do not show the complete picture, probably because you only collect certain loglevels. A simple dmesg output, containing _all_ the kernel messages, would be the very least for any kind of kernel problem that you want to report.
A (proper) dump might also not give more information than this if it was not collected at the exact time when the problem occurred, which is of course not possible if the kernel does not panic at that time.

> # A proper dump would contain a vmcore plus a vmlinux with debuginfo.
>
> that would be kdump, but there is no easy way to set this up for our on the
> fly created VMs. The dump that I created was done with qemu migrating the
> stuck VM to a file.

OK, could you please give some instructions on how I would be able to open this with crash for analysis? To my knowledge, I would always need at least the vmlinux with debuginfo, which is missing (or is it?), but I never had to look into qemu dumps.

Or, just forget about that dump, which would be after the fact anyway, and simply provide the dmesg output?

BTW, is this easily reproducible, and can you verify if it also occurs with THP disabled?
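
For the THP question, a minimal sketch of how the active setting could be checked inside the guest. It assumes the standard sysfs file /sys/kernel/mm/transparent_hugepage/enabled, where the active mode is shown in brackets. The helper names are made up for illustration and are not part of any existing tool.

```python
from pathlib import Path
from typing import Optional

# Standard sysfs location of the THP setting on Linux (assumed to be present in the guest).
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> Optional[str]:
    """Return the bracketed (active) mode, e.g. 'never' from 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def current_thp_mode(path: Path = THP_ENABLED) -> Optional[str]:
    """Read the active THP mode, or None if the file cannot be read."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None
```

If current_thp_mode() returns something other than "never", THP can be disabled for a test run by writing "never" to that file as root, or by booting the guest with transparent_hugepage=never on the kernel command line.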