[Bug 1165535] New: System lockup under memory pressure using encrypted swap on SSD (failed command: WRITE FPDMA QUEUED)
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535 Bug ID: 1165535 Summary: System lockup under memory pressure using encrypted swap on SSD (failed command: WRITE FPDMA QUEUED) Classification: openSUSE Product: openSUSE Distribution Version: Leap 15.1 Hardware: Other OS: Other Status: NEW Severity: Major Priority: P5 - None Component: Kernel Assignee: kernel-maintainers@forge.provo.novell.com Reporter: Ulrich.Windl@rz.uni-regensburg.de QA Contact: qa-bugs@suse.de Found By: --- Blocker: --- On a system that uses an SSD for /boot and encrypted swap only (the rest is on a traditional hard disk), I had the following problem: I was working normally with the system, when I decided to launch "yast2 online_update". Unfortunately the system became terribly slow (practically not responding for minutes), so I could not do anything than press the Reset-button after having waited about 30 minutes. After reboot I see "odd" ATA errors related to the SSD. However the SSD is rather new and in good condition. So I guess the issue has to do with memory pressure. Here are the logs when things started to become slow: Mar 03 10:09:57 pc dbus-daemon[2706]: [session uid=1000 pid=2706] Successfully activated service 'org.gnome.evince.Daemon' Mar 03 10:09:57 pc systemd[2686]: Starting Evince document viewer... Mar 03 10:09:57 pc systemd[2686]: Started Evince document viewer. Mar 03 10:10:13 pc kernel: ata2: log page 10h reported inactive tag 5 Mar 03 10:10:14 pc kernel: ata2.00: exception Emask 0x1 SAct 0x7fffffdf SErr 0x0 action 0x0 Mar 03 10:10:14 pc kernel: ata2.00: irq_stat 0x40000009 Mar 03 10:10:14 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED Mar 03 10:10:14 pc kernel: ata2.00: cmd 61/08:00:e0:b0:7d/00:00:00:00:00/40 tag 0 ncq dma 4096 out res 40/00:00:e0:b0:7d/00:00:00:00:00/40 Emask 0x1 (device error) Mar 03 10:10:14 pc kernel: ata2.00: status: { DRDY } Mar 03 10:10:14 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED Mar 03 10:10:14 pc kernel: ata2.00: cmd 61/08:08:e8:b0:7d/00:00:00:00:00/40 tag 1 ncq dma 4096 out res 40/00:00:e0:b0:7d/00:00:00:00:00/40 Emask 0x1 (device error) Mar 03 10:10:14 pc kernel: ata2.00: status: { DRDY } Mar 03 10:10:14 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED Mar 03 10:10:14 pc kernel: ata2.00: cmd 61/08:10:f0:b0:7d/00:00:00:00:00/40 tag 2 ncq dma 4096 out res 40/00:00:e0:b0:7d/00:00:00:00:00/40 Emask 0x1 (device error) Mar 03 10:10:14 pc kernel: ata2.00: status: { DRDY } [...] Mar 03 10:10:14 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED Mar 03 10:10:14 pc kernel: ata2.00: cmd 61/08:f0:d8:b0:7d/00:00:00:00:00/40 tag 30 ncq dma 4096 out res 40/00:00:e0:b0:7d/00:00:00:00:00/40 Emask 0x1 (device error) Mar 03 10:10:14 pc kernel: ata2.00: status: { DRDY } Mar 03 10:10:14 pc kernel: ata2.00: supports DRM functions and may not be fully accessible Mar 03 10:10:14 pc kernel: ata2.00: supports DRM functions and may not be fully accessible Mar 03 10:10:14 pc kernel: ata2.00: configured for UDMA/133 Mar 03 10:10:14 pc kernel: ata2: EH complete Mar 03 10:10:14 pc kernel: ata2.00: Enabling discard_zeroes_data Mar 03 10:10:14 pc rsyslogd[1332]: message repeated 13 times: [-- MARK --] Mar 03 10:10:14 pc rsyslogd[1332]: action 'action 1' suspended (module 'builtin:ompipe'), retry 0. There should be messages before this one giving the reason for suspension. [v8.33.1 try http://www.rsyslog.com/e/2007 ] Mar 03 10:10:14 pc rsyslogd[1332]: action 'action 1' suspended (module 'builtin:ompipe'), next retry is Tue Mar 3 10:10:44 2020, retry nbr 0. There should be messages before this one giving the reason for suspension. [v8.33.1 try http://www.rsyslog.com/e/2007 ] [...] Mar 03 10:15:48 pc /usr/lib/gdm/gdm-x-session[2700]: (II) event2 - Logitech USB-PS/2 Optical Mouse: SYN_DROPPED event - some input events have been lost. Mar 03 10:16:17 pc /usr/lib/gdm/gdm-x-session[2700]: (II) event2 - Logitech USB-PS/2 Optical Mouse: SYN_DROPPED event - some input events have been lost. [...] Mar 03 10:25:13 pc systemd-journald[30532]: Journal started Mar 03 10:25:18 pc systemd-journald[30532]: System journal (/var/log/journal/89c660865c00403a9bacef32b6828556) is 856.0M, max 819.2M, 0B free. [...] Mar 03 10:25:19 pc kernel: ata2.00: exception Emask 0x50 SAct 0x7f000007 SErr 0x800 action 0x6 frozen Mar 03 10:25:19 pc kernel: ata2.00: irq_stat 0x08000000, interface fatal error Mar 03 10:25:19 pc kernel: ata2: SError: { HostInt } Mar 03 10:25:19 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED Mar 03 10:25:19 pc kernel: ata2.00: cmd 61/c8:00:38:06:78/01:00:00:00:00/40 tag 0 ncq dma 233472 out res 40/00:c0:00:00:72/00:00:00:00:00/40 Emask 0x50 (ATA bus error) Mar 03 10:25:19 pc kernel: ata2.00: status: { DRDY } [...] Mar 03 10:25:19 pc kernel: ata2.00: failed command: WRITE FPDMA QUEUED Mar 03 10:25:19 pc kernel: ata2.00: cmd 61/40:f0:f8:00:78/05:00:00:00:00/40 tag 30 ncq dma 688128 out res 40/00:c0:00:00:72/00:00:00:00:00/40 Emask 0x50 (ATA bus error) Mar 03 10:25:19 pc kernel: ata2.00: status: { DRDY } Mar 03 10:25:19 pc kernel: ata2: hard resetting link Mar 03 10:25:19 pc kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Mar 03 10:25:19 pc kernel: ata2.00: supports DRM functions and may not be fully accessible Mar 03 10:25:19 pc kernel: ata2.00: supports DRM functions and may not be fully accessible Mar 03 10:25:19 pc kernel: ata2.00: configured for UDMA/133 Mar 03 10:25:19 pc kernel: ata2: EH complete Mar 03 10:25:19 pc kernel: ata2.00: Enabling discard_zeroes_data [...] Mar 03 10:25:19 pc systemd-coredump[30527]: Process 28448 (systemd-journal) of user 0 dumped core. Mar 03 10:25:19 pc systemd-coredump[30527]: Coredump diverted to /var/lib/systemd/coredump/core.systemd-journal.0.92bd8e3e91344ef998c2c08ade695cd2.28448.1583227313000000.lz4 Mar 03 10:25:19 pc systemd-coredump[30527]: Stack trace of thread 28448: Mar 03 10:25:19 pc systemd-coredump[30527]: #0 0x00007f52360114f5 journal_file_move_to_object (libsystemd-shared-234.so) Mar 03 10:25:19 pc systemd-coredump[30527]: #1 0x00007f5236012f00 n/a (libsystemd-shared-234.so) Mar 03 10:25:19 pc systemd-coredump[30527]: #2 0x00007f5236013548 n/a (libsystemd-shared-234.so) Mar 03 10:25:19 pc systemd-coredump[30527]: #3 0x00007f523601470d journal_file_append_entry (libsystemd-shared-234.so) [...] Mar 03 10:25:54 pc systemd-journald[30532]: Journal stopped Mar 03 10:26:38 pc systemd-journald[30532]: Received SIGTERM from PID 1 (systemd). Mar 03 10:26:38 pc systemd-journald[30545]: Journal started Mar 03 10:26:38 pc systemd-journald[30545]: System journal (/var/log/journal/89c660865c00403a9bacef32b6828556) is 856.0M, max 819.2M, 0B free. [...]
From this point on I did not see any more ATA errors, but the system was still unusable, so I hit Reset:
Mar 03 10:55:58 pc kernel: SFW2-INext-DROP-DEFLT IN=eth0 OUT= MAC=00:22:11:dd:00:11:a0:36:9f:b2:3f:25:08:00 SRC=1.9.154.170 1.9.5.2 LEN=44 TOS=0x00 PREC=0x00 TTL=5 6 ID=52696 PROTO=TCP SPT=47944 DPT=655 WINDOW=1024 RES=0x00 SYN URGP=0 OPT (020405B4) -- Reboot -- Mar 03 11:05:24 linux-abcd kernel: Linux version 4.12.14-lp151.28.36-default (geeko@buildhost) (gcc version 7.4.1 20190905 [gcc-7-branch revision 275407] (SUSE Linux) ) #1 SMP Fri Dec 6 13:50:27 UTC 2019 (8f4a495) Final note: The IP address and related identifyable information in the log has been anonymized for privacy. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535#c1
Hannes Reinecke
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535#c2
Ulrich Windl
Can you give some more details on how you configured 'encrypted swap' ? dm-crypt? Hardware encryption?
It's a LUKS volume via /etc/crypttab: cr_swap /dev/disk/by-id/ata-Samsung_SSD_860_EVO_500GB_S4YAMG8N878778W-part2 none discard,swap
Would it be possible use 'normal' swap and check if the error persists?
I think it's hard enough to reproduce the memory pressure (it that was the trigger at all). Maybe even the encrypted swap does not matter; maybe the discard option matters; I don't really know. Also I had been using this setup for a long time. Only a few weeks ago I replaced a dying harddisk with the SSD, adjusting the setup. So far it was the first time that this problem had occurred. (the advantage of the old and noisy Seagate disk was that you could clearly hear when the system started paging, explaining the slowness of the system ;-) With SSD you hardly experience a major slowdown when the system begins paging) Maybe "ata2: log page 10h reported inactive tag 5" indicates that something's wrong in the kernel. That was the message after which everything began. I can exclude memory errors, as the system uses an old AMD Phenom II CPU with ECC RAM that is absolutely "rock solid stable". -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535#c4
--- Comment #4 from Ulrich Windl
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535#c9
--- Comment #9 from Ulrich Windl
@Dear Ulrich: could you please see your system, specifically what controller there is? Perhaps we are hitting the same issue here?
(from lspci)
00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI]
SB7x0/SB8x0/SB9x0 SATA Controller [IDE mode]
00:14.1 IDE interface: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0
IDE Controller
# lspci -nn -vv -s 0:14.1
00:14.1 IDE interface [0101]: Advanced Micro Devices, Inc. [AMD/ATI]
SB7x0/SB8x0/SB9x0 IDE Controller [1002:439c] (prog-if 8a [ISA Compatibility
mode controller, supports both channels switched to PCI native mode, supports
bus mastering])
Subsystem: Gigabyte Technology Co., Ltd Device [1458:5002]
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- Since the last comment, have you observed the same problem again? No. Maybe it was fixed in the kernel.
--
You are receiving this mail because:
You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535
Ulrich Windl
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535#c10
--- Comment #10 from Ulrich Windl
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535#c12
--- Comment #12 from Ulrich Windl
Ulrich, Leap 15.2 was released meanwhile with 5.3 kernel, so things may change with that.
Actually I have upgraded from Leap 15.1 to 15.2 at the beginning of this month. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535#c13
--- Comment #13 from Ulrich Windl
Frankly, I am not sure what to do here. It would be too easy to blame it on HW incompatibility, which Sebastian mentioned.
Counter-argument: On a completely different and new hardware (Machine with four Xeon Gold (72 cores, 144 threads, > 700GB RAM) I had a kernel "freeze" while performing some memory load test that just uses a few processes that shuffle process-local memory around. The machine in question only had SSDs, configured as two RAIDs via PERC: RAID1 for base OS, and RAID6 for additional data. When performing the test without having any swap configured, the kernel started to use the OOM killer to terminate a few processes. After having configured 5GB swap @ prio=4 and >700GB swap @prio=2 the kernel froze shortly after having used swap. The special thing about the new setup was that the bug swap was configured via LVM and a thin volume. The kernel in question was 5.3.18-24.15-default (actually from SLES15 SP2), and there was plenty of swap space available when this had happened. Interestingly the kernel still responded to PING, but that was all: No VT-switching, not console getty, no ssh connection, no reaction on existing SSH connections. However I had a "top" with sleep time 1s running at the moment when the kernel froze. I gave it more than 10 minutes to recover, but nothing had happened and there was no message in /var/log/messages after the date top last had refreshed. -- You are receiving this mail because: You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535#c14
--- Comment #14 from Ulrich Windl
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535
http://bugzilla.opensuse.org/show_bug.cgi?id=1165535#c15
Nikolai Nikolaevskii
participants (2)
-
bugzilla_noreply@novell.com
-
bugzilla_noreply@suse.com