nvme nvme0: frozen state error detected, reset controller

Hi,

I have an annoying issue with all of the recent kernels that come with Leap 15.1 and 15.2: my SSD becomes read-only after a while. I can reproduce the behavior almost 100% of the time by pushing the same heavy I/O workload onto it.

I have one "golden" kernel under which the issue _never_ shows up (uptime easily over 45 days while pushing the same workload multiple times a day). This "golden" kernel is vmlinuz-4.12.14-lp151.28.10-default. Every kernel I have tried since then, through the Leap 15.1 updates and now the latest Leap 15.2 (vmlinuz-5.3.18-lp152.50-default), shows the SSD failure and system lockup described above. The last log output that goes to the screen is:

pcieport 000:00:1d.4: DPC: unmasked uncorrectable error detected
nvme nvme0: frozen state error detected, reset controller

I would just stay with the "golden" kernel if it were not for some other issues (an HDMI problem) that I have with /it/. The latest 5.3 kernel definitely has the HDMI issue fixed, and I'd love to move on, but cannot because of the SSD issue.

The drive is an Intel, model number HBRPEKNX0202AH.

It's been years since I last built my own custom kernels, and I was really hoping not to have to do that again. Please let me know if there is any additional information that would be useful to address this issue.

Best,
-Gerhard

Hi,

On Wed, Dec 02, 2020 at 04:31:14PM -0800, Gerhard Theurich wrote:
> pcieport 000:00:1d.4: DPC: unmasked uncorrectable error detected
> nvme nvme0: frozen state error detected, reset controller

I am not a PCI expert, but from a quick glance at some documentation I'd say the PCI controller detects an error, which gets the kernel's error recovery strategy going. This results in an NVMe controller reset, and the filesystem gets marked read-only. So this all makes sense. The obvious question is: what kind of error is detected?

Anyway, there is a kernel option to disable the error detection (pci=noaer).

One thing you could also try is to disable Active State Power Management (ASPM), see https://www.thomas-krenn.com/de/wiki/PCIe_Bus_Error_Status_00001100_beheben (assuming you understand German :))

HTH,
Daniel
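Before changing any boot options, it can help to see which ASPM policy the running kernel is using. The following is only a minimal sketch: it assumes the kernel exposes the policy at /sys/module/pcie_aspm/parameters/policy, with the active entry shown in square brackets. The helper name `active_aspm_policy` is made up for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed sysfs location of the ASPM policy; it is only present when the
# kernel was built with PCIe ASPM support.
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(policy_text: str) -> Optional[str]:
    """Return the bracketed (active) entry of the policy string.

    The file content is expected to look like
    '[default] performance powersave powersupersave'.
    """
    match = re.search(r"\[(\w+)\]", policy_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    if ASPM_POLICY_PATH.exists():
        policy = active_aspm_policy(ASPM_POLICY_PATH.read_text())
        print("active ASPM policy:", policy)
    else:
        print("ASPM policy file not found on this system")
```

If the reported policy is a power-saving one, that would fit the suggestion to try disabling ASPM, though it does not prove that ASPM causes the error.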

Hi Daniel,

I tried the pcie_aspm=off option, and it seems to work! At least I experienced no more freezing while running through the same scenarios that pretty consistently made it freeze previously.

In the end I switched to nvme_core.default_ps_max_latency_us=0, which someone else pointed out. So far the freezes are gone with that boot option as well.

Thank you!

Best,
-Gerhard
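To confirm after a reboot that one of these boot options actually reached the kernel, the command line in /proc/cmdline can be inspected. This is a rough sketch, not a definitive tool: the function names `parse_cmdline` and `report_workarounds` are hypothetical, and the parser splits on whitespace only, so quoted values containing spaces are not handled.

```python
from pathlib import Path
from typing import Dict, Optional

# The two boot options mentioned in this thread.
WORKAROUNDS = ("pcie_aspm", "nvme_core.default_ps_max_latency_us")


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}.

    Flags without '=' (such as 'quiet') map to None.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def report_workarounds(cmdline: str) -> Dict[str, Optional[str]]:
    """Return the value of each workaround option, or None if it is absent."""
    params = parse_cmdline(cmdline)
    return {name: params.get(name) for name in WORKAROUNDS}


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        for name, value in report_workarounds(proc_cmdline.read_text()).items():
            print(f"{name} = {value if value is not None else '(not set)'}")
```

On a system booted with nvme_core.default_ps_max_latency_us=0, the script should list that option with the value 0 and show pcie_aspm as not set.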