[Bug 1018839] New: nvme device dead on cavium
http://bugzilla.suse.com/show_bug.cgi?id=1018839

            Bug ID: 1018839
           Summary: nvme device dead on cavium
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: aarch64
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-maintainers@forge.provo.novell.com
          Reporter: ro@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

happened on obs-arm-2:

(none):~ # dmesg |grep nvme
[ 7.471597] nvme nvme0: pci function 0005:90:00.0
[ 9.028671] nvme 0005:90:00.0: Could not set queue count (6)
[ 9.034325] nvme nvme0: IO queues not created
(none):~ # uname -a
Linux (none) 4.8.14-1-default #1 SMP Mon Dec 12 07:58:11 UTC 2016 (ab53e9a) aarch64 aarch64 aarch64 GNU/Linux

during firmware init I see:

AMI NVMe BUS Driver.Start(3FFE9B3738)= Entered
....
EfiSelReportStatusCode Value: 2080000
Checkpoint A0
NVMe Driver Detection and Configuration starts
NVMe Driver Detection and Configuration Ends with Status Device Error
Device Error
BDS.RunDrivers(fe97c2a0)
BDS.InitConVars(fe97d368)

--
You are receiving this mail because:
You are on the CC list for the bug.
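For anyone triaging several machines for the same symptom, the dmesg check in the report can be scripted. The following is only a rough sketch in Python: the sample lines are copied from the report, while the function name `failed_nvme_lines` and the choice of the two messages as failure markers are assumptions made for illustration, not part of any kernel or SUSE tooling.

```python
import re

# Sample lines copied from the report above.
DMESG = """\
[ 7.471597] nvme nvme0: pci function 0005:90:00.0
[ 9.028671] nvme 0005:90:00.0: Could not set queue count (6)
[ 9.034325] nvme nvme0: IO queues not created
"""

# Assumption: these two messages are treated as the failure markers,
# because they are the ones that appear in this report.
FAILURE_PATTERNS = ("Could not set queue count", "IO queues not created")

# Matches "[ <timestamp>] nvme <device>: <message>".
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+nvme\s+(?P<dev>\S+):\s+(?P<msg>.*)$"
)


def failed_nvme_lines(dmesg_text):
    """Return (timestamp, device, message) for each failure line found."""
    hits = []
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line)
        if m and any(p in m.group("msg") for p in FAILURE_PATTERNS):
            hits.append((float(m.group("ts")), m.group("dev"), m.group("msg")))
    return hits


if __name__ == "__main__":
    for ts, dev, msg in failed_nvme_lines(DMESG):
        print(f"{ts:.6f} {dev}: {msg}")
```

In practice the text would come from `dmesg` output captured on the affected host rather than from the embedded sample.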
http://bugzilla.suse.com/show_bug.cgi?id=1018839
Ruediger Oertel
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c1
Johannes Thumshirn
http://bugzilla.suse.com/show_bug.cgi?id=1018839
Alexander Graf
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c2
--- Comment #2 from Ruediger Oertel
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c3
--- Comment #3 from Ruediger Oertel
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c4
--- Comment #4 from Johannes Thumshirn
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c5
--- Comment #5 from Johannes Thumshirn
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c6
--- Comment #6 from Johannes Thumshirn
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c7
--- Comment #7 from Keith Busch
http://bugzilla.suse.com/show_bug.cgi?id=1018839
Johannes Thumshirn
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c8
--- Comment #8 from Johannes Thumshirn
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c9
--- Comment #9 from Keith Busch
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c10
--- Comment #10 from Keith Busch
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c11
--- Comment #11 from Alexander Graf
> Okay, the ASSERT_405C786C issue is already being tracked in the SSD firmware bug tracker for f/w revision 8DV101F0. Did you receive the drives with that firmware, or did you upgrade to this firmware revision? Were these drives ever working?

These drives were shipped with that particular firmware. We only received them a few weeks ago.
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c12
--- Comment #12 from Alexander Graf
> I am told this particular assert has multiple root causes, and that the next version of the maintenance release (firmware ending in 1H0) has fixes for the known reasons.

Is there any ETA? There's a really good chance we'll lose more drives soon and I'd rather not have to RMA them.

> I am also told that asserted drives must be RMA'd as there is no Intel-provided tool available for end-users outside Intel that can restore the drive to a working condition. Can you return these drives and request replacements using the latest maintenance release?

The latest maintenance release is still 1F0, isn't it? So the replacements would still come with faulty firmware. But yes, we'll have to RMA the known bad ones then I suppose.
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c13
--- Comment #13 from Keith Busch
> The latest maintenance release is still 1F0, isn't it? So the replacements would still come with faulty firmware.

There is a newer version, 1H0, that is classified as "released", but I don't see it on the Intel download site either. I'll check with SI lead on where we're providing that.
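Since replacement drives may arrive with either revision, it can help to check the running firmware before putting a drive into service. The sketch below, in Python, parses the `fr` (firmware revision) field out of `nvme id-ctrl` text output. The sample text, its exact column layout, and the names `firmware_revision`, `needs_update` and `KNOWN_AFFECTED` are assumptions for illustration; only the revision string 8DV101F0 comes from this thread.

```python
import re

# Revision named in this thread as showing the assert.
KNOWN_AFFECTED = {"8DV101F0"}

# Hypothetical excerpt of `nvme id-ctrl /dev/nvme0` output; the values
# other than "fr" are placeholders and the spacing is assumed.
SAMPLE_ID_CTRL = """\
sn      : EXAMPLE-SERIAL
mn      : EXAMPLE-MODEL
fr      : 8DV101F0
frmw    : 0x2
"""


def firmware_revision(id_ctrl_text):
    """Return the value of the 'fr' field, or None if it is not present."""
    # Require whitespace or ':' right after "fr" so "frmw" is not matched.
    m = re.search(r"^fr\s*:\s*(\S+)", id_ctrl_text, re.MULTILINE)
    return m.group(1) if m else None


def needs_update(revision):
    """True if the revision is one of those reported as affected here."""
    return revision in KNOWN_AFFECTED


if __name__ == "__main__":
    rev = firmware_revision(SAMPLE_ID_CTRL)
    print(rev, "needs update" if needs_update(rev) else "not listed as affected")
```

A revision missing from `KNOWN_AFFECTED` is only "not listed", not confirmed good; the set reflects what was said in this bug and nothing more.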
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c14
--- Comment #14 from Keith Busch
http://bugzilla.suse.com/show_bug.cgi?id=1018839
Ihno Krumreich
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c15
--- Comment #15 from Johannes Thumshirn
> Lacking the released tools with the 1H0 firmware, I can provide a signed binary that can be loaded onto the controller in the nvme standard way. However, I don't have any tools available to clear the assert to make the drives usable again. So, if the replacement drives come with 1F0 firmware, I can at least provide a means to upgrade to the ones that have all known fixes to the assert.

Keith, can you mail us the binary or even attach it to this bug? Thanks a lot.
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c16
--- Comment #16 from Keith Busch
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c17
Johannes Thumshirn
http://bugzilla.suse.com/show_bug.cgi?id=1018839
Ihno Krumreich
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c18
Ruediger Oertel
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c19
--- Comment #19 from Keith Busch
> fw download seems to work, but apparently there is no way to actually activate this firmware (or the firmware is not in a format this drive accepts):
>
> # nvme fw-download /dev/nvme0 --fw=/root/8DV101H0.bin
> Firmware download success
> ibs-arm-1:~ # nvme fw-activate /dev/nvme0 -s 0 -a 1
> NVME Admin command error:FIRMWARE_IMAGE(107)
> ibs-arm-1:~ # nvme fw-activate /dev/nvme0 -s 0 -a 0
> NVME Admin command error:INVALID_FIELD(2)
> ibs-arm-1:~ # nvme fw-activate /dev/nvme0 -s 0 -a 2
> Success activating firmware action:2 slot:0
>
> okay, action 2 succeeds, but that does not involve the newly uploaded/downloaded firmware.
Action 1 is required to activate new firmware, and the response says the image is invalid, meaning the signing on this firmware is not compatible with the signature in your running firmware. That's what I was afraid would happen, because I can't do anything about that. The host software I develop is on the wrong side of this problem, as I have no visibility into firmware building and packaging.
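To make the three attempts easier to read, here is a small Python sketch that spells out the commit actions and decodes the two error values. The action descriptions and status names follow the NVMe Firmware Commit definition as I understand it. It is an assumption that the number nvme-cli prints in parentheses is the status in hexadecimal; that reading is at least consistent with FIRMWARE_IMAGE mapping to command-specific status 0x07. The names `COMMIT_ACTIONS`, `STATUS_NAMES` and `decode_status` are made up for this example.

```python
# Commit action values of the NVMe Firmware Commit (activate) command.
COMMIT_ACTIONS = {
    0: "downloaded image replaces the image in the slot, not activated",
    1: "downloaded image replaces the image in the slot, activated at next reset",
    2: "image already in the slot is activated at next reset",
    3: "image in the slot is activated immediately, without reset",
}

# (status code type, status code) -> name; only the codes relevant here.
STATUS_NAMES = {
    (0, 0x02): "Invalid Field in Command",
    (1, 0x06): "Invalid Firmware Slot",
    (1, 0x07): "Invalid Firmware Image",
}


def decode_status(hex_text):
    """Split a status value into (SCT, SC, name).

    Assumption: hex_text is the hexadecimal number shown in parentheses
    in the nvme-cli error message, e.g. "107".
    """
    value = int(hex_text, 16)
    sct = (value >> 8) & 0x7   # status code type, bits 10:8
    sc = value & 0xFF          # status code, bits 7:0
    return sct, sc, STATUS_NAMES.get((sct, sc), "unknown")


if __name__ == "__main__":
    # The three attempts from the quoted session: (action, printed status).
    for action, status in ((1, "107"), (0, "2"), (2, "0")):
        sct, sc, name = decode_status(status)
        result = "success" if (sct, sc) == (0, 0) else name
        print(f"action {action} ({COMMIT_ACTIONS[action]}): {result}")
```

Read this way, only action 2 succeeded, and it refers to whatever image is already in the slot, which matches the observation that the downloaded 1H0 image was never involved.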
http://bugzilla.suse.com/show_bug.cgi?id=1018839
http://bugzilla.suse.com/show_bug.cgi?id=1018839#c20
Johannes Thumshirn