[Bug 959230] New: Ralink 5370 wifi connection stalls
http://bugzilla.opensuse.org/show_bug.cgi?id=959230

Bug ID: 959230
Summary: Ralink 5370 wifi connection stalls
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 42.1
Hardware: x86-64
OS: openSUSE 42.1
Status: NEW
Severity: Normal
Priority: P5 - None
Component: Kernel
Assignee: kernel-maintainers@forge.provo.novell.com
Reporter: toomas.aas@raad.tartu.ee
QA Contact: qa-bugs@suse.de
Found By: ---
Blocker: ---

I use a USB wifi adapter based on the Ralink RT5370 chip:

$ lsusb
Bus 001 Device 002: ID 148f:5370 Ralink Technology, Corp. RT5370 Wireless Adapter

Immediately after boot, the network works fine, but after some time all network communication ceases and an endless stream of messages like this keeps filling dmesg:

kernel: ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x07 failed for offset 0x7010 with error -110
kernel: ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x7010 with error -110
kernel: ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x0404 with error -110

After rebooting, I can again use the network for some time, but soon the problem reappears.

The interesting thing is that this only happens if the system has 10 GB RAM. My system has 2x1GB plus 2x4GB DIMMs, and when I remove one of the 4 GB DIMMs (thus reducing the amount of memory to 6 GB) the problem does not appear. To rule out the possibility of a bad DIMM I have run memtest86 for several hours without finding any problems. The system is also rock solid with 10 GB RAM when I remove the USB wireless adapter and use wired Ethernet.
Version information:

$ uname -a
Linux susa 4.1.13-5-default #1 SMP PREEMPT Thu Nov 26 16:35:17 UTC 2015 (49475c3) x86_64 x86_64 x86_64 GNU/Linux

$ /usr/sbin/modinfo rt2x00usb
filename: /lib/modules/4.1.13-5-default/kernel/drivers/net/wireless/rt2x00/rt2x00usb.ko
license: GPL
description: rt2x00 usb library
version: 2.3.0
author: http://rt2x00.serialmonkey.com
srcversion: 3F336708588FABE8A5041C1
depends: usbcore,rt2x00lib,mac80211
intree: Y
vermagic: 4.1.13-5-default SMP preempt mod_unload modversions
signer: openSUSE Secure Boot Signkey
sig_key: 03:32:FA:9C:BF:0D:88:BF:21:92:4B:0D:E8:2A:09:A5:4D:5D:EF:C8
sig_hashalgo: sha256

--
You are receiving this mail because:
You are on the CC list for the bug.
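The rt2x00usb_vendor_request messages in the report all share one shape, so they can be pulled apart mechanically when sifting through a long dmesg capture. The following is only a sketch: the function name, the regular expression and the small errno table are illustrative and not part of any tool mentioned in this bug. The one fact it relies on is that errno 110 is ETIMEDOUT on Linux, i.e. the USB control transfers timed out.

```python
import re

# Linux errno value seen in the report. 110 is ETIMEDOUT on Linux;
# the table is deliberately tiny and only covers what this bug shows.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}

# Pattern written against the message text quoted in the report.
LINE_RE = re.compile(
    r"rt2x00usb_vendor_request: Error - Vendor Request (0x[0-9a-fA-F]+) "
    r"failed for offset (0x[0-9a-fA-F]+) with error (-?\d+)"
)


def parse_vendor_request_errors(log_text):
    """Return one dict per rt2x00usb vendor request failure found in log_text."""
    results = []
    for match in LINE_RE.finditer(log_text):
        request, offset, err = match.groups()
        code = int(err)
        results.append({
            "request": int(request, 16),
            "offset": int(offset, 16),
            "errno": code,
            # The kernel logs negative errno values, so flip the sign for lookup.
            "name": LINUX_ERRNO_NAMES.get(-code, "unknown"),
        })
    return results


# The three lines quoted in the original report.
SAMPLE = """\
kernel: ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x07 failed for offset 0x7010 with error -110
kernel: ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x7010 with error -110
kernel: ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x0404 with error -110
"""

if __name__ == "__main__":
    for entry in parse_vendor_request_errors(SAMPLE):
        print(entry)
```

Feeding the output of dmesg to parse_vendor_request_errors() would give a quick count of how many requests failed and against which register offsets.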
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c1
Takashi Iwai
kernel: ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x07 failed for offset 0x7010 with error -110
kernel: ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x7010 with error -110
kernel: ieee80211 phy0: rt2x00usb_vendor_request: Error - Vendor Request 0x06 failed for offset 0x0404 with error -110
The error itself appears to be common with this chip; there are many reports of it on the web.
After rebooting, I can again use the network for some time, but soon the problem reappears.
Interesting thing is that this only happens if the system has 10 GB RAM. My system has 2x1GB plus 2x4GB DIMMs, and when I remove one of the 4 GB DIMMs (thus reducing the amount of memory to 6 GB) the problem does not appear. To rule out the possibility of bad DIMM I have run memtest86 for several hours without finding any problems, also the system is rock solid with 10 GB RAM when I remove the USB wireless adapter and use wired Ethernet.
So this implies that the host side influences the problem. Oliver, any clue? Some DMA issue?
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c2
Oliver Neukum
(In reply to Toomas Aas from comment #0)
Interesting thing is that this only happens if the system has 10 GB RAM. My system has 2x1GB plus 2x4GB DIMMs, and when I remove one of the 4 GB DIMMs (thus reducing the amount of memory to 6 GB) the problem does not appear. To rule out the possibility of bad DIMM I have run memtest86 for several hours without finding any problems, also the system is rock solid with 10 GB RAM when I remove the USB wireless adapter and use wired Ethernet.
So this implies that the host side influences the problem. Oliver, any clue? Some DMA issue?
It does look like DMA. We need to determine which host controller (HC) is used. Please provide the output of "lsusb -t".
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c3
--- Comment #3 from Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c8
--- Comment #8 from Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c10
Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c11
Joerg Roedel
Booting with kernel parameter 'iommu=soft' did not alleviate the problem.
Thanks for verifying this, Toomas. So the problem also shows up when no IOMMU is in use at all. Oliver, any other ideas?
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c12
Oliver Neukum
(In reply to Toomas Aas from comment #10)
Booting with kernel parameter 'iommu=soft' did not alleviate the problem.
Thanks for verifying this, Toomas. So the problem also shows up when no IOMMU is in use at all.
Oliver, any other ideas?
The EHCI driver does not touch the DMA mask. I will make a patch to force it to the low 4GB.
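To make the idea of forcing DMA to the low 4GB concrete, here is a small arithmetic sketch. It mirrors what the kernel's DMA_BIT_MASK(n) macro computes (the n lowest bits set) and checks whether a buffer is addressable under such a mask. The helper names and the example addresses are made up for illustration; the actual physical memory layout of the reporter's machine is not known from this thread.

```python
GIB = 1 << 30


def dma_bit_mask(bits):
    """Same arithmetic as the kernel's DMA_BIT_MASK(n): the n lowest bits set."""
    return (1 << bits) - 1


def reachable(addr, length, mask):
    """True if the whole buffer [addr, addr + length) lies at or below mask."""
    return addr + length - 1 <= mask


# A 32-bit mask covers exactly the low 4 GiB of the address space.
MASK32 = dma_bit_mask(32)

if __name__ == "__main__":
    # Hypothetical buffer addresses, chosen only to show both outcomes.
    for addr in (3 * GIB, 5 * GIB):
        print(hex(addr), reachable(addr, 4096, MASK32))
```

With a 32-bit mask, a page at 3 GiB is reachable while a page at 5 GiB is not, so a controller limited this way needs its buffers allocated (or bounced) below the 4 GiB boundary.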
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c13
Oliver Neukum
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c14
Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c15
Oliver Neukum
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c16
--- Comment #16 from Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c17
Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c23
Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c24
Oliver Neukum
Created attachment 698959 [details] dmesg output with debugging enabled
Here is the dmesg output that I captured today. Unfortunately, by the time I noticed that the problem had appeared, very first error messages were no longer present in dmesg buffer and I couldn't retrieve them. Hopefully this is still helpful.
According to the logs, scanning fails. Please try the kernel found at http://kernel.suse.com/packages/vanilla
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c31
--- Comment #31 from Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c32
Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230
Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c33
Oliver Neukum
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c34
--- Comment #34 from Takashi Iwai
Created attachment 693344 [details] limit DMA to 4GB
I think the patch should use dma_set_mask_and_coherent() instead. Otherwise the coherent mask will be left as is.
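The point about the coherent mask can be shown with a toy model. In the kernel a device carries two masks, one for streaming mappings and one for coherent allocations. dma_set_mask() changes only the former, while dma_set_mask_and_coherent() changes both. The class and functions below are a simplified stand-in written for this note, not kernel code, and the 64-bit starting value is an assumption chosen purely to make the difference visible.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


class ToyDevice:
    """Stand-in for the two DMA masks a device carries (assumed 64-bit start)."""

    def __init__(self):
        self.dma_mask = MASK64            # used for streaming mappings
        self.coherent_dma_mask = MASK64   # used for coherent allocations


def set_mask(dev, mask):
    """Model of dma_set_mask(): only the streaming mask changes."""
    dev.dma_mask = mask


def set_mask_and_coherent(dev, mask):
    """Model of dma_set_mask_and_coherent(): both masks change together."""
    dev.dma_mask = mask
    dev.coherent_dma_mask = mask


if __name__ == "__main__":
    a, b = ToyDevice(), ToyDevice()
    set_mask(a, MASK32)
    set_mask_and_coherent(b, MASK32)
    print(hex(a.dma_mask), hex(a.coherent_dma_mask))
    print(hex(b.dma_mask), hex(b.coherent_dma_mask))
```

In this model device a still has a wide coherent mask after set_mask(), which is the situation the comment warns about: coherent buffers could still be placed above 4GB.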
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c35
--- Comment #35 from Mel Gorman
Mel, it looks like we are seeing an issue with MM acting up on machines with a lot of RAM in 42.3. I don't suppose we can take a patch that alters MM so fundamentally. Any possible work arounds? Creative gfp flags?
Note that I'm not very active in MM at the moment due to other responsibilities. However, I think the fact that it bisects to this particular commit may be a partial coincidence.

The error messages appear to result from usb_control_msg failing when called from the driver. usb_control_msg calls kmalloc(sizeof(struct usb_ctrlrequest), GFP_NOIO), so what is likely happening is that the kmalloc fails because it is not able to reclaim many pages in that GFP context.

The commit in question has a side effect of causing kswapd to wake and reclaim from zones earlier than it does without the patch. Specifically, without the patch, zones are evenly used until kswapd is required. With the patch, the Normal zone in this case fills first, wakes kswapd, uses lower zones etc. As a side effect, the early reclaim means memory is freed earlier and a GFP_NOIO call is more likely to complete successfully on an x86 system. Altering the amount of memory so that there is a Normal zone could coincidentally alter the timing of when kswapd wakes, which may be why it's visible.

Applying the patch is not without consequences. The patch does not work in isolation; it only works properly if all the dependent patches that move all the LRU lists to the node are included, and that is a non-trivial backport that would cause KABI issues, which forces it to be 42.3-specific. If the patch is included on its own, it'll introduce page age inversion issues and so introduce regressions that are easy to detect but very difficult to trace back to the root cause.

Any tuning option from the MM side requires other patches. However, increasing min_free_kbytes *may* mitigate the problem, so it is worth trying, but it may also just delay the problem for longer. The other option is a 42.3-specific hack, which should not be forward-ported, that specifies __GFP_HIGH in usb_control_msg, as a failure to allocate there can result in complete failure of a driver. That would allow the drivers to dip into the page allocation reserves, similar to what can happen in IRQ context (which is not the context here), and hope kswapd catches up before the reserve is depleted. It might be lower risk overall.
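The effect of __GFP_HIGH described above can be sketched with a toy watermark check. It is loosely modelled on how the 4.x page allocator compares free pages against a zone's min watermark and, as far as I understand that code, lowers the threshold by half for high-priority allocations. Treat the halving, the function name and all page counts as assumptions for illustration. Lowmem reserves, per-order free lists and reclaim are left out entirely.

```python
def watermark_ok(free_pages, min_watermark, order=0, gfp_high=False):
    """Toy check: may an allocation of 2**order pages proceed?

    gfp_high models __GFP_HIGH by letting the caller dip into half of
    the reserve below the min watermark (assumed behaviour, see lead-in).
    """
    limit = min_watermark
    if gfp_high:
        limit -= limit // 2
    # Account for the extra pages a higher-order request would consume.
    free_after = free_pages - ((1 << order) - 1)
    return free_after > limit


if __name__ == "__main__":
    # Hypothetical zone state: 1500 free pages, min watermark of 2000 pages.
    print(watermark_ok(1500, 2000))                 # ordinary request
    print(watermark_ok(1500, 2000, gfp_high=True))  # request with __GFP_HIGH
```

In this model the ordinary request is refused while the high-priority one is allowed, which is the trade-off mentioned in the comment: the driver keeps working as long as kswapd refills the reserve before it runs out.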
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c36
Oliver Neukum
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c37
Oliver Neukum
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c38
--- Comment #38 from Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c39
--- Comment #39 from Takashi Iwai
Created attachment 762405 [details] Logs from booting with kernel-default-4.4.117-1.1.gd04fda6
Hi Oliver!
I installed the kernel from comment #37, but this causes a side effect. The graphical login screen is frozen. The mouse cursor moves, but clicking mouse buttons or typing on the keyboard has no effect. However, I can Ctrl+Alt+F2 to a text console and the keyboard works there. But this problem means that I cannot do any longer-term testing with this kernel.
Attached are logs from booting the new kernel and previous 4.4.114 kernel.
Hrm, there isn't anything wrong in the kernel logs, and it's the radeon driver that was overridden by drm-kmp, so it should be the same in both cases. Anyway, try testing with the nomodeset boot option temporarily, just to check the WiFi behavior. Also, later on, you can try uninstalling drm-kmp-default to see whether the graphics issue persists. If not, you can keep it uninstalled.
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c40
--- Comment #40 from Takashi Iwai
(In reply to Toomas Aas from comment #38)
Created attachment 762405 [details] Logs from booting with kernel-default-4.4.117-1.1.gd04fda6
Hi Oliver!
I installed the kernel from comment #37, but this causes a side effect. The graphical login screen is frozen. The mouse cursor moves, but clicking mouse buttons or typing on the keyboard has no effect. However, I can Ctrl+Alt+F2 to a text console and the keyboard works there. But this problem means that I cannot do any longer-term testing with this kernel.
Attached are logs from booting the new kernel and previous 4.4.114 kernel.
Hrm, there isn't anything wrong in the kernel logs, and it's the radeon driver that was overridden by drm-kmp, so it should be the same in both cases. Anyway, try testing with the nomodeset boot option temporarily, just to check the WiFi behavior.
Also, later on, you can try uninstalling drm-kmp-default to see whether the graphics issue persists. If not, you can keep it uninstalled.
The crash with drm-kmp turned out to be a side effect of the recent fix for Spectre. It's being tracked down. For now, for testing, please disable KMS via the nomodeset option or uninstall drm-kmp-default temporarily to go back to the 4.4.x radeon driver.
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c41
--- Comment #41 from Takashi Iwai
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c42
--- Comment #42 from Oliver Neukum
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c43
Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c44
--- Comment #44 from Oliver Neukum
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c45
Oliver Neukum
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c46
Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c48
Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230#c51
--- Comment #51 from Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230
Toomas Aas
http://bugzilla.opensuse.org/show_bug.cgi?id=959230
Oliver Neukum