[Bug 1103356] New: nouveau: fan stays on maximum speed after fanboost
http://bugzilla.suse.com/show_bug.cgi?id=1103356

            Bug ID: 1103356
           Summary: nouveau: fan stays on maximum speed after fanboost
    Classification: openSUSE
           Product: openSUSE Distribution
           Version: Leap 15.0
          Hardware: x86-64
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Kernel
          Assignee: kernel-maintainers@forge.provo.novell.com
          Reporter: thomas.blume@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

I have an older nvidia card:

-->
# hwinfo --gfxcard
23: PCI 100.0: 0300 VGA compatible controller (VGA)
  [Created at pci.378]
  Unique ID: VCu0.x9HhAPKYST4
  Parent ID: vSkL.sBCJa6uSmM6
  SysFS ID: /devices/pci0000:00/0000:00:01.0/0000:01:00.0
  SysFS BusID: 0000:01:00.0
  Hardware Class: graphics card
  Model: "nVidia Quadro FX 1500"
  Vendor: pci 0x10de "nVidia Corporation"
  Device: pci 0x029e "Quadro FX 1500"
  SubVendor: pci 0x10de "nVidia Corporation"
  SubDevice: pci 0x032c
  Revision: 0xa1
  Driver: "nouveau"
  Driver Modules: "drm"
  Memory Range: 0xf2000000-0xf2ffffff (rw,non-prefetchable)
  Memory Range: 0xe0000000-0xefffffff (ro,non-prefetchable)
  Memory Range: 0xf1000000-0xf1ffffff (rw,non-prefetchable)
  I/O Ports: 0x4000-0x4fff (rw)
  IRQ: 28 (8673 events)
  I/O Ports: 0x3c0-0x3df (rw)
  Module Alias: "pci:v000010DEd0000029Esv000010DEsd0000032Cbc03sc00i00"
  Driver Info #0:
    XFree86 v4 Server Module: nv
  Config Status: cfg=new, avail=yes, need=no, active=unknown
  Attached to: #8 (PCI bridge)
  Primary display adapter: #23
--<

This worked fine with the nouveau driver until Leap 42.3. With Leap 15, the graphics card fan now starts running at maximum speed and never stops. The switch to maximum speed might be related to temperature management.
I get the following messages:

-->
2018-08-01T07:16:01.878010+02:00 alpha kernel: [ 379.049048] nouveau 0000:01:00.0: therm: temperature (90 C) hit the 'fanboost' threshold
2018-08-01T07:16:08.881928+02:00 alpha kernel: [ 386.049548] nouveau 0000:01:00.0: therm: temperature (87 C) went below the 'fanboost' threshold
--<

I would expect the fan speed to decrease after the temperature goes below the fanboost threshold, but it doesn't. As written above, on 42.3 the fan stays nice and quiet at a low rotation speed:

-->
# cat /sys/class/drm/card0/device/hwmon/hwmon0/pwm1
20
--<

I am attaching the sysfs values of the nouveau hwmon from 42.3 and 15. Any hint what I need to tune to get the 42.3 behaviour back?

--
You are receiving this mail because:
You are on the CC list for the bug.
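For comparing the two releases, the `cat` check above can be wrapped in a small helper. This is only a sketch: the path is the one quoted in the report, the `hwmon0` index may differ on other machines, and the names `PWM1_PATH` and `read_pwm` are made up for illustration.

```python
from pathlib import Path

# Path taken from the report; the hwmon index is an assumption and may differ.
PWM1_PATH = Path("/sys/class/drm/card0/device/hwmon/hwmon0/pwm1")


def read_pwm(path=PWM1_PATH):
    """Return the integer duty value stored in a hwmon pwm file."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    # Only read the real file when it exists, so the script is harmless elsewhere.
    if PWM1_PATH.exists():
        print("pwm1:", read_pwm())
    else:
        print("no nouveau hwmon pwm1 at", PWM1_PATH)
```

Running it once under each kernel gives a value that can be compared directly with the 20 reported for 42.3.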
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c1
--- Comment #1 from Thomas Blume
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c2
--- Comment #2 from Thomas Blume
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c3
Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c4
--- Comment #4 from Thomas Blume
> Did you try the very latest kernel in OBS Kernel:openSUSE-15.0? Basically the nouveau drm driver on Leap 15.0 took all 4.14.y backports. Something might be missing on the hwmon side, though.
Thanks for the hint, Takashi. I've tried with:

kernel-default-4.12.14-lp150.93.1.g8ee019b.x86_64

from https://download.opensuse.org/repositories/Kernel:/openSUSE-15.0/standard/ but the fan still runs at high speed, creating noise. /sys/class/drm/card0/device/hwmon/hwmon0/pwm1 shows it running at 70%, with the GPU temperature at 72°C. After rebooting into 42.3, it goes back to 20%.
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c5
--- Comment #5 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c6
--- Comment #6 from Thomas Blume
> So you tested the Leap 42.3 kernel on top of the Leap 15.0 system, or tested the whole Leap 42.3 system? If the latter, please try the former; just install the Leap 42.3 kernel on top of Leap 15.0 (with --force --oldpackage or whatever option), and check whether the problem goes away with it.
>
> If the problem doesn't happen with the Leap 42.3 kernel, then please try the TW kernel. I checked the nouveau_hwmon code, but there is no significant change, at least. So it must really be a high temperature due to some incorrect mode (no proper power saving, etc.).
The problem doesn't happen with the Leap 42.3 kernel on top of Leap 15. It returns when installing the Tumbleweed kernel on Leap 15. Checking the kernel log for differences, I've found that the message below is only shown with the 42.3 kernel:

-->
2018-08-01T12:48:09.794514+02:00 linux-rr7g kernel: [ 7.769999] nouveau 0000:01:00.0: DRM: 0xC73F: Parsing digital output script table
--<
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c7
--- Comment #7 from Takashi Iwai
> The problem doesn't happen with the Leap 42.3 kernel on top of Leap 15. It returns when installing the tumbleweed kernel on Leap 15.

Thanks, so this is a regression that still remains upstream. Could you report it to upstream? E.g. bugzilla.freedesktop.org, category DRI/Nouveau.

> Checking the kernel log for differences, I've found that the message below is only shown with the 42.3 kernel:
>
> -->
> 2018-08-01T12:48:09.794514+02:00 linux-rr7g kernel: [ 7.769999] nouveau 0000:01:00.0: DRM: 0xC73F: Parsing digital output script table
> --<

This is part of the BIOS parsing code, so something relevant to it might be missing in the recent kernel...
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c8
--- Comment #8 from Thomas Blume
> (In reply to Thomas Blume from comment #6)
> > The problem doesn't happen with the Leap 42.3 kernel on top of Leap 15. It returns when installing the tumbleweed kernel on Leap 15.
>
> Thanks, so this is a still remaining regression. Could you report it to upstream? e.g. bugzilla.freedesktop.org category DRI/Nouveau.
Ah, that was the right pointer. I've checked the bug reports there and found the debug option for the nouveau driver. Activating it, I can see this:

-->
# grep 'therm' /mnt/dmesg-4_17.txt
[ 6.595497] nouveau 0000:01:00.0: therm: FAN control: PWM
[ 6.595504] nouveau 0000:01:00.0: therm: parsing the fan table failed
[ 6.595515] nouveau 0000:01:00.0: therm: fan management: automatic
[ 6.595520] nouveau 0000:01:00.0: therm: FAN target request: 70%
[ 6.595525] nouveau 0000:01:00.0: therm: FAN target: 70
[ 6.595529] nouveau 0000:01:00.0: therm: FAN update: 23
[ 6.595538] nouveau 0000:01:00.0: therm: internal sensor: yes
[ 6.615401] nouveau 0000:01:00.0: therm: programmed thresholds [ 90(3), 95(3), 130(2), 135(5) ]
[ 7.095580] nouveau 0000:01:00.0: therm: FAN update: 26
[ 7.595674] nouveau 0000:01:00.0: therm: FAN update: 29
[ 8.095757] nouveau 0000:01:00.0: therm: FAN update: 32
[ 8.595853] nouveau 0000:01:00.0: therm: FAN update: 35
[ 9.095938] nouveau 0000:01:00.0: therm: FAN update: 38
[ 9.596029] nouveau 0000:01:00.0: therm: FAN update: 41
[ 10.096105] nouveau 0000:01:00.0: therm: FAN update: 44
[ 10.597783] nouveau 0000:01:00.0: therm: FAN update: 47
[ 11.099110] nouveau 0000:01:00.0: therm: FAN update: 50
[ 11.600452] nouveau 0000:01:00.0: therm: FAN update: 53
[ 12.101842] nouveau 0000:01:00.0: therm: FAN update: 56
[ 12.603128] nouveau 0000:01:00.0: therm: FAN update: 59
[ 13.104425] nouveau 0000:01:00.0: therm: FAN update: 62
[ 13.604474] nouveau 0000:01:00.0: therm: FAN update: 65
[ 14.104522] nouveau 0000:01:00.0: therm: FAN update: 68
[ 14.606060] nouveau 0000:01:00.0: therm: FAN update: 70
--<

Preparing the upstream bug report.
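To make the ramp in the debug log of comment #8 easier to see, the "FAN update" lines can be pulled out with a short script. This is a rough sketch, not part of the bug report: the regular expression only assumes the line layout shown in that log, `fan_ramp` and `sample_log` are hypothetical names, and the sample holds just a few of the quoted lines.

```python
import re

# Matches e.g. "[ 6.595529] nouveau 0000:01:00.0: therm: FAN update: 23"
FAN_UPDATE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+nouveau \S+ therm: FAN update: (\d+)"
)


def fan_ramp(lines):
    """Return (timestamp_seconds, duty) pairs for every 'FAN update' line."""
    samples = []
    for line in lines:
        match = FAN_UPDATE.search(line)
        if match:
            samples.append((float(match.group(1)), int(match.group(2))))
    return samples


# A few lines copied from the log in comment #8.
sample_log = [
    "[ 6.595520] nouveau 0000:01:00.0: therm: FAN target request: 70%",
    "[ 6.595525] nouveau 0000:01:00.0: therm: FAN target: 70",
    "[ 6.595529] nouveau 0000:01:00.0: therm: FAN update: 23",
    "[ 7.095580] nouveau 0000:01:00.0: therm: FAN update: 26",
    "[ 7.595674] nouveau 0000:01:00.0: therm: FAN update: 29",
    "[ 14.104522] nouveau 0000:01:00.0: therm: FAN update: 68",
    "[ 14.606060] nouveau 0000:01:00.0: therm: FAN update: 70",
]

if __name__ == "__main__":
    for timestamp, duty in fan_ramp(sample_log):
        print(f"{timestamp:10.6f}s  duty={duty}")
```

In the full log the duty climbs from 23 to the requested 70 in steps of 3 about every half second, so the target is reached roughly eight seconds after the driver loads.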
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c9
--- Comment #9 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c12
--- Comment #12 from Thomas Blume
> I'm building a test kernel with the revert of a suspected commit (800efb4c2857ec543). It's being built on OBS home:tiwai:bsc1103356 repo.
>
> Could you give it a try later?
This build fixes the issue on my machine. The dmesg logs show:

-->
Aug 03 11:45:19 linux-rr7g kernel: nouveau 0000:01:00.0: therm: FAN control: PWM
Aug 03 11:45:19 linux-rr7g kernel: nouveau 0000:01:00.0: therm: parsing the fan table failed
Aug 03 11:45:19 linux-rr7g kernel: nouveau 0000:01:00.0: therm: fan management: automatic
Aug 03 11:45:19 linux-rr7g kernel: nouveau 0000:01:00.0: therm: internal sensor: yes
Aug 03 11:45:19 linux-rr7g kernel: nouveau 0000:01:00.0: therm: programmed thresholds [ 90(3), 95(3), 130(2), 135(5) ]
--<

and the fan speed shows:

-->
# cat /sys/class/drm/card0/device/hwmon/hwmon0/pwm1
20
--<
Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c13
--- Comment #13 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c14
--- Comment #14 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c15
--- Comment #15 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c19
Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c20
--- Comment #20 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c23
--- Comment #23 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c24
--- Comment #24 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c25
--- Comment #25 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c26
--- Comment #26 from Thomas Blume
> I took a deeper look at the patch, and it looks like either the reported temperature or the reported duty value is wrong.
>
> For further debugging, I'm building a test kernel that adds some debug prints (via nvkm_debug() calls). It reverts the revert patch and should show the buggy behavior again. Please test it later, and send back the debug messages (they appear as "XXX ...").
Sorry, I was on vacation. Will test ASAP.
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c27
Thomas Blume
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c28
Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c29
--- Comment #29 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c30
--- Comment #30 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c31
Thomas Blume
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c32
--- Comment #32 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c33
--- Comment #33 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c34
--- Comment #34 from Thomas Blume
> Thanks. The behavior this time looks normal, doesn't it?
>
> The measured temperature at start was 77.14C (= 7154 * 458/10000 - 25051/100), and it went down to 55.89C (raw value 6690), then slightly up to 68C (raw value 6968).
>
> Does the boost behavior appear when you turn off the console loglevel?
>
> And my wild guess now is that it's because polling is disabled when entering this mode. Will cook up another test patch.
The fan boost is indeed gone, with or without nouveau debug logging. Still, the fan stays noisy: /sys/devices/pci0000\:00/0000\:00\:01.0/0000\:01\:00.0/hwmon/hwmon0/pwm1 shows 78 with debug logging and 80 without, even though there is very low GPU load. I'd expect the fan speed to decrease as the GPU cools down. Will now try your latest patch and report.
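The temperature arithmetic quoted in comment #34 can be checked with a few lines of Python. This is only a sketch of that one formula: the slope 458/10000 and offset 25051/100 are the values quoted there, the names `SLOPE`, `OFFSET` and `raw_to_celsius` are made up for illustration, and nothing here says whether the same constants apply to other cards.

```python
from fractions import Fraction

# Constants as quoted in the comment; they may be specific to this card.
SLOPE = Fraction(458, 10000)
OFFSET = Fraction(25051, 100)


def raw_to_celsius(raw):
    """Convert a raw sensor reading to degrees Celsius: raw * slope - offset."""
    return float(raw * SLOPE - OFFSET)


if __name__ == "__main__":
    # Raw readings mentioned in the comment.
    for raw in (7154, 6690, 6968):
        print(raw, "->", round(raw_to_celsius(raw), 2), "C")
```

The three raw readings come out at about 77.14, 55.89 and 68.62 degrees, which matches the figures in the comment; all of them are well below the 90 C fanboost threshold from the log.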
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c35
--- Comment #35 from Thomas Blume
> A test kernel with a hopefully working patch is being built in OBS home:tiwai:bsc1103356-test2 repo. Please give it a try later.
Looks better: no more fan boost, but the fan still reaches an annoying noise level. The fan speed stays at 63. Attaching the new debug logs below.
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c36
Thomas Blume
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c37
--- Comment #37 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c38
--- Comment #38 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c39
--- Comment #39 from Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c40
--- Comment #40 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c41
--- Comment #41 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c44
--- Comment #44 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c45
--- Comment #45 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c47
--- Comment #47 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c48
--- Comment #48 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c50
Takashi Iwai
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c51
--- Comment #51 from Swamp Workflow Management
http://bugzilla.suse.com/show_bug.cgi?id=1103356#c52
--- Comment #52 from Swamp Workflow Management