[Bug 1082704] New: compiling nvidia kernel modules uses wrong kernel tree?
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704

            Bug ID: 1082704
           Summary: compiling nvidia kernel modules uses wrong kernel tree?
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: X11 3rd Party Driver
          Assignee: xorg-maintainer-bugs@forge.provo.novell.com
          Reporter: P.Suetterlin@royac.iac.es
        QA Contact: sndirsch@suse.com
          Found By: ---
           Blocker: ---

My desktop has an NVidia GTX 1060. I use the proprietary driver from the nvidia repo and a recent Tumbleweed. The driver's kernel module package is nvidia-gfxG04-kmp-default-390.25_k4.15.2_1-10.1.x86_64; the kernel (TW 20180221) is 4.15.4-1-default.

To check some other issue, I rebooted into the previous kernel (4.15.2-1-default) today. The drivers wouldn't load.

I have

/lib/modules/4.15.2-1-default/updates/nvidia.ko
/lib/modules/4.15.4-1-default/weak-updates/updates/nvidia.ko -> /lib/modules/4.15.2-1-default/updates/nvidia.ko

However,

modinfo /lib/modules/4.15.2-1-default/updates/nvidia.ko
filename:       /lib/modules/4.15.2-1-default/updates/nvidia.ko
alias:          char-major-195-*
version:        390.25
supported:      external
license:        NVIDIA
srcversion:     B5B1CA3087B567ADFADC070
alias:          pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
depends:        ipmi_msghandler
retpoline:      Y
name:           nvidia
vermagic:       4.15.4-1-default SMP preempt mod_unload modversions

or

strings /lib/modules/4.15.2-1-default/updates/nvidia.ko | grep 4\\.15
4.15.4-1H
/usr/src/linux-4.15.4-1/include/linux/dma-mapping.h
/usr/src/linux-4.15.4-1/include/linux/dma-mapping.h
......

It seems the module was compiled using the 4.15.4 kernel tree, but installed in the 4.15.2 modules directory?
No manual compiles etc., just using standard 'zypper dup'.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
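The mismatch reported above (vermagic naming one kernel release, the install path naming another) can be checked mechanically. The following is only a minimal sketch: it assumes `modinfo -F vermagic` is available, and the helper names `kver_from_path`, `kver_from_vermagic` and `check_module` are made up for illustration and are not part of any package.

```shell
#!/bin/sh
# Hypothetical helpers: compare the kernel release a module was built for
# with the /lib/modules/<release>/ tree it is installed in.

# Extract <release> from a path of the form /lib/modules/<release>/...
kver_from_path() {
    printf '%s\n' "$1" | sed -n 's|^/lib/modules/\([^/]*\)/.*$|\1|p'
}

# The first word of a vermagic string is the kernel release.
kver_from_vermagic() {
    printf '%s\n' "${1%% *}"
}

# Report whether a module file matches the tree it lives in.
check_module() {
    ko=$1
    built_for=$(kver_from_vermagic "$(modinfo -F vermagic "$ko")")
    installed_in=$(kver_from_path "$ko")
    if [ "$built_for" = "$installed_in" ]; then
        echo "OK: $ko"
    else
        echo "MISMATCH: $ko built for $built_for, installed in $installed_in"
    fi
}
```

Calling `check_module /lib/modules/4.15.2-1-default/updates/nvidia.ko` on the reporter's system would be expected to flag the module, since its vermagic names 4.15.4-1-default.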
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c3

Stefan Dirsch <sndirsch@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |P.Suetterlin@royac.iac.es
              Flags|                            |needinfo?(P.Suetterlin@royac.iac.es)

--- Comment #3 from Stefan Dirsch <sndirsch@suse.com> ---
First, there is nothing wrong with the weak-updates compatibility symlink. Please attach dmesg output with the kernel where it doesn't work.
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c4

Peter Sütterlin <P.Suetterlin@royac.iac.es> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(P.Suetterlin@royac.iac.es) |

--- Comment #4 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
No, of course the weak-update links are as they should be. But it should be the 4.15.2 module in /lib/modules/4.15.2-1-default/updates. It obviously is not.

Here's the dmesg when I tried to manually load the module in kernel 4.15.2 (I had booted to runlevel 3, logged in on console and typed the modprobe command):

Feb 24 20:07:05 lux login[1775]: ROOT LOGIN ON tty1
Feb 24 20:07:23 lux kernel: nvidia: disagrees about version of symbol kmem_cache_alloc_trace
Feb 24 20:07:23 lux kernel: nvidia: Unknown symbol kmem_cache_alloc_trace (err -22)
Feb 24 20:07:23 lux kernel: nvidia: disagrees about version of symbol kmem_cache_alloc
Feb 24 20:07:23 lux kernel: nvidia: Unknown symbol kmem_cache_alloc (err -22)
Feb 24 20:07:23 lux kernel: nvidia: disagrees about version of symbol kmem_cache_free
Feb 24 20:07:23 lux kernel: nvidia: Unknown symbol kmem_cache_free (err -22)

There were similar/identical messages earlier in the boot, as the machine is nvidia-only and the module is supposed to load already from initrd, which of course also failed.

It is supposedly the same issue that you requested a separate report for in
https://bugzilla.opensuse.org/show_bug.cgi?id=1080742#c113
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c5

--- Comment #5 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
Created attachment 761606
  --> http://bugzilla.opensuse.org/attachment.cgi?id=761606&action=edit
zypp history from upgrade

This is the /var/log/zypp/history from the upgrade that installed both the new kernel (4.15.4) and the nvidia modules.

Line 148 is the start of the additional rpm output for kernel-default-devel-4.15.4-1.5, which compiles the nvidia module in /usr/src/linux-4.15.4-1-obj/x86_64/default.

Lines 1156+ then do this again when installing nvidia-gfxG04-kmp-default-390.25_k4.15.2_1-10.1, again in the new kernel tree (line 1359).

The log starts with warnings that

# Warning: /lib/modules/4.15.1-1-default is inconsistent
# Warning: /lib/modules/4.15.2-1-default is inconsistent

(lines 1328/1337)

To me it looks like the installation scripts use the latest/highest installed kernel tree, but are hardcoded where to store the result.
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c7

--- Comment #7 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
(In reply to Stefan Dirsch from comment #6)
> My guess is, that you have various versions of kernel-default-devel, kernel-devel, kernel-source installed.

Yes, indeed.

lux:~ # rpm -q kernel-default-devel kernel-devel
kernel-default-devel-4.15.2-1.4.x86_64
kernel-default-devel-4.15.4-1.5.x86_64
kernel-devel-4.15.2-1.4.noarch
kernel-devel-4.15.4-1.5.noarch

nvidia-gfxG04-kmp-default requires kernel-default-devel, and this gets (also) updated with every new kernel version when running 'zypper dup'. So I'd assume every user of the nvidia repo would be in that situation?

> I cannot investigate that issue without direct access to the system. If you want to investigate the issue yourself, you would need to run %post of nvidia-gfxG04-kmp-default manually. Please check out via
>
> rpm --scripts -q nvidia-gfxG04-kmp-default

I had a look at those before, and also at the makefiles etc., but so far could not spot where it decides to use the latest devel version... going through it again now. There's no mention of any update-alternatives though.

The postinstall scriptlet only consists of

-----
postinstall scriptlet (using /bin/sh):
nvr=nvidia-gfxG04-kmp-default-390.25_k4.15.2_1-10.1
wm2=/usr/lib/module-init-tools/weak-modules2
if [ -x $wm2 ]; then
    INITRD_IN_POSTTRANS=1 /bin/bash -${-/e/} $wm2 --add-kmp $nvr
fi
-----

And running that does not compile anything; I think it only creates the links.

So then the suspicion is that it is the kernel-default-devel package. The zypper log shows that that one did actually compile the nvidia modules. I checked the postinstall of that one, but that only does the ..obj links. Still, the log has

# 2018-02-22 23:24:25 kernel-default-devel-4.15.4-1.5.x86_64.rpm installed ok
# Additional rpm output:
# Changing symlink /usr/src/linux-obj/x86_64/default from ../../linux-4.15.2-1-obj/x86_64/default to ../../linux-4.15.4-1-obj/x86_64/default
# /usr/src/kernel-modules/nvidia-390.25-default /
# rm -f -r conftest
# make[1]: Entering directory '/usr/src/linux-4.15.2-1'

The 'Changing symlink' is from the post script. No idea why it starts compiling now. Does it call dkms or something?
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c8

--- Comment #8 from Stefan Dirsch <sndirsch@suse.com> ---
This sounds weird. On my test system:

# rpm --scripts -q nvidia-gfxG04-kmp-default
[...]
postinstall scriptlet (using /bin/sh):
arch=x86_64
flavor=default
kver=$(make -sC /usr/src/linux-obj/$arch/$flavor kernelrelease)
make -C /usr/src/linux-obj/$arch/$flavor \
     modules \
     M=/usr/src/kernel-modules/nvidia-390.25-$flavor \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
pushd /usr/src/kernel-modules/nvidia-390.25-$flavor
make -f Makefile \
     nv-linux.o \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
popd
install -m 755 -d /lib/modules/4.4.76-1-$flavor/updates
install -m 644 /usr/src/kernel-modules/nvidia-390.25-$flavor/nvidia*.ko \
        /lib/modules/4.4.76-1-$flavor/updates
depmod 4.4.76-1-$flavor
[...]

Oh. And there are also trigger scripts to rebuild the kernel module once a new kernel is being installed (kind of KMS reimplemented on package level).

# rpm --triggers -q nvidia-gfxG04-kmp-default
triggerpostun scriptlet (using /bin/sh) -- drm-kmp-default
flavor=default
pushd /usr/src/kernel-modules/nvidia-390.25-$flavor || true
cp -a Makefile{,.tmp} || true
make clean || true
mv Makefile{.tmp,} || true
popd || true
arch=x86_64
flavor=default
kver=$(make -sC /usr/src/linux-obj/$arch/$flavor kernelrelease)
make -C /usr/src/linux-obj/$arch/$flavor \
     modules \
     M=/usr/src/kernel-modules/nvidia-390.25-$flavor \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
pushd /usr/src/kernel-modules/nvidia-390.25-$flavor
make -f Makefile \
     nv-linux.o \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
popd
install -m 755 -d /lib/modules/4.4.76-1-$flavor/updates
install -m 644 /usr/src/kernel-modules/nvidia-390.25-$flavor/nvidia*.ko \
        /lib/modules/4.4.76-1-$flavor/updates
depmod 4.4.76-1-$flavor
[...]

If you have completely different scripts in your packages, you're using completely different packages.
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c9

--- Comment #9 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
I definitely have different packages, as I am running Tumbleweed, whereas yours seem to indicate Leap.

I just re-downloaded the rpm package from nvidia and checked the scripts again. They are identical, and the postinstall only has the mentioned call to /usr/lib/module-init-tools/weak-modules2.

And while I did have a look at the triggers of the kernel-default-devel package (there are none), I forgot to do the same for the nvidia package :((

And indeed, they show what I suspected:

--------
triggerin scriptlet (using /bin/sh) -- kernel-default-devel
flavor=default
pushd /usr/src/kernel-modules/nvidia-390.25-$flavor || true
cp -a Makefile{,.tmp} || true
make clean || true
mv Makefile{.tmp,} || true
popd || true
arch=x86_64
flavor=default
kver=$(make -sC /usr/src/linux-obj/$arch/$flavor kernelrelease)
make -C /usr/src/linux-obj/$arch/$flavor \
     modules \
     M=/usr/src/kernel-modules/nvidia-390.25-$flavor \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
pushd /usr/src/kernel-modules/nvidia-390.25-$flavor
make -f Makefile \
     nv-linux.o \
     SYSSRC=/lib/modules/$kver/source \
     SYSOUT=/usr/src/linux-obj/$arch/$flavor
popd
install -m 755 -d /lib/modules/4.15.2-1-$flavor/updates
install -m 644 /usr/src/kernel-modules/nvidia-390.25-$flavor/nvidia*.ko \
        /lib/modules/4.15.2-1-$flavor/updates
depmod 4.15.2-1-$flavor
---------

It compiles for the latest kernel (kver resolves to 4.15.4-1-default), but then installs the modules to the (fixed) location of the old modules.

IMHO the install should go to /lib/modules/$kver/updates
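The suggestion in comment #9 amounts to deriving the install directory from the release that was actually compiled against. The following is only a sketch of that idea, not the packaged trigger script: the function `install_built_modules`, its arguments, and the optional root prefix are hypothetical and exist so the step can be tried outside the real /lib/modules.

```shell
#!/bin/sh
# Sketch only: install freshly built modules into the tree of the kernel
# release they were compiled against, rather than a release fixed at
# package build time. All names here are hypothetical.
install_built_modules() {
    kver=$1      # e.g. $(make -sC /usr/src/linux-obj/$arch/$flavor kernelrelease)
    srcdir=$2    # directory holding the built nvidia*.ko files
    root=${3:-}  # optional prefix for a dry run in a scratch directory
    dest=$root/lib/modules/$kver/updates
    install -m 755 -d "$dest" &&
    install -m 644 "$srcdir"/nvidia*.ko "$dest" &&
    printf '%s\n' "$dest"
    # On a real system, "depmod $kver" would follow here.
}
```

With an empty root prefix and kver resolved to 4.15.4-1-default, the modules would land in /lib/modules/4.15.4-1-default/updates and leave the 4.15.2-1-default tree untouched.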
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c10

Stefan Dirsch <sndirsch@suse.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Flags|needinfo?(P.Suetterlin@royac.iac.es) |

--- Comment #10 from Stefan Dirsch <sndirsch@suse.com> ---
Ok. You're right. I was testing on Leap 42.3. Indeed there we have the rebuild in %post and in the %triggers. On TW only in the %triggers.

I also believe it's correct to install to /lib/modules/$kver/updates instead of the hardcoded path in the trigger scripts for TW.
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c11

--- Comment #11 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
(In reply to Stefan Dirsch from comment #10)
> Ok. You're right. I was testing on Leap 42.3. Indeed there we have the rebuild in %post and in the %triggers. On TW only in the %triggers.

One mystery solved :)

> I also believe it's correct to install to /lib/modules/$kver/updates instead of the hardcoded path in the trigger scripts for TW.

I think it's wrong in both TW and Leap.

kver=$(make -sC /usr/src/linux-obj/$arch/$flavor kernelrelease)

will always point to the latest installed kernel. This *can* be the one the modules were compiled for originally. In that case $kver=4.4.76-1-$flavor (in your case). But if it's different, it would overwrite the original one with a 'broken' one. Likely the error doesn't show up in Leap, as that one doesn't really do big kernel jumps.

In principle it's easy: If you compile against $kver, then also install in /lib/modules/$kver .....
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c12

--- Comment #12 from Stefan Dirsch <sndirsch@suse.com> ---
The reason the issue doesn't show up on Leap 42.3 is that we are kABI compatible there. And without using the fixed tree, the weak-updates mechanism wouldn't work.

On TW we're no longer kABI compatible.

Anyway, I've changed this now. But I haven't done any testing yet.

Mon Feb 26 16:22:07 UTC 2018 - sndirsch@suse.com

- rebuilded kernel modules in %trigger of TW packages should go to the tree against which the kernel module gets builded, not the hardcoded one during build of the package; introduced kmp-trigger.sh/kmp-trigger-old.sh script snippets for this based on kmp-post.sh/kmp-post-old.sh (boo#1082704)

--> obs://X11:Drivers:Video/nvidia-gfxG04

You can do a manual build, if you want (check the README file).
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c13

--- Comment #13 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
(In reply to Stefan Dirsch from comment #12)
> The reason the issue doesn't show up on Leap 42.3 is, because there we are kABI compatible. And without using the fixed tree weak-updates mechanism wouldn't work.
>
> On TW we're no longer kABI compatible.

Ah! Then you wouldn't use/need it in TW at all...

> Anyway, I've changed this now. But I haven't done any testing yet.
>
> Mon Feb 26 16:22:07 UTC 2018 - sndirsch@suse.com
>
> - rebuilded kernel modules in %trigger of TW packages should go to the tree against which the kernel module gets builded, not the hardcoded one during build of the package; introduced kmp-trigger.sh/kmp-trigger-old.sh script snippets for this based on kmp-post.sh/kmp-post-old.sh (boo#1082704)
>
> --> obs://X11:Drivers:Video/nvidia-gfxG04
>
> You can do a manually build, if you want (check the README file).

I had a look at kmp-post.sh; it looks fine to me.

I had just manually compiled the 4.15.2-1-default version of the modules, based on the old TW script, changing $kver and the related directory names (linux-obj -> linux-${kver%$flavor}obj). It went fine, and depmod $kver doesn't give errors.

Thanks!
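The directory rename mentioned in comment #13 relies on the shell's suffix-stripping expansion `${kver%$flavor}`. A small sketch of how that maps a release to the versioned object directory; the helper name `obj_dir_for` is made up, and the path layout is taken from the directories quoted earlier in this thread.

```shell
#!/bin/sh
# obj_dir_for is a hypothetical helper. It maps a kernel release, an
# architecture and a flavor to the versioned object directory, following the
# naming seen in this thread (/usr/src/linux-4.15.4-1-obj/x86_64/default).
obj_dir_for() {
    kver=$1
    arch=$2
    flavor=$3
    # ${kver%"$flavor"} strips the trailing flavor:
    #   4.15.2-1-default  ->  4.15.2-1-
    printf '/usr/src/linux-%sobj/%s/%s\n' "${kver%"$flavor"}" "$arch" "$flavor"
}
```

For example, `obj_dir_for 4.15.2-1-default x86_64 default` yields the object tree for the older kernel, which can then replace the /usr/src/linux-obj/$arch/$flavor symlink path in a manual build.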
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c14

--- Comment #14 from Stefan Dirsch <sndirsch@suse.com> ---
Probably you need kmp-trigger.sh. Thanks for giving it a try!
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c15

--- Comment #15 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
I'm not really familiar with obs - is there an easy way to get a real package based on this (or a src.rpm), or is it scheduled to be included in the official nvidia repo?

I'm currently holding back the update of my nvidia box (there's a kernel update), to check if things work properly....
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c16

--- Comment #16 from Stefan Dirsch <sndirsch@suse.com> ---
We cannot build the RPMs in obs, only provide the package sources, due to legal reasons. :-( So you would need to build the package yourself. See my comment #12. I cannot push the changes to NVIDIA before I have tested them myself.
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c17

--- Comment #17 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
Puh, took a while to master obs...

So I built a package now, nvidia-gfxG04-kmp-default-390.25_k4.15.5_1-0.x86_64.rpm

I installed it on my system, which still runs 4.15.4-1, and it correctly compiled the modules for this kernel and installed them in /lib/modules/4.15.4-1-default/updates/. No error from depmod. So far, so good!

However, it looks like the weak-updates script has messed up things with the older kernel that is still installed (4.15.2-1). That one did have (proper) modules in /lib/modules/4.15.2-1-default/updates/. After installing the new package, this directory had been deleted, and instead a weak-updates/updates directory had been created, with links to /lib/modules/4.15.5-1-default/updates/ (i.e., the ones that came with the rpm). Furthermore, those modules have been deleted, too (I don't know whether by the install or by the weak-update script, probably the latter), so the links go to nowhere :(

As you mentioned yourself, the weak-updates mechanism relies on kABI stability, which is not given for Tumbleweed. It probably should not be used at all?

(And while at it: would there be a reason not to use a 'make -j ' for compiling the modules in the trigger script?)
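Links that "go to nowhere", as described in comment #17, can be listed with a short loop. This is a minimal sketch using only POSIX `find` and `test`; the function name `list_dangling_links` is hypothetical.

```shell
#!/bin/sh
# list_dangling_links is a hypothetical helper: print every symlink below the
# given directory whose target does not exist.
list_dangling_links() {
    find "$1" -type l | while IFS= read -r link; do
        # [ -e ] follows the link, so it fails for a dangling one.
        [ -e "$link" ] || printf '%s\n' "$link"
    done
}
```

A call such as `list_dangling_links /lib/modules/4.15.2-1-default/weak-updates` would print one line per broken link and nothing if all targets exist.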
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c18

--- Comment #18 from Peter Sütterlin <P.Suetterlin@royac.iac.es> ---
Final progress note:

Just did a zypper dup (-> new kernel) with the self-compiled version installed. It builds and installs the module for the new kernel correctly, and this time did not touch the previously compiled ones of the older kernel.

(And I just realized that the reported removal of the 4.15.2 modules was of course because they were part of nvidia-gfxG04-kmp-default-390.25_k4.15.2_1-10.1, which gets removed by an update....)
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c19

--- Comment #19 from Stefan Dirsch <sndirsch@suse.com> ---
I agree and removed the weak-updates run on TW.

Wed Feb 28 13:50:50 UTC 2018 - sndirsch@suse.com

- do not run weak-updates on TW, since it creates more harm than benefit (boo#1082704)
http://bugzilla.opensuse.org/show_bug.cgi?id=1082704#c21

Sebastian Turzański <dpbasti@wp.pl> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dpbasti@wp.pl

--- Comment #21 from Sebastian Turzański <dpbasti@wp.pl> ---
I think this is the case again after the June 2021 updates. I get

kernel: nvidia: disagrees about version of symbol module_layout

but I only have the 5.12.9-1.1 version:

sudo rpm -qa |grep kernel
kernel-firmware-qcom-20210503-1.2.noarch
kernel-firmware-network-20210503-1.2.noarch
kernel-firmware-radeon-20210503-1.2.noarch
kernel-firmware-ath11k-20210503-1.2.noarch
kernel-firmware-realtek-20210503-1.2.noarch
kernel-firmware-sound-20210503-1.2.noarch
kernel-firmware-usb-network-20210503-1.2.noarch
kernel-firmware-qlogic-20210503-1.2.noarch
kernel-firmware-chelsio-20210503-1.2.noarch
kernel-firmware-bnx2-20210503-1.2.noarch
kernel-firmware-all-20210503-1.2.noarch
kernel-firmware-ti-20210503-1.2.noarch
kernel-firmware-marvell-20210503-1.2.noarch
kernel-firmware-atheros-20210503-1.2.noarch
kernel-firmware-liquidio-20210503-1.2.noarch
kernel-firmware-bluetooth-20210503-1.2.noarch
kernel-firmware-mediatek-20210503-1.2.noarch
kernel-firmware-serial-20210503-1.2.noarch
kernel-firmware-intel-20210503-1.2.noarch
kernel-firmware-nfp-20210503-1.2.noarch
kernel-firmware-nvidia-20210503-1.2.noarch
kernel-firmware-mellanox-20210503-1.2.noarch
kernel-firmware-dpaa2-20210503-1.2.noarch
kernel-firmware-iwlwifi-20210503-1.2.noarch
kernel-firmware-amdgpu-20210503-1.2.noarch
purge-kernels-service-0-8.1.noarch
kernel-firmware-i915-20210503-1.2.noarch
kernel-firmware-ueagle-20210503-1.2.noarch
kernel-firmware-brcm-20210503-1.2.noarch
kernel-firmware-mwifiex-20210503-1.2.noarch
kernel-firmware-media-20210503-1.2.noarch
kernel-firmware-ath10k-20210503-1.2.noarch
kernel-firmware-platform-20210503-1.2.noarch
kernel-syms-5.12.9-1.1.x86_64
kernel-firmware-prestera-20210503-1.2.noarch
kernel-devel-5.12.9-1.1.noarch
kernel-default-devel-5.12.9-1.1.x86_64
kernel-source-5.12.9-1.1.noarch
kernel-default-5.12.9-1.1.x86_64
kernel-macros-5.12.9-1.1.noarch