TW and NVIDIA proprietary drivers
Hello everyone, I am quite new to the openSUSE community, and I hope this is the right place to discuss this issue. Please let me know if it is not! Since the beginning of last year, when I switched to openSUSE Tumbleweed (from Debian testing/unstable), I have repeatedly run into issues with the NVIDIA proprietary drivers (something I was absolutely not used to coming from Debian), and I wanted to ask how best to deal with them. Apart from that, TW has so far been rock-solid for me!

For me (and many others) the problems started with kernel 5.9, which broke CUDA. At that time I was quite new to openSUSE and to rolling distros, and I made a couple of mistakes that wasted a lot of time. In particular, I did not know how to get back to an older kernel or driver version. For some reason I only noticed the problem well after the introduction of 5.9, and (I don't quite remember this part) either I no longer had any pre-5.9 kernel on the system or the older kernel had some other issue. Either way, I was stuck on 5.9 and couldn't get CUDA to work, which I depended on. (Also, why is CUDA not included in the TW repos?)

One thing that, as far as I am aware, complicated matters further is that after a driver or kernel upgrade or downgrade, the NVIDIA kernel modules were not correctly built for all of the installed kernels. Maybe there is some misunderstanding on my side about how this works, but with my limited openSUSE experience I always ran into some problem (such as a version mismatch) when booting an older kernel. Fortunately, after a very long time spent trying to get out of the mess I had got myself into, the 5.9-compatible driver was finally released (although there was some confusion about which version actually included the fix) and I could check off that box. I assumed this was a once-in-a-lifetime problem caused by a breaking change in the kernel that was unlikely to reappear any time soon, especially since I had learned some tricks along the way, such as how to use snapper and tumbleweed-cli, and felt prepared for future issues.

Unfortunately, I was wrong about this being a one-time thing. Now, with kernel 5.12 and NVIDIA 460.80, I have a new issue: NVENC is completely broken. And again I only noticed the problem when it was too late (my bad). As far as I am aware, the NVIDIA 460.80 driver was released on May 16, and my last snapper snapshot is from May 17, unless I want to roll back to November 2020. I am not entirely sure whether the kernel or the driver broke NVENC, but it doesn't matter, because with kernels I have a similar story: I have only two 5.12 versions installed plus an old 5.8 kernel that for some reason will not boot. It seems I am stuck with a broken combination again. End of rant/problem description.

Of course I know that some simple precautions, such as checking for malfunctions immediately after updates, would have saved me a lot of trouble, and I will definitely be more careful in the future. Still, I would like to hear your thoughts on how to better deal with this. What are the best practices when dealing with new driver and kernel versions? And when the system is already borked, what can be done? Snapper has saved me a couple of times already, but aren't there any other solutions? In particular, I think it should be possible to install an older kernel from the TW repo history, but can I also do this for the NVIDIA driver?
I'm not aware of any older version of the driver in the repos. For me as a newcomer, this really is a bit of a mess... Thank you in advance for your thoughts and help! Joe
Hi, On 22.05.21 at 13:47, Rhaytana wrote:
On 5/22/21 1:20 PM, Kays wrote:
* I installed the driver the hard way: https://en.opensuse.org/SDB:NVIDIA_the_hard_way .
* I follow the reports of the webmaster at http://rglinuxtech.com/ , who has
* I download the drivers from https://download.nvidia.com/XFree86/Linux-x86_64/ , and after every kernel update I expect to boot to a terminal and run *# sh NVIDIA-Linux-x86-etc.run* (followed by *mkinitrd*). A sketch of this sequence follows below.
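For anyone new to this, a minimal sketch of the update sequence described above might look like the following. The driver file name and version are placeholders; adjust to whatever you actually downloaded. It assumes you are at a text console, e.g. after booting with "3" appended to the kernel command line:

    # stop the graphical session so the old modules can be unloaded
    systemctl isolate multi-user.target
    # build and install the driver against the currently running kernel
    sh NVIDIA-Linux-x86_64-460.80.run
    # regenerate the initrd so the new modules are picked up at boot
    mkinitrd
    reboot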
This is exactly what I always do, and it has worked this way for me for many years. -- Best Regards | Freundliche Grüße | Cordialement | Cordiali Saluti | Atenciosamente | Saludos Cordiales *DI Rainer Klier* DevOps, Research & Development
On Saturday 22 May 2021, Kays wrote:
Hello everyone, ...
However, I would still like to hear your thoughts on how to better deal with this issue. What are the best practices when dealing with new driver and kernel versions? And when the system is already borked, what can be done? Snapper has saved me a couple of times already, but aren't there any other solutions? In particular, I think it should be possible to install an older kernel from the TW repo history, but can I also do this for the NVIDIA driver? I'm not aware of any older version of the driver in the repos. For me as a newcomer, this really is a bit of a mess...
Thank you in advance for your thoughts and help! Joe
You could set multiversion.kernels in /etc/zypp/zypp.conf and keep a kernel that is known to be good. It could be quite an old kernel (one from the last few months). I think something like the following should retain 5.10.3:

multiversion.kernels = 5.10.3,latest,latest-1,running

In the event of an issue, the YaST boot manager can be used to set 5.10.3 as the default boot entry. Then, at worst, you would have to boot into the old kernel and manually force a reinstall of the NVIDIA drivers, which can be done with the --force option to zypper, or via YaST by ticking the NVIDIA packages for update even though they are already installed.

You could also preserve a working NVIDIA driver in case the latest one is broken; you would then have a totally independent fallback. A process for preserving a driver is simple if it was installed the "hard way", but a preservation process for a driver installed the "easy way" is simple to set up too. For example, you could edit /etc/zypp/repos.d/NVIDIA.repo and set autorefresh=0 and keeppackages=1. The packages would then be retained in the cache: /var/cache/zypp/packages/NVIDIA

If the cache is currently empty, a forced install could be used to initialise it with the current driver. If you prefer to use YaST, Software Management -> Configuration -> Repositories can also be used to toggle these flags whenever necessary.

Michael
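To make that concrete, here is a rough sketch of the configuration described above. The kernel version and the G05 package names are examples only; check what is actually installed with "rpm -qa | grep -i nvidia":

    # /etc/zypp/zypp.conf -- keep a known-good kernel plus the newest ones
    multiversion = provides:multiversion(kernel)
    multiversion.kernels = 5.10.3,latest,latest-1,running

    # /etc/zypp/repos.d/NVIDIA.repo -- keep downloaded driver packages in the cache
    autorefresh=0
    keeppackages=1

    # force a reinstall of the driver packages (this also fills the cache if it is empty)
    sudo zypper install --force x11-video-nvidiaG05 nvidia-gfxG05-kmp-default

The "multiversion = provides:multiversion(kernel)" line is normally already present in zypp.conf; usually only the multiversion.kernels line needs editing.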
Dear Michael, thank you for your reply; I did not know about the keeppackages option at all! It sounds like a great way to keep older NVIDIA driver versions as a backup. Does it survive a zypper clean (-a)?

I already knew about the multiversion kernel option, and that's how I managed to keep the 5.8 kernel, but unfortunately it won't boot anymore for some reason, which is kind of weird. I think installing an older kernel from the TW package history (http://download.opensuse.org/history/) is the way to go in this case.

So, all in all, I think the problem could be easily solved with these three measures: keeping old packages from the NVIDIA repo, setting up multiversion kernels in the zypper config, and downgrading via the package history in case the multiversion setup somehow goes wrong.

The only open question for me now is: how do I fix the current situation, where I don't have a cached version of the NVIDIA driver? Unfortunately there seems to be no such version history for the NVIDIA repository. Is there still some way to get my hands on an older version (preferably without having to do the installation 'the hard way')?

Thanks a lot! Joe
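For reference, the package-history approach mentioned above might look roughly like this. The snapshot date, repository layout and kernel version are placeholders, so browse http://download.opensuse.org/history/ first to see what is actually available:

    # add a historical Tumbleweed snapshot as a temporary repo (alias and date are examples)
    sudo zypper addrepo http://download.opensuse.org/history/20210401/tumbleweed/repo/oss/ tw-history
    # install a specific older kernel from it; --oldpackage allows a downgrade
    sudo zypper install --oldpackage --from tw-history kernel-default=5.11.16-1.1
    # remove the temporary repo again afterwards
    sudo zypper removerepo tw-history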
On Monday 24 May 2021, Joe Kays wrote:
Dear Michael,
thank you for your reply, I totally did not know about the keeppackages option! It sounds like a great way to keep older NVIDIA driver versions as a backup. Does it survive a zypper clean (-a)?
I already knew the multikernel option, and that's how I managed to keep the 5.8 kernel, but unfortunately for some reason it won't boot anymore, which is kind of weird. I think installing an older kernel from the TW package history (http://download.opensuse.org/history/) is the way to go in this case.
So, all in all I think the problem could be easily solved with these 3 options: keeping old packages from the NVIDIA repo, setting up multikernel in the zypper config, and downgrading via package history in case multikernel somehow messes up.
Now the only open question for me is: How do I fix the current situation, where I don't have a cached version of the NVIDIA driver? Unfortunately it seems there is no such version history for the NVIDIA repository. Is there still some way to get my hands on an older version (preferably without having to do the installation 'the hard way')?
Thanks a lot! Joe
I haven't used it, but a Google search turns up: https://www.nvidia.com/en-us/drivers/unix/linux-amd64-display-archive/ I guess those could be installed "the hard way". Michael
Update for everyone else relying on NVENC: I was lucky enough to obtain the packages for the previous NVIDIA driver version, 460.73, and the issue is already present in that version, too. I was also able to confirm that the problem is not related to kernel 5.12, as 5.11 exhibits the same behaviour. I guess at this point it would make sense to file a bug report with NVIDIA, but ideally I would first want to confirm the latest working version, which it seems I can only do 'the hard way'.
Update of the update: I was wrong. From the beginning. The NVENC issue lies neither with the kernel nor with the NVIDIA driver. What a strange coincidence that two other programs I use with CUDA also have problems that are apparently completely unrelated to the driver. After many hours of trying different kernels and drivers I despaired and decided to roll back to my snapshot from November 2020 (kernel 5.9). The issue was gone. After carefully upgrading individual packages and checking whether the issue reappeared, I have now narrowed it down to some 50 packages from Packman. I believe the culprit is most likely ffmpeg or some library it depends on. If you want to check whether you are also affected, try this command on some video file:

    ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i video.mkv -c:v h264_nvenc -b:v 5M test.mp4

If you are affected, ffmpeg should complain about some CUDA error (if I remember correctly); if not, it should just convert the video without problems. In any case, I now feel more confident dealing with NVIDIA issues, so thanks everyone for the kind help!
Final update: It seems there are two different issues here.

One is with ffmpeg. My error message looks like this:

    [h264_nvenc @ 0x5557207b3f00] Lossless encoding not supported
    [h264_nvenc @ 0x5557207b3f00] Provided device doesn't support required NVENC features
    Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height
    [aac @ 0x5557207ebf80] Qavg: 15661.693
    [aac @ 0x5557207ebf80] 2 frames left in the queue on closing
    Conversion failed!

This points at this recently fixed bug, which seems to be hardware-specific: https://fftrac-bg.ffmpeg.org/ticket/9165 So far so good; let's hope it is really fixed in some future version.

The other problem is definitely with OBS Studio. While the old 26.0.2-3.2 from Packman works fine, every newer version I could try (either from OBS, the Open Build Service, or from Packman) gives terrible encoding lag when using the NVENC encoder. I guess for this problem I have to file a bug with OBS Studio directly... Joe
That's interesting stuff. I also dabble with ffmpeg and obs-studio [info: OBS 26.1.2-2 (linux)] in connection with Twitch. Not that I'm a "streamer", but I enjoy goofing around with the technology. I was trying to live-encode things with a Raspberry Pi to watch streams via a Chromecast, but it generally proved too intense for the 4GB Pi4 (even using omx/omx-rpi), so I just use a combo of youtube-dl and ffmpeg to "-copy" VODs in 30-minute chunks and watch them on demand; I'm not an interactive Twitch person, so syncing up with streamers and chat isn't really a thing for me.

I found your ffmpeg example command here: https://docs.nvidia.com/video-technologies/video-codec-sdk/ffmpeg-with-nvidi... and attempted to experiment with it using Arch's provided ffmpeg and one that I had just compiled from source. Neither works using that command as it's laid out; however, if I use only "-hwaccel cuda" and leave out "-hwaccel_output_format cuda", it completes without massive errors. What that changes, I'm not 100% sure, and it might not make a difference unless people compare converting the same video files.

Back to obs-studio: I already have it on my box with the 2070 Super in it and things seem fine there, at least if I have a terminal up and run "obs" from within it - it seems to be using NVENC without issue. This was the case for both Arch's /usr/bin/ffmpeg and my git /usr/local/bin/ffmpeg install (adjusting PATH and LD_LIBRARY_PATH). Firing it up, I see this:

    ... info: Loading up OpenGL on adapter NVIDIA Corporation GeForce RTX 2070 SUPER/PCIe/SSE2
    info: OpenGL loaded successfully, version 3.3.0 NVIDIA 460.80, shading language 3.30 NVIDIA via Cg compiler
    ...
    info: NVENC supported

I can record a video of my webcam and browser (with a YT video playing in it) without issue and it seems just fine. Again, I don't know if this _proves_ anything positive or negative, but I'm always interested in learning a bit more about this stuff and the expectations around how it's supposed to work, compared to the expectations of others (who know more about it than I do).
I dug into this a little more, using Tumbleweed this time (and obs installed through Packman, per the directions on the OBS Studio site for openSUSE), since that seems more useful to this discussion. This time with the following setup:

    ... info: Loading up OpenGL on adapter NVIDIA Corporation NVIDIA Quadro T1000/PCIe/SSE2
    info: OpenGL loaded successfully, version 3.3.0 NVIDIA 465.19.01, shading language 3.30 NVIDIA via Cg compiler
    ...

Building ffmpeg from source doesn't seem strictly necessary; the command I used produces the same result(s) with /usr/bin/ffmpeg and /usr/local/bin/ffmpeg. I would get a "[h264 @ 0x558c36c64ec0] No decoder surfaces left" error with "-hwaccel_output_format cuda" unless I added "-extra_hw_frames 2" (in some cases I had to bump it up to "3") to make that error go away. It seems I can leave that out entirely if I use "hevc_nvenc" instead of "h264_nvenc".

I also properly (I think) installed the v11.3.1 CUDA SDK, something I certainly didn't do on Arch. This took me from 465.31 down to 465.19.01, but I didn't see a version available for 465.31. https://developer.nvidia.com/cuda-downloads

So I don't know what the most important aspect of this is - maybe the CUDA SDK? - but it _seems_ like it's working in some capacity.
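To make the workaround above easier to reproduce, this is the command shape being described, with placeholder file names; the extra frame count of 2 (or 3) is simply whatever makes the "No decoder surfaces left" error disappear on a given machine:

    ffmpeg -hwaccel cuda -hwaccel_output_format cuda -extra_hw_frames 2 \
           -i video.mkv -c:v h264_nvenc -b:v 5M test.mp4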
I thought I'd share my experience in case it helps, but more so just to raise another hand in the NVIDIA + TW struggle. I'm currently using kernel 5.12.4-1 and NVIDIA 465.31. I've fought with this since 5.12.x came out (https://bugzilla.opensuse.org/show_bug.cgi?id=1185601 describes my situation); I wasn't using anything NVIDIA until 5.12 rendered my display (be it via i915 or nouveau) non-functional and I thought maybe the NVIDIA drivers would solve the problem.

I've decided that I won't be using the NVIDIA repo, as it seems to be tied to a specific kernel (5.12.0-2) that can't be guaranteed to be the up-to-date one for anyone who does 'zypper dup' on a daily basis. I only had success with the repo by installing from a USB flash drive (no network enabled at install, no online repos), which happened to have 5.11.6 on it (the latest TW kernel the day I put it on the flash drive), then version-locking kernel-default.x86_64 / kernel-default-devel.x86_64 / kernel-devel.noarch / kernel-macros.noarch at 5.11.6 and installing from the NVIDIA repo at that kernel level. I feel like this only worked because I was below 5.12.0-2 (which seems to be hard-coded as the maximum version the repo drivers will install for).

Long story (which I won't bore anyone reading this with) short, I'm opting to keep an as-up-to-date-as-possible NVIDIA-*.run file (from https://download.nvidia.com/XFree86/Linux-x86_64) saved on disk, and as kernels are updated I'll run it against them; if it doesn't work, I'll keep my eyes peeled for a more up-to-date NVIDIA-*.run. I can easily tell when things aren't working in 5.12.x (with nouveau blacklisted), as I never get a lightdm login screen and I get a blinking "_" under "Starting Login Service". I then Alt-F4, log in, run NVIDIA-*.run against that kernel, reboot, and 99.9 times out of 100 I'm back in business. I was also keeping an eye on /usr/src/linux and making sure that symlink was pointing to the kernel I was interested in; a few times it wasn't, and I'm not sure whether that hurt my efforts in any cases.

I was hoping a 'dracut --force --kver [TAB for kernel]' would allow me to have CURRENT and PREVIOUS both using NVIDIA drivers, but I've come to the conclusion that I'll just have one that works: I can choose which one, but no other kernel will have working NVIDIA drivers besides the one I've chosen. Basically, I was able to get 5.12.4 (with 465.31) to work; I was able to get 5.11.16 (with 465.31/460.80) to work; it was impossible to get 5.12.4 AND 5.11.16 to both work, even when specifying one of them explicitly via dracut (vs. "mkinitrd" for everything). I'm sure it has something to do with the NVIDIA-*.run file basically wiping anything that's not the current one you're building against; I'm sure that's what the initial disclaimer is hinting at.
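For anyone wanting to try the same version-lock approach, a minimal sketch is below. The lock simply pins whatever kernel packages are currently installed, and the kernel version in the dracut line is only an example:

    # prevent zypper dup from replacing the currently installed kernel packages
    sudo zypper addlock kernel-default kernel-default-devel kernel-devel kernel-macros
    # list / remove the locks later, when you are ready to move on
    zypper locks
    sudo zypper removelock kernel-default kernel-default-devel kernel-devel kernel-macros
    # rebuild the initrd for one specific installed kernel
    sudo dracut --force --kver 5.11.16-1-default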
On Monday, 24 May 2021 5:03:50 AM ACST Scott Bradnick wrote:
[...] I was hoping a 'dracut --force --kver [TAB for kernel]' would allow me to have CURRENT and PREVIOUS both using Nvidia drivers, but I've come to the determination that I'll just have 1 that works - I can choose which one, but NO OTHER KERNEL will have working Nvidia drivers other than the one I've chosen. Basically, I was able to get 5.12.4 (with 465.31) to work ; I was able to get 5.11.16 (with 465.31/460.80) to work ; it was impossible to get 5.12.4 AND 5.11.16 to both work even when specifying one of them specifically via dracut (vs "mkinitrd" for everything). I'm sure it has something to do w/ the NVIDIA-*.run file basically wiping anything that's not the current one you're building against - I'm sure that's what the initial disclaimer is hinting at.
I've had this working since at least oS 13.0 (or earlier), using dkms to build and install the modules for alternative kernels. It can either be triggered from the CLI or happen automatically at boot (but you have to install and enable dkms first).

Unfortunately this has broken with 5.12.4: I cannot get either 460.80 or 465.31 to build successfully against 5.12.4 with the latest (as of yesterday) TW snapshot. It appears that something has changed with the gcc configuration, and the build now fails with compile errors against multiple source files (at least on my system, anyway).

I did get 465.31 to build OK against 5.11.15-1-default, so I'm back on that one until the issue with 5.12.4 is resolved. I don't know if there's an open bug report with NVIDIA on that; there wasn't when I checked on the weekend.

Regards,
--
Rodney Baker rodney.baker@iinet.net.au
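For anyone who wants to try the dkms route described above, a rough sketch follows. The driver file name is a placeholder, and --dkms is the installer's own option for registering the module with dkms:

    sudo zypper install dkms
    sudo sh NVIDIA-Linux-x86_64-465.31.run --dkms
    # see which module versions are built for which kernels
    dkms status
    # build and install the registered modules for another installed kernel (version is an example)
    sudo dkms autoinstall -k 5.11.16-1-default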
On Tuesday, 25 May 2021 12:43:44 AM ACST Rodney Baker wrote:
On Monday, 24 May 2021 5:03:50 AM ACST Scott Bradnick wrote:
[...]
I was hoping a 'dracut --force --kver [TAB for kernel]' would allow me to have CURRENT and PREVIOUS both using Nvidia drivers, but I've come to the determination that I'll just have 1 that works - I can choose which one, but NO OTHER KERNEL will have working Nvidia drivers other than the one I've chosen. Basically, I was able to get 5.12.4 (with 465.31) to work ; I was able to get 5.11.16 (with 465.31/460.80) to work ; it was impossible to get 5.12.4 AND 5.11.16 to both work even when specifying one of them specifically via dracut (vs "mkinitrd" for everything). I'm sure it has something to do w/ the NVIDIA-*.run file basically wiping anything that's not the current one you're building against - I'm sure that's what the initial disclaimer is hinting at.
I've had this working since at least oS 13.0 (or earlier), using dkms to build and install the modules for alternative kernels. It can either be triggered from the CLI or happen automatically at boot (but you have to install and enable dkms first).
Unfortunately this has broken with 5.12.4 - I cannot get either 460.80 or 465.31 to build successfully against 5.12.4 with the latest (as of yesterday) TW snapshot - it appears that something has changed with gcc config and the build now fails with compile errors against multiple source files (at least on my system, anyway).
I did get 465.31 to build OK against 5.11.15-1-default, so I'm back on that one until the issue with 5.12.4 is resolved. I don't know if there's an open bug report with NVIDIA on that; there wasn't when I checked on the weekend.
Regards,
Oh, yes, with the gcc version update it also failed the compiler version check. To get it to build against 5.11.15-1-default I had to run the installer with the --no-cc-version-check option; then it built successfully. If one of your kernels was built with a different version of gcc than what's installed on your system, that may be part of the issue (I don't know, just a suggestion).
--
Rodney Baker rodney.baker@iinet.net.au
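For reference, the workaround mentioned above would be invoked roughly like this (the driver file name is a placeholder):

    sudo sh NVIDIA-Linux-x86_64-465.31.run --no-cc-version-check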
That makes sense. dkms is something I've used sparingly in the past, because I haven't developed the useful habit of keeping multiple kernels with proprietary video drivers, mostly out of laziness and gaming on Windows. I do have a desktop with a 2070 Super in it running Manjaro and NVIDIA drivers, but that's handled via some other mechanism outside of dkms (I'm using a package named "linux510-nvidia 460.80-1", and I believe I haven't updated that box's 5.x kernel release in quite a while - I recall Manjaro has a different kernel management interface than Arch, so I didn't keep up with it after going from 5.4 to 5.10). Other than that, the only dkms module I have is wireguard on 5.4.

Great suggestion though; dkms will be the avenue I go down if I need NVIDIA working in multiple kernels (I'll most likely stay on 5.12.4, or the latest, so long as it's working) or want to experiment again to further ease management or build up my understanding of kernels and NVIDIA playing nice.
participants (8)
- Joe Kays
- Kays
- Michael Hamilton
- Michael Hamilton
- Rainer Klier
- Rhaytana
- Rodney Baker
- Scott Bradnick