GPU


Bill Swisher

1 Jan 2025

23:14

I throw a lot of hardware at boinc. I'm running the 7.24.1 version of the boinc-client. It tells me there's an 8.0.2 version available, but it isn't out there for openSUSE. In any event it doesn't pick up my old GPU. Or at least I think I have a GPU. Here's what I've gleaned so far:

From the boinc-client log via Yast Services Manager

...GPU detection failed: process was terminated by signal 32
...read_coproc_info_file() returned error -108
...No usable GPUs found
==============================================
sudo dmesg | grep -i gpu

[ 3.996578] [drm] [nvidia-drm] [GPU ID 0x00000600] Loading driver
[ 5.354894] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device SAMSUNG (HDMI-0)
[ 5.385052] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device HDMI-0
[ 19.298032] nvidia-gpu 0000:06:00.3: i2c timeout error e0000000
==============================================
Snippets from inxi -F

Kernel: 6.4.0-150600.23.30-default
AMD Ryzen 9 3900X
Graphics: Device-1: NVIDIA TU104 [GeForce RTX 2060] driver: nvidia v: 550.135

I have an even older Threadripper 2950X, with a GeForce GT 1030, which does essentially the same thing, and 3 newer Ryzen 5700G which, while giving different messages, don't show a working GPU with the boinc-client.

I haven't found a place where I can address this with whoever handles the creation of the boinc-client for openSUSE, find newer software (that the client is telling me is out there) or adjust the settings (if there are any). Just wondering what my next step is.


Darryl Gregorash

5 Jan

17:08

On 2025-01-01 17:14, Bill Swisher wrote:

...

I throw a lot of hardware at boinc. I'm running the 7.24.1 version of the boinc-client. It tells me there's an 8.0.2 version available, but it isn't out there for openSUSE.

This is not a boinc error. Except for the GPU (I have an RTX 3050), my system is identical to yours. It is working just fine.

...

In any event it doesn't pick up my old GPU. Or at least I think I have a GPU. Here's what I've gleaned so far:

Of course you have a GPU. If you didn't, you wouldn't have any output on your monitor. In fact, I think most modern motherboards won't even boot if the video card is missing.

...

From the boinc-client log via Yast Services Manager

Say what? First off, you can't view any files from the services manager. All you can do is start/stop services, and configure them to start on boot, or on demand.

The boinc log file you want to look at is /var/lib/boinc/stdoutdae.txt. It is created every time boinc is started. The first few lines should look like this:

27-Dec-2024 16:51:43 [---] Starting BOINC client version 7.24.1 for x86_64-suse-linux-gnu
27-Dec-2024 16:51:43 [---] log flags: file_xfer, sched_ops, task, cpu_sched
27-Dec-2024 16:51:43 [---] Libraries: libcurl/8.6.0 OpenSSL/3.1.4 zlib/1.2.13 brotli/1.0.7 zstd/1.5.5 libidn2/2.2.0 libpsl/0.20.1 libssh/0.9.8/openssl/zlib nghttp2/1.40.0 OpenLDAP/2.4.46
27-Dec-2024 16:51:43 [---] Data directory: /var/lib/boinc
27-Dec-2024 16:51:44 [---] CUDA: NVIDIA GPU 0: NVIDIA GeForce RTX 3050 (driver version 550.99, CUDA version 12.4, compute capability 8.6, 7966MB, 7966MB available, 9329 GFLOPS peak)
27-Dec-2024 16:51:44 [---] OpenCL: NVIDIA GPU 0: NVIDIA GeForce RTX 3050 (driver version 550.135, device version OpenCL 3.0 CUDA, 7966MB, 7966MB available, 9329 GFLOPS peak)
27-Dec-2024 16:51:44 [---] libc: version 2.38
27-Dec-2024 16:51:44 [---] Host name: hadron

This is why I say this is not a boinc error:

...

sudo dmesg | grep -i gpu

[ 3.996578] [drm] [nvidia-drm] [GPU ID 0x00000600] Loading driver
[ 5.354894] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device SAMSUNG (HDMI-0)
[ 5.385052] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device HDMI-0
[ 19.298032] nvidia-gpu 0000:06:00.3: i2c timeout error e0000000

That is all happening during boot, not when boinc is starting up. Because this is happening on several systems with different GPUs installed, it suggests to me that there is something wrong with the nvidia package(s) you installed. Which repo did you get them from? I suggest using https://download.nvidia.com/opensuse/leap/$releasever ($releasever is a repository variable that zypper replaces with the release version currently running on the system, in this case 15.6).

Open up the YaST repository manager, highlight the NVidia repo, and if necessary edit the URL to the above. Save, and exit.

Now start up the Software manager, open the Repositories tab and find/select the NVidia repo. Right click in the right-hand panel, where all the packages are listed. In the popup, scroll down to "All in this list", then select "Update unconditionally". Click on "Accept" to re-install the NVidia drivers.

Once done, you will need to reboot the system. Then rerun "sudo dmesg|grep -i gpu" to see if it is working now. If it is, you should only get one line of output:

[ 6.492015] [drm] [nvidia-drm] [GPU ID 0x00000400] Loading driver

This should work on the system with the RTX 2060. I don't know if it will work with the GT 1030, but I think it should. It will definitely NOT work with the 5700G systems, because those are from AMD. You will need to find the proper drivers from AMD -- but good luck getting them to work. I picked up an AMD video card and spent nearly 2 weeks trying to get it to work with boinc; then someone convinced me to go NVidia, so I returned the AMD card and had the NVidia replacement up and running within 2 hours -- most of which was spent reading so I got the procedure right.
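The post-reboot check at the end of the message above could be scripted roughly as follows. This is only a sketch and not something posted in the thread: the function name `gpu_dmesg_check` is hypothetical, and it applies a looser test than "exactly one line of output" by flagging any GPU line that mentions a warning or an error. It reads the dmesg text on stdin, so it can also be tried against saved output.

```shell
# Hypothetical helper (not from the thread): reads dmesg text on stdin,
# keeps the lines that mention "gpu" (case-insensitive), and reports
# whether any of those lines contains a warning or an error.
gpu_dmesg_check() {
    gpu_lines=$(grep -i 'gpu')
    if [ -z "$gpu_lines" ]; then
        echo "no GPU lines found"
        return 2
    fi
    if printf '%s\n' "$gpu_lines" | grep -Eiq 'warning|error'; then
        echo "GPU lines contain warnings or errors"
        return 1
    fi
    echo "GPU lines look clean"
    return 0
}
```

On a live system it would be fed with `sudo dmesg | gpu_dmesg_check`. A clean result only says that the kernel log has no GPU complaints; it does not prove that boinc can use the card.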

Bill Swisher

21:14

On 1/5/25 10:08, Darryl Gregorash wrote:

...

...
From the boinc-client log via Yast Services Manager

Say what? First off, you can't view any files from the services manager. All you can do is start/stop services, and configure them to start on boot, or on demand.

Thanks for the reply. To continue, Start Yast, Services Manager, click on boinc-client (just to highlight it), there are options to "Show Details" and "Show Log". What I wrote came from the log entry.

...

The boinc log file you want to look at is /var/lib/boinc/stdoutdae.txt. It is created every time boinc is started.

I have the files stderrgpudetect.txt and stdoutgpudetect.txt

The first file is full of lines reading:

NVIDIA drivers present but no GPUs found
ATI: libaticalrt.so: cannot open shared object file: No such file or directory
clGetPlatformIDs() failed to return any OpenCL platforms

The second is full, like 4 years' worth, of lines reading:

03-Jan-2025 13:52:42 [---] cc_config.xml not found - using defaults

...

The boinc log file you want to look at is /var/lib/boinc/stdoutdae.txt. It is created every time boinc is started. The first few lines should look like this:

No such file. I ran "grep -iR libcurl *" and "grep -iR ssl *" and found nothing like your example...

...

27-Dec-2024 16:51:43 [---] Libraries: libcurl/8.6.0 OpenSSL/3.1.4 zlib/1.2.13 brotli/1.0.7 zstd/1.5.5 libidn2/2.2.0 libpsl/0.20.1 libssh/0.9.8/openssl/zlib nghttp2/1.40.0 OpenLDAP/2.4.46

But this was found:

bill@lusca:~> sudo journalctl --since "2025-01-05 13:20:04" --until "2025-01-05 13:20:17" | grep -i boinc
Jan 05 13:20:04 lusca boinc[1906]: 05-Jan-2025 13:20:04 [---] Starting BOINC client version 7.24.1 for x86_64-suse-linux-gnu
Jan 05 13:20:04 lusca boinc[1906]: 05-Jan-2025 13:20:04 [---] log flags: file_xfer, sched_ops, task
Jan 05 13:20:05 lusca boinc[1906]: 05-Jan-2025 13:20:05 [---] Libraries: libcurl/8.6.0 OpenSSL/3.1.4 zlib/1.2.13 brotli/1.0.7 zstd/1.5.5 libidn2/2.2.0 libpsl/0.20.1 libssh/0.9.8/openssl/zlib nghttp2/1.40.0 OpenLDAP/2.4.46
Jan 05 13:20:05 lusca boinc[1906]: 05-Jan-2025 13:20:05 [---] Data directory: /var/lib/boinc
Jan 05 13:20:16 lusca boinc[1906]: 05-Jan-2025 13:20:16 [---] GPU detection failed: process exited with status 82: Attempting to link in too many shared libraries
Jan 05 13:20:16 lusca boinc[1906]: read_coproc_info_file() returned error -108
Jan 05 13:20:16 lusca boinc[1906]: 05-Jan-2025 13:20:16 [---] No usable GPUs found
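Picking the GPU-related lines out of boinc's startup output, as done by eye above, can be sketched with a small filter. The function name `boinc_gpu_lines` is hypothetical; the patterns are simply the message fragments that appear in the log excerpts quoted in this thread, so other client versions may word things differently.

```shell
# Hypothetical helper (not from the thread): reads BOINC client log text
# on stdin and prints only the lines that report the outcome of GPU
# detection. The patterns are taken from the log excerpts in this thread.
boinc_gpu_lines() {
    grep -E 'GPU detection failed|read_coproc_info_file|No usable GPUs found|CUDA: |OpenCL: '
}
```

It can be fed from the journal, for example `sudo journalctl -b | grep -i boinc | boinc_gpu_lines`, or from a log file with `boinc_gpu_lines < /var/lib/boinc/stdoutdae.txt` where that file exists.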

...

Which repo did you get them from? I suggest using https://download.nvidia.com/opensuse/leap/$releasever

It's there and enabled.

...

Now start up the Software manager, open the Repositories tab and find/select the NVidia repo. Right click in the right-hand panel, where all the packages are listed. In the popup, scroll down to "All in this list", then select "Update unconditionally". Click on "Accept" to re-install the NVidia drivers.

OK, did this and did a reboot. Everything is the same, with the exception of that line about too many libraries, which is new. Meaning it wasn't there yesterday, or earlier today for that matter.

...

the 5700G systems

Not a major problem. These little critters I put together because...well they're small physically (I carry one around with me in my carry-on bag and use it as my desktop wherever I am), have only 3 moving parts (on/off switch and 2 fans), max out at 200W and are pretty cheap (the last 2 I put together cost about $475US each, that has gone up).

Darryl Gregorash

9 Jan

17:01

On 2025-01-05 15:14, Bill Swisher wrote:

...

On 1/5/25 10:08, Darryl Gregorash wrote:

...
...
From the boinc-client log via Yast Services Manager Say what? ....

Thanks for the reply. To continue, Start Yast, Services Manager, click on boinc-client (just to highlight it), there are options to "Show Details" and "Show Log". What I wrote came from the log entry.

Wow, I don't think I ever even saw that before now. Thanks for pointing it out, though there isn't very much in that log, whatever/wherever it is.

...

I have the files stderrgpudetect.txt and stdoutgpudetect.txt The first file is full of lines reading:

These are of no importance to us here. We already know, from the dmesg snippet, that the nvidia drivers are failing to load properly.

...

...
The boinc log file you want to look at is /var/lib/boinc/stdoutdae.txt. It is created every time boinc is started. The first few lines should look like this:

No such file. <snip>

OK, you are not running boinc as a systemd service. If you were, that file would exist, along with /var/lib/boinc/stderrdae.txt

...

But this was found:

bill@lusca:~> sudo journalctl --since "2025-01-05 13:20:04" --until "2025-01-05 13:20:17" | grep -i boinc
Jan 05 13:20:04 lusca boinc[1906]: 05-Jan-2025 13:20:04 [---] Starting BOINC client version 7.24.1 for x86_64-suse-linux-gnu
Jan 05 13:20:04 lusca boinc[1906]: 05-Jan-2025 13:20:04 [---] log flags: file_xfer, sched_ops, task
Jan 05 13:20:05 lusca boinc[1906]: 05-Jan-2025 13:20:05 [---] Libraries: libcurl/8.6.0 OpenSSL/3.1.4 zlib/1.2.13 brotli/1.0.7 zstd/1.5.5 libidn2/2.2.0 libpsl/0.20.1 libssh/0.9.8/openssl/zlib nghttp2/1.40.0 OpenLDAP/2.4.46
Jan 05 13:20:05 lusca boinc[1906]: 05-Jan-2025 13:20:05 [---] Data directory: /var/lib/boinc

Compare those 4 lines with the first 4 lines of what I posted. Anyway, none of this is important in solving the problem of the GPU error. I haven't made much progress on this. I guess next it's off to the NVidia forums to see if there is anything there. I'll start a new sub-thread for the hardware stuff.

...

Jan 05 13:20:16 lusca boinc[1906]: 05-Jan-2025 13:20:16 [---] GPU detection failed: process exited with status 82: Attempting to link in too many shared libraries
Jan 05 13:20:16 lusca boinc[1906]: read_coproc_info_file() returned error -108
Jan 05 13:20:16 lusca boinc[1906]: 05-Jan-2025 13:20:16 [---] No usable GPUs found

...
Which repo did you get them from? I suggest using https://download.nvidia.com/opensuse/leap/$releasever

It's there and enabled.

...
Now start up the Software manager, open the Repositories tab and find/select the NVidia repo. Right click in the right-hand panel, where all the packages are listed. In the popup, scroll down to "All in this list", then select "Update unconditionally". Click on "Accept" to re-install the NVidia drivers.

OK, did this and did a reboot. Everything is the same, with the exception of that line about too many libraries, which is new. Meaning it wasn't there yesterday, or earlier today for that matter.

...
the 5700G systems

Not a major problem. These little critters I put together because...well they're small physically (I carry one around with me in my carry-on bag and use it as my desktop wherever I am), have only 3 moving parts (on/off switch and 2 fans), max out at 200W and are pretty cheap (the last 2 I put together cost about $475US each, that has gone up).

Darryl Gregorash

11 Jan

21:00

On 2025-01-01 17:14, Bill Swisher wrote:

...

<snip>

I checked on the NVidia website, for both the 2060 and 1030 chips. The 550.xxx drivers will work with both, leading me to suspect even more strongly that you initially installed the drivers from a corrupted source -- something which the unconditional update did not correct. Before we proceed to major surgery on your systems, it would be helpful to know if this is a new problem; that is, were the systems working before, and suddenly stopped working? If so, what is the last system update you did before everything stopped working?

...

sudo dmesg | grep -i gpu

The results of "sudo dmesg|grep -i nvidia" will probably be more useful here. Please show these for both the systems with a NVidia GPU.

Bill Swisher

22:08

On 1/11/25 14:00, Darryl Gregorash wrote:

...

I checked on the NVidia website, for both the 2060 and 1030 chips. The 550.xxx drivers will work with both, leading me to suspect even more strongly that you initially installed the drivers from a corrupted source -- something which the unconditional update did not correct.

Pfffttt...one machine is 7 years old and the other a mere 5. Who knows what happened that far in the past. But I don't install 3rd party software, meaning everything comes from a recognized repository.

...

Before we proceed to major surgery on your systems, it would be helpful to know if this is a new problem; that is, were the systems working before, and suddenly stopped working?

Hard to say. I just had an urge to look into whether or not the GPU was working. Truthfully both machines just sit there, running boinc (100% CPU utilization 24X365), act as NFS servers in their spare time, and generate heat.

...

...
sudo dmesg | grep -i gpu

The closest machine:

bill@lusca:~> sudo dmesg | grep -i nvidia
[ 1.977818] nvidia: loading out-of-tree module taints kernel.
[ 1.977827] nvidia: module license 'NVIDIA' taints kernel.
[ 1.977832] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1.977833] nvidia: module license taints kernel.
[ 2.092983] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 2.094654] nvidia 0000:06:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
[ 2.143386] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.135 Thu Nov 14 00:09:29 UTC 2024
[ 3.816308] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 3.957318] nvidia-uvm: Loaded the UVM driver, major device number 239.
[ 4.105648] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.135 Wed Nov 13 23:35:58 UTC 2024
[ 4.107786] [drm] [nvidia-drm] [GPU ID 0x00000600] Loading driver
[ 4.887261] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:06:00.0 on minor 0
[ 4.887504] nvidia 0000:06:00.0: vgaarb: deactivate vga console
[ 4.887664] nvidia 0000:06:00.0: [drm] Cannot find any crtc or sizes
[ 21.235832] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:06:00.1/sound/card0/input18
[ 21.235965] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:06:00.1/sound/card0/input19
[ 21.237422] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:06:00.1/sound/card0/input20
[ 21.237564] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:06:00.1/sound/card0/input21
[ 21.812014] nvidia-gpu 0000:06:00.3: i2c timeout error e0000000
[ 23.328153] audit: type=1400 audit(1736351820.860:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1200 comm="apparmor_parser"
[ 23.328159] audit: type=1400 audit(1736351820.860:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1200 comm="apparmor_parser"

And the further one, which I would be loath to fiddle with:

bill@kraken:~> sudo dmesg | grep -i nvidia
[ 2.353075] nvidia: loading out-of-tree module taints kernel.
[ 2.353095] nvidia: module license 'NVIDIA' taints kernel.
[ 2.353102] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.353105] nvidia: module license taints kernel.
[ 2.473720] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 2.475573] nvidia 0000:41:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
[ 2.591459] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.135 Thu Nov 14 00:09:29 UTC 2024
[ 5.885821] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 6.050572] nvidia-uvm: Loaded the UVM driver, major device number 239.
[ 6.180739] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.135 Wed Nov 13 23:35:58 UTC 2024
[ 6.183845] [drm] [nvidia-drm] [GPU ID 0x00004100] Loading driver
[ 7.033062] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:41:00.0 on minor 0
[ 7.033219] nvidia 0000:41:00.0: vgaarb: deactivate vga console
[ 7.079866] fbcon: nvidia-drmdrmfb (fb0) is primary device
[ 7.152251] nvidia 0000:41:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
[ 13.369556] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:03.1/0000:41:00.1/sound/card1/input17
[ 13.369653] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:03.1/0000:41:00.1/sound/card1/input18
[ 13.369719] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:03.1/0000:41:00.1/sound/card1/input19
[ 13.369806] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:03.1/0000:41:00.1/sound/card1/input20
[ 15.438400] audit: type=1400 audit(1736607344.783:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1409 comm="apparmor_parser"
[ 15.438407] audit: type=1400 audit(1736607344.783:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1409 comm="apparmor_parser"

Darryl Gregorash

12 Jan

00:01

On 2025-01-11 16:08, Bill Swisher wrote:

...

On 1/11/25 14:00, Darryl Gregorash wrote:

...
I checked on the NVidia website, for both the 2060 and 1030 chips. The 550.xxx drivers will work with both, leading me to suspect even more strongly that you initially installed the drivers from a corrupted source -- something which the unconditional update did not correct.

Pfffttt...one machine is 7 years old and the other a mere 5. Who knows what happened that far in the past. But I don't install 3rd party software, meaning everything comes from a recognized repository.

The actual repository you use is not relevant; you are actually pulling packages in off a mirror of that repository. It is possible (remotely) that the mirror you were sent to was not fully updated when you did the most recent update to the NVidia packages.

...
...
sudo dmesg | grep -i gpu

The closest machine:

bill@lusca:~> sudo dmesg | grep -i nvidia
[ 1.977818] nvidia: loading out-of-tree module taints kernel.
[ 1.977827] nvidia: module license 'NVIDIA' taints kernel.
[ 1.977832] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1.977833] nvidia: module license taints kernel.
[ 2.092983] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 2.094654] nvidia 0000:06:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=io+mem
[ 2.143386] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 550.135 Thu Nov 14 00:09:29 UTC 2024
[ 3.816308] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 3.957318] nvidia-uvm: Loaded the UVM driver, major device number 239.
[ 4.105648] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 550.135 Wed Nov 13 23:35:58 UTC 2024
[ 4.107786] [drm] [nvidia-drm] [GPU ID 0x00000600] Loading driver
[ 4.887261] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:06:00.0 on minor 0
[ 4.887504] nvidia 0000:06:00.0: vgaarb: deactivate vga console
[ 4.887664] nvidia 0000:06:00.0: [drm] Cannot find any crtc or sizes

Two things I noticed:
1. All of the above is exactly the same as on my system when I run the same command.
2. Before the unconditional update, you were seeing these two entries when you ran "dmesg|grep -i gpu":

...

[ 5.354894] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device SAMSUNG (HDMI-0)
[ 5.385052] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device HDMI-0

These would also show up with "dmesg|grep -i nvidia", but are not present in the above.
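Comparing dmesg output between two boots or two machines, as is done by eye here, is easier once the timestamps are removed, because they differ on every boot. The following is a minimal sketch; the function name `strip_dmesg_ts` is hypothetical and the pattern assumes the default `[ seconds.micros]` prefix seen in the excerpts in this thread.

```shell
# Hypothetical helper (not from the thread): removes the leading
# "[ seconds.micros] " timestamp from each dmesg line read on stdin,
# so that output from two boots or two hosts can be compared with diff.
strip_dmesg_ts() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] //'
}
```

One possible use is to save `sudo dmesg | grep -i nvidia | strip_dmesg_ts` to a file before and after a driver reinstall and then run `diff` on the two files. Lines that differ only in device numbers or PCI addresses will still show up as differences.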

Could you please stop/start boinc (use the actual stop and start commands, not restart or reload)? Then check the boinc log to see if there is still a line near the top saying "Attempting to link in too many shared libraries"

...

And the further one, which I would be loath to fiddle with:

Leave that one alone until you get home. It's not going anywhere.

Bill Swisher

15:47

On 1/11/25 17:01, Darryl Gregorash wrote:

...

Two things I noticed:
1. All of the above is exactly the same as on my system when I run the same command.
2. Before the unconditional update, you were seeing these two entries when you ran "dmesg|grep -i gpu":

...
[ 5.354894] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device SAMSUNG (HDMI-0)
[ 5.385052] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device HDMI-0

These would also show up with "dmesg|grep -i nvidia", but are not present in the above.

Both computers are running essentially headless. There's a "monitor" attached, sorta. The local machine uses an HDMI switch to connect the 4 computers to what is actually a TV, but seldom used that way. I ssh over to it when I need to change something. The distant machine is physically connected to a monitor that's powered down and is pretty much handled the same way. Lots of turned off wireless keyboards/mice lying about.

...

"Attempting to link in too many shared libraries"

Nope, that appears to have been a one-time entry. I've stopped things and rebooted. Then I stopped things, powered down and booted. It never reappeared.
