Bug ID 1231312
Summary jc42 not retaining devices after reboot, related i2c issues
Classification openSUSE
Product openSUSE Tumbleweed
Version Current
Hardware x86-64
OS openSUSE Tumbleweed
Status NEW
Severity Normal
Priority P5 - None
Component Kernel:Drivers
Assignee kernel-bugs@suse.de
Reporter pallaswept@proton.me
QA Contact qa-bugs@suse.de
Target Milestone ---
Found By ---
Blocker ---

I noticed today that since kernel 6.11 my DDR4 temperature sensors have been
absent from applications like `sensors` and CoolerControl.

The jc42 kernel module, which normally provides these hwmon sensors, is loaded,
and searching the journal for jc42 shows no errors. However, the usual log
messages from coolercontrol, where it lists the devices it has derived from
hwmon, are absent.

Looking at my sensors conf file and configuration logs, I know that the
controller for these devices was i2c-6. I notice that this is now occupied by
the NVIDIA GPU:

> sudo i2cdetect -l
i2c-0   i2c             Synopsys DesignWare I2C adapter         I2C adapter
i2c-1   smbus           SMBus PIIX4 adapter port 0 at 0b00      SMBus adapter
i2c-2   smbus           SMBus PIIX4 adapter port 2 at 0b00      SMBus adapter
i2c-3   smbus           SMBus PIIX4 adapter port 1 at 0b20      SMBus adapter
i2c-4   i2c             NVIDIA i2c adapter 1 at 7:00.0          I2C adapter
i2c-5   i2c             NVIDIA i2c adapter 5 at 7:00.0          I2C adapter
i2c-6   i2c             NVIDIA i2c adapter 6 at 7:00.0          I2C adapter
i2c-7   i2c             NVIDIA i2c adapter 7 at 7:00.0          I2C adapter
i2c-8   i2c             NVIDIA i2c adapter 8 at 7:00.0          I2C adapter
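Since the bus numbers evidently differ between kernels, it may be more reliable
to look an adapter up by its name than by its number. A minimal sketch, assuming
the usual sysfs layout where each adapter exposes a `name` file under
/sys/bus/i2c/devices/i2c-N; `find_i2c_bus` is a hypothetical helper name, and
the optional second argument exists only so it can be tried against a copy of
the tree:

```shell
# find_i2c_bus PATTERN [SYSFS_DIR]
# Print the bus name (e.g. "i2c-1") of every adapter whose name contains
# PATTERN as a fixed string. SYSFS_DIR defaults to the standard location.
find_i2c_bus() {
    pattern=$1
    root=${2:-/sys/bus/i2c/devices}
    for f in "$root"/i2c-*/name; do
        # Skip the unexpanded glob when no adapter exists.
        [ -r "$f" ] || continue
        if grep -qF -- "$pattern" "$f"; then
            basename "$(dirname "$f")"
        fi
    done
}
```

With the listing above, `find_i2c_bus 'PIIX4 adapter port 0'` should print
i2c-1 on this kernel.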

Looking at the filesystem I can see that the jc42 devices are not present on
any controller.
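One way to make that check repeatable is to list the clients currently bound to
the driver. A minimal sketch, assuming the driver directory is
/sys/bus/i2c/drivers/jc42 and that bound clients appear there as entries named
BUS-ADDR (for example 1-0018); `list_bound_clients` is a hypothetical helper
name:

```shell
# list_bound_clients DRIVER [DRIVERS_DIR]
# Print every I2C client (BUS-ADDR, e.g. "1-0018") bound to DRIVER.
# Prints nothing when the driver has no clients or is not loaded.
list_bound_clients() {
    driver=$1
    root=${2:-/sys/bus/i2c/drivers}
    for d in "$root/$driver"/[0-9]*-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$d" ] || continue
        basename "$d"
    done
}
```

An empty result from `list_bound_clients jc42` would match what I see on 6.11.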

If I add new devices, they will function seemingly normally, but upon reboot,
they are lost. I can re-add them and they will function until reboot. 
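For reference, a minimal sketch of adding a device by hand, assuming the
kernel's sysfs `new_device` interface is the procedure in question. The helper
name `add_i2c_device` is hypothetical, and the address 0x18 in the usage note is
only an example, not a value taken from this system. Please read the word of
caution at the end of this report before trying anything like this on real
hardware.

```shell
# add_i2c_device DRIVER ADDR BUS [SYSFS_DIR]
# Ask the kernel to instantiate a client for DRIVER at ADDR on BUS.
# Needs root when used against the real sysfs tree.
add_i2c_device() {
    driver=$1
    addr=$2
    bus=$3
    root=${4:-/sys/bus/i2c/devices}
    target="$root/$bus/new_device"
    if [ ! -e "$target" ]; then
        echo "no such adapter: $bus" >&2
        return 1
    fi
    # The kernel expects "<name> <address>" written to new_device.
    echo "$driver $addr" > "$target"
}
```

Usage would be something like `add_i2c_device jc42 0x18 i2c-1`. As far as I
understand it, a device created this way exists only in the running kernel, so
something has to recreate it on each boot.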

When I reboot from 6.11.0 (where the devices are not present, having been
'forgotten' at the previous reboot, and the controller is on i2c-1) into
6.10.11, the devices are retained, fully functional, and controlled by i2c-6.

I noticed https://bugzilla.opensuse.org/show_bug.cgi?id=1231228 but my problem
seems different. I do not have the jc42 modules automatically detected like
that (it would be nice!) but I do notice that in 6.11, there are new entries
present which did not exist in 6.10:

Oct 04 23:30:38 localhost kernel: i2c i2c-1: Successfully instantiated SPD at 0x50
Oct 04 23:30:38 localhost kernel: i2c i2c-1: Successfully instantiated SPD at 0x51
Oct 04 23:30:38 localhost kernel: i2c i2c-1: Successfully instantiated SPD at 0x52
Oct 04 23:30:38 localhost kernel: i2c i2c-1: Successfully instantiated SPD at 0x53
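To compare kernels, these entries can be reduced to bus and address pairs. A
minimal sketch, assuming the message format is exactly as shown above, one
message per line; `spd_addresses` is a hypothetical helper name:

```shell
# spd_addresses
# Read kernel log lines on stdin and print "BUS ADDR" for each
# "Successfully instantiated SPD" message; other lines are ignored.
spd_addresses() {
    sed -n 's/.*i2c \(i2c-[0-9]*\): Successfully instantiated SPD at \(0x[0-9a-fA-F]*\).*/\1 \2/p'
}
```

It can be fed from the current boot's kernel log with
`journalctl -k -b | spd_addresses`.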

These are my memory sticks' SPD chips, and they were previously only readable
by using the `eeprom` kernel module. The `ee1004` and `at24` modules would
return nothing. They are now readable, without any change by me, using the
ee1004 module, which is in use for the above devices.

So it seems that something changed in the piix4_smbus driver. Its adapter has
changed address, and it now detects my DRAM SPD, which is nice. The new SPD
driver also works, which is nice too, given that the old one seems to be gone.
But it is not detecting the temperature sensors, and it forgets(?) them when I
add them.

I will collect the dmesg and hwinfo outputs from each kernel version (6.10.11,
6.11.0, 6.12-rc1) shortly. Is there anything else that might be useful while
I'm at it?

A word of caution to others who might be suffering this bug or one like it: I
experienced very severe hardware faults related to this. After adding the
devices to the controller and rebooting, my monitor started to flash on and off
a few times per second, and after factory-resetting the monitor and rebooting,
my PC would not POST, failing first to train the RAM (three or four different
errors at random, shown on motherboard 8-segment display), and once that was
isolated, failing to detect graphics. A BIOS reset was needed to get it back in
order. It has been OK since. I believe it to be a one-off not triggered by this
bug alone, but it was undoubtedly triggered by my attempt to fix the missing
devices by following standard procedure to add them. It seems fine now, but I
would be remiss not to mention it.

