Re: Avoid breaking running systems with zypper up (Was: The future of OpenSUSE Leap (rant), PLUS: nvidia)

27 Jul 2022

On 26.07.22 at 22:31, Patrick Shanahan wrote:
> ...
> fwiw: I have utilized nvidia for many years and it is not uncommon for
> any particular nvidia card to fail for a particular kernel/driver or for
> the driver to be issued somewhat late.  that said, it is infrequent and is
> simple to boot the previous kernel and not be concerned with anything
> else.  soon another driver and/or kernel will be issued and solve the
> problem.

In my case, that approach didn't work. This is frustrating since I know 
a bit about software:

https://stackoverflow.com/users/34088/aaron-digulla

Also, I feel like there are a dozen different solutions for this 
problem. All of them work, but none works every time. I, for example, 
have never tried to boot an old kernel: someone once told me never to 
do that, because the user space tools on disk might break when the 
kernel suddenly changes, which could lead to all kinds of strange 
problems.
> ...
> the important thing is to accurately report the conditions and equipment
> so the maintainers may determine the correction needed.
> emphasis on "accurately report the conditions and equipment"
> it is virtually impossible to test against all possible conditions!

I don't agree at all. I think we are in this mess because:

1. The kernel doesn't have a stable binary driver API like, say, the 
AmigaOS already had in 1984.

2. The kernel comes with a ton of config options. This makes it very 
versatile but also brittle: if you're not very careful, you get a 
combinatorial explosion of test cases.

3. Software should have unit test code coverage of 40-60%. What's the 
coverage for kernel 5.14?

And don't tell me you can't test hardware drivers. I did it in AROS and 
in commercial software as well. Yes, testing the peeking and poking of 
hardware registers is expensive, but it is not impossible. A coverage 
of 40% is still easy to achieve because drivers contain a lot of other 
code (config processing, error handling, management, ...).
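
To illustrate which parts of a driver can be tested without the device, 
here is a minimal sketch in Python. It is not kernel code and not how 
AROS did it; every name in it (FakeRegisters, the register offsets, 
parse_config, enable_device) is hypothetical and invented for this 
example. The idea is that register access goes through a small object, 
so a fake can stand in for the hardware.

```python
class FakeRegisters:
    """Stands in for memory-mapped hardware registers (hypothetical)."""

    def __init__(self):
        self.values = {}   # offset -> current register value
        self.writes = []   # every write, in order, for later inspection

    def read(self, offset):
        return self.values.get(offset, 0)

    def write(self, offset, value):
        self.writes.append((offset, value))
        self.values[offset] = value


# Register offsets and bits are invented for this sketch.
REG_CONTROL = 0x00
REG_STATUS = 0x04
STATUS_READY = 0x1
CTRL_ENABLE = 0x1


def parse_config(text):
    """Config processing: pure logic, testable without any hardware."""
    config = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=" not in line:
            raise ValueError(f"line {line_no}: expected key=value, got {line!r}")
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config


def enable_device(regs):
    """Error handling around a register access, testable through the fake."""
    if not (regs.read(REG_STATUS) & STATUS_READY):
        raise RuntimeError("device not ready: STATUS_READY bit is clear")
    regs.write(REG_CONTROL, CTRL_ENABLE)
```

A test then sets the status value on a FakeRegisters instance, calls 
enable_device, and checks the recorded writes. Only the real read and 
write implementations still need the actual device.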

4. There is no centralized automated CI system where everyone working on 
the kernel or a driver can push patches and drivers to see if they even 
compile.

5. Every Linux company spends countless hours on maintaining their own 
kernel patches. So even if your code works with the vanilla kernel, 
there is a good chance that it won't work with anything else.

6. Many developers don't bother to write useful error messages: ones 
that contain all the information necessary to fix the issue.
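
As an illustration of the last point, here is a small Python sketch of 
an error message that carries what is needed to act on it. The function 
name, the module name and the version strings are all made up for the 
example; this is not the output of any real tool.

```python
def describe_load_failure(module, built_for, running, hint):
    """Build an error message that says what failed, why, and what to do.

    All parameters are hypothetical and exist only for this sketch.
    """
    return (
        f"cannot load module {module!r}: "
        f"built for kernel {built_for}, but the running kernel is {running}. "
        f"{hint}"
    )


# A bare message leaves the reader guessing:
unhelpful = "module load failed"

# The same failure with its context attached:
helpful = describe_load_failure(
    module="example_gpu",        # invented module name
    built_for="5.14.21-1",       # invented version strings
    running="5.14.21-2",
    hint="Rebuild the module for the running kernel or boot the previous one.",
)
print(helpful)
```

The second message names the component, both versions involved, and a 
possible next step, so whoever reads the log doesn't have to ask.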

All of these things are this way because someone wants them to be that 
way. Linus opposes the idea of a stable external and driver API in the 
kernel (while ranting that the stupid desktop guys change their API all 
the time...). There are good reasons for his stance, but one of the 
drawbacks is that drivers break every now and then. And then even 
someone with 300'000+ points on Stack Overflow has a hard time fixing 
it. And I have to waste your time asking questions.

For these reasons: A mess like this doesn't happen by pure chance. It 
takes meticulous planning and careful execution by many smart people 
over many years. ;-)

Anyway ... thanks for your time.

Good night,

-- 
Aaron "Optimizer" Digulla a.k.a. Philmann Dark
"It's not the universe that's limited, it's our imagination.
Follow me and I'll show you something beyond the limits."
http://blog.pdark.de/
