Comment # 4 on bug 1172541 from
I installed kernel-default-5.7.1-1.2.x86_64 (except aarch64 on one
machine).  The three machines that froze up on
kernel-default-5.6.14-1.1.x86_64 also froze up on 5.7.1.  I kept the
other six on 5.7.1 for about 48 hours with no freezeups.  Again, when
a freeze happens it leaves no trace of the problem in the logs.

I have two machines with identical hardware (Intel NUC6CAYH, Celeron
J3455, Intel HD Graphics 500, 2x4GB RAM, Realtek RTL8111/8168/8411 NIC),
so that I can quickly replace the more critical one if it fails.  The
critical one is the main router, music library, directory server, etc.
The other is a leaf node that does audio performance.  The router is
always the first to freeze up, whereas the leaf node has never frozen
yet, despite substantial net traffic for the music.  I suspect that
hardware is irrelevant, but it's way too soon to blow off this
possibility.  I suspect that the router's roles exercise a vulnerable
data path, but there is no evidence of this either.

However, when I put 5.7.1 on my "non-failing" hosts I got a rash of
connection failures.  This is like chasing ghosts: here's the issue that
I worked on first, since it was the most mission-critical.  I don't
seriously expect anyone to figure it out, but I'm getting it on record
in case it provides a clue.  My publicly exposed webserver is on a VM
(running 5.7.1 on the Celeron).  IPv4 and IPv6 connections from the wild
side come into the main router (running 5.6.12) and get DNATed to the
webserver.  HTTP and HTTPS on IPv4 work.  IPv6 from the wild side used
to work but does not now; with the "comorbid" network issues, though,
IPv6 is too complicated to give any clues.  A tester on the webserver VM
attempts https://www.jfcarter.net/ and it times out, over both IPv4 and
IPv6.  I did lots of troubleshooting including tcpdumps and reboots
(into 5.7.1).  The client would initiate a TLS connection with SNI, the
server would send its cert which the client would accept as valid, but
the server then just sat there, not sending a TLS session ticket.
Finally I tried a scorched earth solution: I rebooted all the leaf nodes
and VMs into 5.6.12.  The net issues miraculously vanished (not seen for
2 days), and specifically, web service was back to normal.

It looks like kernel 5.7.2 is now out.  I'm going to install it on one
vulnerable machine and see if it freezes.  But in the meantime I'm going
to learn to use git and I'll try to use the bisect feature to find the
commit that messed it up.  I wish there were a symptom that would appear
immediately and leave the machine alive enough that you could log in
remotely and boot back to 5.6.12.  But the world isn't arranged for our
convenience.
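
For anyone else new to bisecting, here is a rough sketch of the search
that a bisect performs.  It is only an illustration in Python, not git's
actual implementation: it assumes a linear list of commits in which the
first is known good, the last is known bad, and every commit after the
first bad one is also bad.  The names first_bad and is_bad are
hypothetical; in practice is_bad means "build and boot that kernel and
wait long enough to decide whether it freezes", which is the expensive
part for this bug.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad commit in a linear history.

    Assumes commits[0] is good, commits[-1] is bad, and that once a
    commit is bad all later ones are bad too.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the breakage is at mid or earlier
        else:
            lo = mid  # the breakage is after mid
    return commits[hi]


if __name__ == "__main__":
    # Hypothetical history of 16 commits where commit 11 introduced the bug.
    history = [f"commit-{i}" for i in range(16)]
    culprit = first_bad(history, lambda c: int(c.split("-")[1]) >= 11)
    print(culprit)
```

Each round halves the remaining range, so the number of kernels to test
grows only with the logarithm of the number of commits.  A wrong "good"
verdict (a kernel that simply had not frozen yet) would send the search
in the wrong direction.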

If I ever accidentally uninstall 5.6.12 I'll have a real problem.
Several times I've wanted to revert a bad update, but my only recourse
was either to wait for SuSE to release a fixed version or to fix it
myself, e.g. https://bugzilla.opensuse.org/show_bug.cgi?id=1172256 .
The present issue (bug 1172541) finally got me moving to set up an
archive of all RPMs installed on any of my machines, keeping
no-longer-installed versions for a month or two in case I need to
revert.  Did I reinvent the wheel, that is, does SuSE have such an
archive server?  If not, maybe you should.
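
The retention rule for such an archive can be sketched as follows.
This is a minimal illustration, not the actual setup: the function name
prune_candidates, the 60 day default, and the bookkeeping (a mapping
from RPM file name to the date it was last seen installed on any
machine) are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def prune_candidates(
    last_seen: Dict[str, datetime],
    installed: Set[str],
    now: datetime,
    keep_days: int = 60,
) -> List[str]:
    """List archived RPMs that are safe to delete.

    An RPM is kept while it is installed on any machine, and for
    keep_days after it was last seen installed anywhere.
    """
    cutoff = now - timedelta(days=keep_days)
    return sorted(
        name
        for name, seen in last_seen.items()
        if name not in installed and seen < cutoff
    )


if __name__ == "__main__":
    now = datetime(2020, 6, 15)
    # Hypothetical archive contents and dates, for illustration only.
    archive = {
        "old-package-1.0-1.x86_64.rpm": datetime(2020, 3, 1),
        "kernel-default-5.6.14-1.1.x86_64.rpm": datetime(2020, 6, 10),
        "kernel-default-5.7.1-1.2.x86_64.rpm": datetime(2020, 6, 15),
    }
    in_use = {"kernel-default-5.7.1-1.2.x86_64.rpm"}
    print(prune_candidates(archive, in_use, now))
```

Keying the rule on "last seen installed" rather than on the file's age
means a version that is still in use somewhere is never pruned, however
old it is.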

