[Bug 1236116] New: python-numpy: unreliable dot product calculation

https://bugzilla.suse.com/show_bug.cgi?id=1236116 Bug ID: 1236116 Summary: python-numpy: unreliable dot product calculation Classification: openSUSE Product: openSUSE Tumbleweed Version: Current Hardware: x86-64 OS: openSUSE Tumbleweed Status: NEW Severity: Major Priority: P5 - None Component: Python Assignee: python-maintainers@suse.com Reporter: lpechacek@gmx.com QA Contact: qa-bugs@suse.de CC: eich@suse.com, stefan.bruens@rwth-aachen.de Target Milestone: --- Found By: --- Blocker: --- [CCing Egbert for HPC as the bug may be related to LAPACK and BLAS and Stefan who also maintains Rasterio] I've just spent two afternoons staring into Rasterio tests to root-cause its post-build test failures on i586 [1]. Rasterio uses Numpy for geodetic transformations and there are tests implemented in the package to ensure accuracy. All the tests pass on x86_64 but 15 tests fail on i586. All the test failures boil down to inaccurate calculation of one affine transformation. The minimal test case showing the failure is as follows: [newlines added for easier reading] -----------------------8<--------------------------- fmn:~ # chroot /var/tmp/build-root/openSUSE_Tumbleweed-i586 fmn:/ # python3 Python 3.11.11 (main, Dec 04 2024, 21:44:34) [GCC] on linux Type "help", "copyright", "credits" or "license" for more information. Could not open PYTHONSTARTUP FileNotFoundError: [Errno 2] No such file or directory: '/etc/pythonstart'
from numpy import (array, float64)
array([0.0033329119791008304, -339.9070281885982]) @ array([101985.0, 1])
np.float64(-2.275957200481571e-15)
float64(0.0033329119791008304) * float64(101985.0) + float64(-339.9070281885982)
np.float64(0.0) ----------------------->8--------------------------- According to documentation, the dot product calculation performed by numpy.matmul (the @ operator) is accelerated with help of BLAS/LAPACK [2] but I don't see into the details as I never dealt with Numpy and LAPACK before. IMHO the inaccuracy in calculation is worth attention, and depending on the status of i586 in the openSUSE world, might be worth fixing. I'm filing the bug report in case someone has seen similar symptoms and/or knows what might be the cause. Thanks! [1] https://build.opensuse.org/package/show/Application:Geo/python-rasterio (will build again as soon as upgrade to 1.4.3 is accepted) [2] https://numpy.org/doc/stable/reference/generated/numpy.matmul.html -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c1 --- Comment #1 from Stefan Brüns <stefan.bruens@rwth-aachen.de> --- The exact result of the calculation is neither 0.0 nor -2.27...e-15, but -0.000000000000011656, or -11.656e-15. The difference is caused probably by the fact i586 uses the classic 387 FPU path using 80bit doubles internally, while x86_64 uses IEEE-754 doubles with a 53 bit mantissa. Any test code which relies on excess precision is just wrong. -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c2 --- Comment #2 from Stefan Brüns <stefan.bruens@rwth-aachen.de> --- At least for the "index" tests, the problem is likely caused by the default "numpy.floor" op for the index() method. https://github.com/rasterio/rasterio/blob/7adb4f5b32c006067cca2aaa8a4004ed10... v * M * M^-1 (with M * M^-1 = Identity) is only true mathematically, but not numerically. -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c3 --- Comment #3 from Stefan Brüns <stefan.bruens@rwth-aachen.de> ---
v * M * M^-1 == v -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c4 Libor Pechacek <lpechacek@gmx.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |lpechacek@gmx.com --- Comment #4 from Libor Pechacek <lpechacek@gmx.com> --- Thanks, Stefan, for the comments!
The difference is caused probably by the fact i586 uses the classic 387 FPU path using 80bit doubles internally, while x86_64 uses IEEE-754 doubles with a 53 bit mantissa.
No idea. I've instead done a few more tests. A 32bit Debian 12 system I installed fresh shows the same symptoms. Next I tried to take BLAS out of the equation. I removed it from the system and installed Numpy using PIP. As Numpy could not delegate to BLAS, the result was aligned with the other platforms again. So the excess numeric bits appear during the handover to BLAS or so.
Any test code which relies on excess precision is just wrong.
At least for the "index" tests, the problem is likely caused by the default "numpy.floor" op for the index() method.
I concur. The immediate trouble is that the "slightly negative" number gets rounded down to -1 which is not the expected zero. Replacing the post-processing operation with numpy.round() solves this immediate problem. This is the root cause of one batch of test failures. The other batch of errors stem from a failure to look up the right pixel in the very same georeferenced image. I only lightly touched on that other problem but I suspect that the numeric errors are larger here. The affine transformation is the same as it's the same geo-image. See how far I get on this one.
v * M * M^-1 == v (with M * M^-1 = Identity) is only true mathematically, but not numerically.
How true. -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c5 --- Comment #5 from Egbert Eich <eich@suse.com> --- Hi Libor! Ineed, precision issues with 387 extended precision vs. IEEE math are not unexpected. Lapack's own test suite failed on i586. This led to the following change: Thu Feb 23 16:58:25 UTC 2023 - Egbert Eich <eich@suse.com> - Set -mfpmath=sse for the entire build for ix86 platforms on SLE/Leap. Since we build for x86_64, we know that sse is available. This helps to avoid effects from excess precision that can be seen in the test suite. On Factory we leave -ffloat-store for the test suite only as this option comes at a performance penalty. We may see precision related issues in the test suite with future compilers regardless. to the Lapack package - after careful deliberation with Richi (Cc). We did not make this change to 'real' i586 builds (like on Factory) as can't assume SSE to be available there. Since most tests today seem to assume IEEE math precision, its a good idea to repeat tests after building the numerical libraries using a different accelerator than 387 on i586 if this is the suspected cause of failure. -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c6 --- Comment #6 from Libor Pechacek <lpechacek@gmx.com> --- Thanks, Egbert, for your comment! For me the bottom line is that the effects I'm observing are not unexpected. I myself always tried to stay at arm's length from the floating-point math as soon as I had learned about its peculiarities at the university. Well, the ghost has hunted me down three decades later. :) I'll look into the Rasterio tests if they can be made more robust when I have a while. Thank you all for your support! -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c7 --- Comment #7 from Egbert Eich <eich@suse.com> --- (In reply to Libor Pechacek from comment #6)
Thanks, Egbert, for your comment! For me the bottom line is that the effects I'm observing are not unexpected.
I myself always tried to stay at arm's length from the floating-point math as soon as I had learned about its peculiarities at the university. Well, the ghost has hunted me down three decades later. :)
I'll look into the Rasterio tests if they can be made more robust when I have a while. Thank you all for your support!
Libor, before you do anything, please think about how much interest exists in running Rasterio on 32bit x86 or whether a 32bit variant would be needed on x86_64. Being that it is python module, the latter is rather unlikely. On openSUSE, we offer 32bit x86 only as a tumbleweed port. Ports often have a very reduced set of packages. Other options may include: 1. Disable build for i586 2. Not run the testsuite on i586 or ignore its results. 3. Ignore the results of the offending tests in %check - if this is possible 4. Not run the offending tests during %check - as far as I can see, you do this already. 5. We create a separate sse-enabled blas/lapack flavor for 32bit x86 on factory for use in such cases. -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c8 --- Comment #8 from Stefan Brüns <stefan.bruens@rwth-aachen.de> --- (In reply to Egbert Eich from comment #7)
(In reply to Libor Pechacek from comment #6) Libor, before you do anything, please think about how much interest exists in running Rasterio on 32bit x86 or whether a 32bit variant would be needed on x86_64. Being that it is python module, the latter is rather unlikely. On openSUSE, we offer 32bit x86 only as a tumbleweed port. Ports often have a very reduced set of packages. Other options may include: 1. Disable build for i586 2. Not run the testsuite on i586 or ignore its results. 3. Ignore the results of the offending tests in %check - if this is possible 4. Not run the offending tests during %check - as far as I can see, you do this already. 5. We create a separate sse-enabled blas/lapack flavor for 32bit x86 on factory for use in such cases.
I agree with the sentiments "against" i586, but I think getting the tests more robust (or preferably even the underlying functions) has some benefit on its own. The classic x86 FPU is not the only case where precision deviates, code may use FMA, and I have also seen different results on x86-SSE and ARM. I think the most important thing to do is tackle this in a way which is upstreamable. Create a JIRA issue, ask the maintainer how to best move forward. Most maintainers will no longer have an interest in i586, but references to FMA and ARM (and numerical issues in general) may help your case. -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c9 --- Comment #9 from Egbert Eich <eich@suse.com> --- (In reply to Stefan Brüns from comment #8)
(In reply to Egbert Eich from comment #7)
I agree with the sentiments "against" i586, but I think getting the tests more robust (or preferably even the underlying functions) has some benefit on its own.
The classic x86 FPU is not the only case where precision deviates, code may use FMA, and I have also seen different results on x86-SSE and ARM.
I think the most important thing to do is tackle this in a way which is upstreamable. Create a JIRA issue, ask the maintainer how to best move forward. Most maintainers will no longer have an interest in i586, but references to FMA and ARM (and numerical issues in general) may help your case.
I agree that this would be the best solution. However, it also sounds like a can of worms. The original authors may not have much experience with 'hardening' tests. A good 'incentivizer' may be to determine which tests fail on arm and when using FMA. Arm could easily be tested in OBS, however, I wonder how representative our 'least common denominator' build is for advanced math instructions on arm. The same is true for FMA: one would need an FMA-enabled BLAS library. I'm afraid our DYNAMIC_ARCH OpenBLAS build will not use FMA in common code as it uses CORE2 as the 'least common denominator'. Also, with openblas, what one gets heavily depends on the CPU model used when running the test (which in itself may be a good robustness test). We probably need to find a solution for the time being - for this, I'd still suggest to choose one of the options above. -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c10 --- Comment #10 from Libor Pechacek <lpechacek@gmx.com> ---
Libor, before you do anything, please think about how much interest exists ...
Egbert, you know me, I guess. I first collect information then I make a decision. :) The bits I needed to know: 1) Numpy is allowed to return basically anything depending on the underlying floating point computation resources. 2) That's a known issue that does not have to be worried about. 3) Looking into the Rasterio upstream issue log, I would say that https://github.com/rasterio/rasterio/issues/3016 clearly illustrates that there is no interest in solving the problem there either. For me the bottom line is that on most platforms Rasterio works fine and isn't missed on those where the tests don't pass. Well, if I can make the situation better, I'll do so, although I'm progressively losing hope as I'm reading the upstream issue records at GitHub. -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c11 --- Comment #11 from Egbert Eich <eich@suse.com> --- (In reply to Libor Pechacek from comment #10)
Libor, before you do anything, please think about how much interest exists ...
Egbert, you know me, I guess. I first collect information then I make a decision. :)
The bits I needed to know: 1) Numpy is allowed to return basically anything depending on the underlying floating point computation resources. 2) That's a known issue that does not have to be worried about.
Very quickly on this point: Yes, I'd say, this depends very much on the underlying framework and the compiler/compiler options used for Numpy itself. For the same of performance Numpy won't be able to make promises on result precision. It's the same as if you were coding it in Fortran or C and use a blas/lapack implementation underneath. -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 Matej Cepl <mcepl@suse.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|python-maintainers@suse.com |eich@suse.com CC| |mcepl@suse.com -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c12 Libor Pechacek <lpechacek@gmx.com> changed: What |Removed |Added ---------------------------------------------------------------------------- Resolution|--- |UPSTREAM Status|NEW |RESOLVED --- Comment #12 from Libor Pechacek <lpechacek@gmx.com> --- IMHO there is nothing to be solved on the Numpy side. My guess is that the test failures arise from mixing native Python floating point calculations with Numpy calculations. The failing Rasterio test triggers matrix inverse calculation in Python Affine module (https://github.com/rasterio/affine/blob/d198eaddca348e35291a978f9f713571def1...) and then feeds the matrix into Numpy to obtain the translated coordinates (https://github.com/rasterio/rasterio/blob/main/rasterio/transform.py#L517-L5...). I'll file my conjecture as a problem report with Rasterio experts and close this bug, if you don't mind. For openSUSE, I'll disable x86 builds so that we don't waste resources. -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c13 --- Comment #13 from Libor Pechacek <lpechacek@gmx.com> --- (In reply to Libor Pechacek from comment #12)
I'll file my conjecture as a problem report with Rasterio experts
https://github.com/rasterio/rasterio/issues/3310 -- You are receiving this mail because: You are on the CC list for the bug.

https://bugzilla.suse.com/show_bug.cgi?id=1236116 https://bugzilla.suse.com/show_bug.cgi?id=1236116#c14 --- Comment #14 from Egbert Eich <eich@suse.com> --- Libor, it's best to document this in the upstream project - even though they will not do nothing. At least it becomes a 'known issue' for the next person to stumble over it. Thank you! -- You are receiving this mail because: You are on the CC list for the bug.
participants (1)
-
bugzilla_noreply@suse.com