[opensuse] Novell Bugzilla - At it Again - Bugs Apparently Dismissed Without Sufficient Investigation
List,
How many of you have ever gotten the feeling that some of the developers with Novell bugzilla work harder trying to punt bugs than they do actually trying to understand whether a valid problem exists or not? I have been frustrated a number of times when I have gone the extra mile to provide a detailed, documented bug submission, only to have someone at Novell try to punt the bug when it is apparent that not one iota of investigation has been done sufficient to either rule in or rule out a problem.
What needs to be done to motivate some of the guys over there to get over the "take the easy way out" approach? Marcus, your thoughts?
openSuSE suffers greatly when bugs are allowed to propagate without correction under this philosophy. Valid opportunities are missed that could better the product and help set Novell apart as a leader. We all share a common goal of working to help make openSuSE and the remaining Novell distros the best they can be, the most useful and the most reliable systems out there. It benefits the user base because we all have a better distro, and it benefits Novell because it could have the most polished and reliable distro to offer.
But no, sadly, some people just don't want to have to do the hard work to figure out if a valid bug is present when they can arbitrarily close it for some cockamamie reason. The latest classic example:
https://bugzilla.novell.com/show_bug.cgi?id=376165
A software bug is being reported, but since the software log says the error is hardware, they try to close the bug. Notwithstanding that the entire bug focuses on the system's response to having the nvidia kernel module loaded, and how it runs fine without the software, the entire bug was closed as invalid because mcelog says some of the errors are hardware errors. And, if mcelog says so, then it's got to be true! Sad, really.
I have worked with some of the best developers you could ever hope to work with on bugs with Novell. However, there is a not so small minority that come across as not even wanting to give you the time of day, much less take even a cursory look at the issues to be able to make a reasoned determination. The lack of consistency hurts (and no, not me emotionally either ;-)
What have others experienced? Where does this issue need to get raised? I certainly expect that when a bug report is submitted, it is investigated and at least analyzed to the point where you can, with certainty, rule hardware in or out, rule software in or out, and know, based upon that investigation, that it isn't a combination of both.
If you know somebody in Novell, then by all means address these issues to them so hopefully we concentrate on fixing bugs and not on dismissing them.
-- David C. Rankin, J.D., P.E. Rankin Law Firm, PLLC 510 Ochiltree Street Nacogdoches, Texas 75961 Telephone: (936) 715-9333 Facsimile: (936) 715-9339 www.rankinlawfirm.com
-- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org For additional commands, e-mail: opensuse+help@opensuse.org
On Friday 2008-04-04 08:20, David C. Rankin wrote:
How many of you have ever gotten the feeling that some of the developers with Novell bugzilla work harder trying to punt bugs than they do actually trying to understand whether a valid problem exists or not?
Sometimes. But Fedora's BZ is __much__ worse. Just today I got automated mails "Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on." --- on bugs no one seriously *acted* upon in the lifetime of the product. At Novell, they at least get it done someday, or bump the version tag, or talk to you time and again :-)
The oldest active SUSE-related bugs (i.e. not upstream package) have ids 262341 (opened a year ago) and 306344, so are quite recent.
Everything is in order.
Jan Engelhardt wrote:
On Friday 2008-04-04 08:20, David C. Rankin wrote:
How many of you have ever gotten the feeling that some of the developers with Novell bugzilla work harder trying to punt bugs than they do actually trying to understand whether a valid problem exists or not?
Sometimes. But Fedora's BZ is __much__ worse. Just today I got automated mails "Fedora apologizes that these issues have not been resolved yet. We're sorry it's taken so long for your bug to be properly triaged and acted on." --- on bugs noone seriously *acted* upon in the lifetime of the product. At Novell, they at least get it done someday, or bump the version tag, or talk to you time and again :-)
The oldest active SUSE-related bugs (i.e. not upstream package) have ids 262341 (opened a year ago) and 306344, so are quite recent.
Everything is in order.
Jan, I think you miss the point. David is rightly saying that the reason there are no old bugs is that they keep closing them for the wrong reasons.
I agree with you that SuSE is probably the best overall Linux distro (not without its warts, admittedly). I've used many of them, starting with Slackware when it was distributed on over 60 floppy disks and ran on my 286 machine (IF you could get it to load and compile before the refrigerator bumped the power and reset the machine again :) ), through RedHat and Ubuntu and a few from CDs in the back of books. I keep coming back to SuSE, BUT, and it is a very big BUT, David is absolutely right about bugs.
While I still support SuSE and do all of the Alpha/Beta testing, I now rarely bother to do bug reports, because my experience has been that when you present something to them that is too hard or out of the ordinary or even just inconvenient, it gets shunted aside. There are hundreds of spelling errors and wrong colors and, yes, even a lot of truly nasty critters that they DO fix, but I expect those and expect them to be resolved in due course. David has reported serious bugs, and I have previously reported, documented and thoroughly researched serious bugs, as have many others I have seen here, and many, no, most of those have been shunted aside for flimsy or even no excuses.
All in all, SuSE is a great distro, and many, even most, of the guys working the bugs are doing good jobs, but there are way too many that do as David has reported. No, EVERYTHING IS NOT IN ORDER. That Fedora or XYZ is worse is NOT justification for openSUSE to say that because we are somewhat better, everything is fine.
To those in openSUSE that *are* doing the super job, keep it up, we need you. To those like the ones David is referring to, go work for Fedora and bring them up to SuSE's standards!
Richard
On 05/04/2008, Richard Creighton <ricreig@gmail.com> wrote:
Jan, I think you miss the point....David is rightly saying that the reason there are no old bugs is because they keep closing them for the wrong reasons. ... support SuSE and do all of the Alpha/Beta testing, I now rarely bother to do bug reports because my experience has been that when you present something to them that is too hard or out of the ordinary or even just inconvenient, it gets shunted aside.
I have to disagree. Having used various projects' bug trackers, I have had the best experience with the openSUSE bugzilla.
Bugs are generally triaged very quickly (<24hrs mostly in my experience) and either resolved as duplicates or assigned to the appropriate people. The assignees are also usually very responsive at acknowledging and requesting additional information.
Bugs that will be fixed first are generally a) the most critical, and b) the easiest to reproduce or identify. The former depends on the severity of the problem and the number of users it affects. The latter depends on how good the bug report is. Review http://en.opensuse.org/Bugs#Reporting_a_Bug for information about how best to report a bug in some specific products.
There is little point in complaints such as this one about a specific bug being closed when you do not believe it was resolved. Instead take the appropriate action. Bear in mind that the developers closing bugs are all individual people. It is not a Novell conspiracy to close n bugs per day; in fact the Novell bugzilla is used by many people, not just Novell employees. The phrase "the guys over there" does not make sense. The Bugzilla instance is a service which can be used freely by anyone. I also object to the suggestion of bugs being closed for "cockamamie reasons". Ultimately, whether a bug can or should be fixed is up to the developer of the software the bug is present in, and/or project management. Would you rather a developer spent weeks fixing a bug that may or may not be valid, where the reporter provides insufficient information, all the while forsaking hundreds of other valid bugs?
Some suggestions of how to better deal with bugs being closed:
If a bug is resolved as NORESPONSE or WORKSFORME, this usually means you have not provided enough information for the assignee to reproduce, or identify the cause of, the bug. Often it will have been NEEDINFO before this. If you can provide the information, re-open the bug with the required information. If you cannot provide sufficient information, then you can always ask on the mailing list whether anyone else experiences the same issue, and for help diagnosing the problem.
If a bug is resolved as FIXED when it is not fixed, then re-open it and attach the appropriate evidence (logs, screenshots...), ensuring that you have tested with the version of the software that it is supposedly fixed in.
If a bug is resolved as LATER, then it could mean that it can't be fixed in the immediate future because of policy such as a string/feature freeze. It could mean that the assignee does not have time to fix it for the coming version, but the bug is acknowledged. Bear in mind that not everything can be fixed; there are limited manhours available.
If a bug is left in a NEW state for an extended length of time, you can try asking one of the maintainers about it, or asking on the mailing list. This is fairly unusual for important bugs.
If a bug is marked as WONTFIX, then this means the bug is acknowledged as being a bug, but won't be fixed. This might happen if the bug/enhancement is too difficult to fix, or if a requested feature conflicts with a project policy/aim. The person who marks it as WONTFIX should also include a reason for doing so. If you disagree with the reasoning, and can find enough people who agree with you, then it might be possible to have the decision reconsidered, especially if the reason is difficulty/expense and someone will volunteer to fix it.
If a bug is resolved as INVALID, it is not considered a bug. It could be that it is simply a junk report (there are surprisingly regular reports that are almost spam, e.g. no details at all are provided), or it is expected behaviour, or it is not a bug in the software at all. In this case it was a) probably a hardware problem, and b) triggered by a third-party proprietary closed-source driver, which the developers have no control over.
Also remember that unless the bug is in a SUSE-specific product, or caused by a SUSE-specific patch, you can report the bug upstream.
Hope this helps
-- Benjamin Weber
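The resolution-by-resolution advice in the message above can be condensed into a small lookup table. This is only a sketch: the resolution names are the ones mentioned in the message, while `NEXT_STEP` and `next_step` are hypothetical names chosen for illustration and are not part of Bugzilla or any Bugzilla client.

```python
# Hypothetical summary of the advice given in the message above.
# Keys are the Bugzilla states/resolutions named there; the wording of
# each value paraphrases the message. Nothing here talks to Bugzilla.
NEXT_STEP = {
    "NORESPONSE": "Re-open with the requested information, or ask the list for help diagnosing.",
    "WORKSFORME": "Re-open with the requested information, or ask the list for help diagnosing.",
    "FIXED": "If still broken in the fixed version, re-open and attach logs or screenshots.",
    "LATER": "Acknowledged but postponed (freeze or lack of time); wait for a later version.",
    "NEW": "If it stays NEW for a long time, ask a maintainer or the mailing list.",
    "WONTFIX": "Read the stated reason; gather support or a volunteer and ask for reconsideration.",
    "INVALID": "Not considered a bug here; consider reporting upstream or to the vendor.",
}


def next_step(resolution: str) -> str:
    """Return the suggested follow-up for a resolution name (case-insensitive)."""
    key = resolution.strip().upper()
    return NEXT_STEP.get(key, "No advice recorded for this resolution; ask on the mailing list.")


if __name__ == "__main__":
    for name in ("invalid", "wontfix", "fixed"):
        print(f"{name.upper()}: {next_step(name)}")
```

A table keeps the mapping easy to extend if a Bugzilla instance uses other resolutions; unknown names fall back to asking on the list rather than raising an error.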
Benjamin,
The bug David refers to is one of many. I did not file the particular bug he referred to; however, the issue isn't a particular bug, it is the mentality being exhibited. I read your response, and the bugs in my case rarely, if ever, pertained to 3rd party hardware problems. I don't want to get into a pissing contest, I have better things to do, but as one example of many: on their website, they have a nice article about how to set up and boot a RAID system. I refer you to http://en.opensuse.org/How_to_install_SUSE_Linux_on_software_RAID
Well, I set up my system just as it said, except that I used RAID 5 for /home and 1.5TB total space, but otherwise as they suggested and with NO SEPARATE partition for /boot outside of the RAID, just as shown. It worked fine. However, one day I wanted to run the YAST REPAIR, and it totally hosed my installed system. I filed bug 304657, among a bunch of others involving repair and RAID problems such as 309040, 329702, 331604, 331532 and many, many others, not all my own but similar.
If you research these and others, many filed originally by people other than myself, you will see these were NOT any of the types you mentioned in your response to me regarding David's complaint that too often bugs lead nowhere, despite clear, lucid and concerted efforts by the person(s) filing the report offering suggestions, help, time and effort to assist in debugging. Often, I found the report closed, and only got action after I reopened it. Some, after reopening, got worked on and had long chains of interaction. Unfortunately few were resolved, but they were closed. Finally, many bugs, including many of mine, were closed as 'insufficient resources' or 'resources not allocated to debug this' or, in some cases, with obtuse 'promises' to look into it in "the next release".
What I believe David is saying, and I know I am saying, is that this is a great way to win friends and influence enemies and encourage SERIOUS alpha and beta testing and ASSISTANCE by the community to Novell and openSUSE, NOT! I have found it much more productive to continue Alpha/Beta testing, attempting to find answers and helping people individually, one on one, rather than wasting my time writing bug reports about what color looks best as a background for green chameleons.
There are a few people at both Novell and openSUSE.org that take the time to do the in-depth research and debugging for the serious problems, and I do communicate whatever tidbits of knowledge I come up with, but the buglist system, as currently implemented, does not inspire a lot of confidence for problems more serious than white lettering on a light-grey background in a menu or pop-up. The more serious problems, as David has alluded, seem rarely to get the attention or resources needed to resolve them.
Richard
On Sat, Apr 5, 2008 at 10:06 AM, Richard Creighton <ricreig@gmail.com> wrote:
Often, I found the report closed, and only got action after I reopened it.
Yes, that works. How many people think that having someone from opensuse CLOSE it means "get lost"? I suspect quite a few.
Some, after reopening got worked on and had long chains of interaction. Unfortunately few were resolved, but they were closed. Finally many bugs, including many of mine were closed as 'insufficient resources' or 'resources not allocated to debug this' or in some cases, obtuse 'promises' to look into it in "the next release".
Well, at least that was honest. If the bug is rare enough and hard enough to reproduce, is by definition not a show stopper, can be avoided, and will or may be fixed by future releases, then good resource allocation dictates you walk away.
Patient: "Doctor, it hurts when I do this". Doctor: "Don't do that".
I am not aware of a single piece of software that is bug free. Yes, your system got hosed. But realistically, how many people are going to be desperately trying to repair a system that A) wasn't broken, B) had another OS version installed, and C) was undergoing dangerous maintenance without disconnecting things you don't want touched?
I don't like it when my bugs get ignored either, but I report them anyway because (presumably) they will forward them upstream to the package maintainer, or when work is started on that package for the next release they will search the bug database and attempt to fix them all. Just don't let them close them, and don't fail to respond with the documentation they need.
-- ----------JSA---------
David C. Rankin wrote:
https://bugzilla.novell.com/show_bug.cgi?id=376165
A software bug is being reported, but since the software log says the error is hardware, they try and close the bug. Notwithstanding the entire bug focuses on the systems response to having the nvidia kernel module loaded, and how it runs fine without the software, the entire bug was closed as invalid because the mcelog says some of the errors are hardware error. And, if mcelog say so, then it's got to be true! Sad, really.
Am I missing something? You're asserting that the problem appears when the nvidia module is loaded and goes away when the nv module is loaded?
The nvidia module is not open-source AFAIK and was provided by the hardware vendor, so surely the hardware vendor is the right place to report the problem. Also, that specific driver is the subject of a dispute with some kernel authors, so you can hardly expect them to spend time on it when the open source driver is working. And yes, I know it provides more functionality, but that is because the hardware vendor has not released sufficient information to build the functionality into the open-source driver. So talking to the hardware vendor sounds like exactly the right course of action.
Or did I misunderstand?
Cheers, Dave
Dave Howorth wrote:
David C. Rankin wrote:
https://bugzilla.novell.com/show_bug.cgi?id=376165
A software bug is being reported, but since the software log says the error is hardware, they try and close the bug. Notwithstanding the entire bug focuses on the systems response to having the nvidia kernel module loaded, and how it runs fine without the software, the entire bug was closed as invalid because the mcelog says some of the errors are hardware error. And, if mcelog say so, then it's got to be true! Sad, really.
Am I missing something? You're asserting that the problem appears when the nvidia module is loaded and goes away when the nv module is loaded?
The nvidia module is not open-source AFAIK and was provided by the hardware vendor, so surely the hardware vendor is the right place to report the problem. Also that specific driver is the subject of a dispute with some kernel authors, so you can hardly expect them to spend time on it when the open source driver is working. And yes, I know it provides more functionality but that is because the hardware vendor has not released sufficient information to build the functionality into the open-source driver. So talking to the hardware vendor sounds like exactly the right course of action.
Or did I misunderstand?
Cheers, Dave
Dave,
Yes, I think there is a misunderstanding. The problem *IS NOT* the nvidia kernel module itself; it works fine. What I suspect the problem *IS*, is the way the system handles the memory mapping, etc. *after* the module is loaded. Forget that the module involved is nvidia. It could just as well be called "The bunch of 1's and 0's, this many of them, that goes here when loaded" for purposes of this discussion.
What I _am_ saying is that on my Tyan board with its chipset/bios combination and multicore AMD processors, the software (another module/kernel/pick your favorite part) is in fatal conflict with the driver, and that for debugging purposes there needs to be sufficient investigation to either rule in or rule out the problem.
-- David C. Rankin, J.D., P.E.
David C. Rankin wrote:
Yes, I think there is a misunderstanding. The problem *IS NOT* the nvidia kernel module it self, it works fine. What I suspect the problem *IS*, is the way the system handles the memory mapping, etc. *after* the module is loaded. Forget that the module involved is nvidia. It could just as well be called "The bunch of 1's and 0's, this many of them, that goes here when loaded" for purposes of this discussion.
Well, if you can reproduce the problem with another module to which you have the source then I'm sure they will investigate. You could always take a small module and add an array containing whatever pattern of 1s and 0s you think is important :)
What I _am_ saying is that on my Tyan board with its chipset/bios combination, and multicore AMD processors, the software (another module/kernel/pick you favorite part), is in fatal conflict with the driver and that for debugging purposes, there needs to be sufficient investigation to either rule in or rule out the problem.
Agreed. The issue is who should do that investigation. You seem to believe that it should be Novell/kernel developers. I - and others AFAICT - believe it should be nvidia. And they're unlikely to do it unless you report it to them.
Cheers, Dave
Dave Howorth wrote:
David C. Rankin wrote:
Yes, I think there is a misunderstanding. The problem *IS NOT* the nvidia kernel module it self, it works fine. What I suspect the problem *IS*, is the way the system handles the memory mapping, etc. *after* the module is loaded. Forget that the module involved is nvidia. It could just as well be called "The bunch of 1's and 0's, this many of them, that goes here when loaded" for purposes of this discussion.
Well, if you can reproduce the problem with another module to which you have the source then I'm sure they will investigate. You could always take a small module and add an array containing whatever pattern of 1s and 0s you think is important :)
What I _am_ saying is that on my Tyan board with its chipset/bios combination, and multicore AMD processors, the software (another module/kernel/pick you favorite part), is in fatal conflict with the driver and that for debugging purposes, there needs to be sufficient investigation to either rule in or rule out the problem.
Agreed. The issue is who should do that investigation. You seem to believe that it should be Novell/kernel developers. I - and others AFAICT - believe it should be nvidia. And they're unlikely to do it unless you report it to them.
Cheers, Dave
Good point. However, in the past with the ATI drivers, the way it worked was that I would report it to Novell, and then "we" (me working with the Novell developers) would get the ATI people involved and work the issue with all interested parties in the loop until it was either (1) fully understood; or (2) fixed. (You know, the logical way it should work.) See:
https://bugzilla.novell.com/show_bug.cgi?id=338930
https://bugzilla.novell.com/show_bug.cgi?id=338947
https://bugzilla.novell.com/show_bug.cgi?id=340459
https://bugzilla.novell.com/show_bug.cgi?id=344135
However, in this instance, the Novell response was "don't bother us with your issue, go away". WTF? That is the wrong answer even if it isn't a Novell issue. When Novell leads people to the nvidia driver, designs and offers its own "1-click" install for the driver through YAST, and promotes the fact that the driver install is a simple matter with its distro, then Novell certainly has a vested interest in at least being interested in, or part of, the solution in cases where it doesn't work, especially if the problem has the potential to affect a category of bios/chipset combinations.
I have found it makes far more sense to have Novell in the loop when addressing a problem affecting openSuSE than just beating on vendors' doors saying this doesn't work with SuSE.
In either regard, I would hope to see the developers take more of an interest in seeing problems fixed than simply acknowledging the issue and saying "not my problem, go away", and certainly not summarily dismiss a potential software issue by saying "the software says it isn't broken - so it isn't - now go away."
We'll see what kind of response we get from Nvidia.
-- David C. Rankin, J.D., P.E.
On Friday 2008-04-04 at 01:20 -0500, David C. Rankin wrote:
List,
How many of you have ever gotten the feeling that some of the developers with Novell bugzilla work harder trying to punt bugs than they do actually trying to understand whether a valid problem exists or not? I have been frustrated a number of times when gone the extra mile to provided a detailed documented bug submission, only to have someone at Novell try to punt the bug when it is apparent that not one iota of investigation has been done sufficient to either rule in or rule out a problem.
...
it for some cockemamy reason. The latest classic example:
https://bugzilla.novell.com/show_bug.cgi?id=376165
A software bug is being reported, but since the software log says the error is hardware, they try and close the bug. Notwithstanding the entire bug focuses on the systems response to having the nvidia kernel module loaded, and how it runs fine without the software, the entire bug was closed as invalid because the mcelog says some of the errors are hardware error. And, if mcelog say so, then it's got to be true! Sad, really.
The report has some points not clear. For one thing, if the problem occurs _only_ when the nvidia driver loaded, the one made by nvidia, then you will have a very hard time convincing anyone to even look at it. It is closed source! The kernel puts a message saying "Tainted!", and developers will not touch it as if it were leprous. They have their reasons and I wont argue them, nor who is to blame, but it is a sad situation for users, who certainly are not to be blamed. This is like a city transport strike: they go against the management, but it is the users who suffer it. Then at some point in the report you say that you got locks or crashes, days after removing that driver, and the log says "hardware". When the assignee says "its' hardware, then invalid", you say that it was triggered by having the nvidia module loaded. You see, if it happens when or after loading the nvidia module, they will not look at it. If it happens when loading or after the open source nv then you have a point, even if it says "hardware", but by reporting it on the same bugzilla as the one mentioning "nvidia", it will be dismissed point blank. I have had bugs dismissed just because I had a vmware or nvidia driver loaded, even if the bug was totally unrelated to those modules, and closed as invalid, no questions asked. I had to demonstrate that the bug reproduced without those closed source things and reopen. They could have asked whether I could reproduce the bug without those modules, but no, they closed it fast. They should find a compromise, a method, of handling these situations. Users need these closed source components. Linux is not complete without it. Have a look, consider what the problem might be, forward the report to nvidia or what ever, collaborate a bit in resolving it.... - -- Cheers, Carlos E. R. 
-- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org For additional commands, e-mail: opensuse+help@opensuse.org
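As a side note on the "Tainted!" message mentioned above: the kernel exposes its taint state as a bitmask in /proc/sys/kernel/tainted, so anyone can check what a developer will see before filing a report. The sketch below is only an illustration in Python, not part of the original thread. The three bit positions are the ones given in the kernel's tainted-kernels documentation for current kernels (older kernels define fewer flags), and the function names are made up for this example.

```python
# Selected taint bits, as listed in the kernel's tainted-kernels documentation.
# Only flags relevant to this thread are included; the real list is longer.
TAINT_FLAGS = {
    0: "P: a proprietary (non-GPL) module was loaded",
    4: "M: the processor reported a machine check exception",
    12: "O: an out-of-tree module was loaded",
}


def decode_taint(value):
    """Return descriptions of the known taint bits that are set in value."""
    return [text for bit, text in sorted(TAINT_FLAGS.items())
            if value & (1 << bit)]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the taint bitmask of the running kernel (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())
```

On a Linux machine, `decode_taint(read_taint())` returns an empty list for an untainted kernel. A value with bit 0 set is what makes a report look like "the nvidia driver was loaded", whatever the actual cause turns out to be.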
Carlos, If they had given that answer I would be 100% happy with it. The problem I have with the response I received is that it smelled very much like a George Bush justification for closing the bug. You know the type, an explanation that just can't be trusted. -- David C. Rankin, J.D., P.E. Rankin Law Firm, PLLC 510 Ochiltree Street Nacogdoches, Texas 75961 Telephone: (936) 715-9333 Facsimile: (936) 715-9339 www.rankinlawfirm.com
You seem to be misunderstanding what "mce" is. A machine check exception is the hardware itself telling you that something has gone badly wrong. There is no interpretation involved in the software. The software just logs the message. If the mce says it is a hardware problem, you can count on its being a hardware problem. Anders -- Madness takes its toll
Well, you're right as usual about my understanding. What I can't get my mind around is the fact that the mce(s) go away as long as the driver isn't loaded. I guess that's possible if the driver is calling hardware that doesn't get called otherwise. However, it just seems that if the mce(s) go away when the driver isn't loaded, then the driver or the system's response to the driver was the problem and not hardware. I'm ripping out the nvidia card and sticking a substitute in to check. -- David C. Rankin
The driver tells the hardware, among other things, where to transfer data via DMA. Such transfers are done asynchronously and if the driver gives bad addresses or transfer sizes to such a DMA-equipped card, then that card's actions (carrying out the driver's instructions) can cause "hardware" errors.
Randall Schulz
I don't think that will trigger an MCE, though. Anders
Well, one thing that can happen is that interrupt vectors can be clobbered by DMA activity, after which it can become impossible for the processor to continue because it gets a fault trying to execute from bad addresses or from valid addresses that contain gibberish instructions.
Randall Schulz
Well, it's obviously possible for software to hang a CPU; that's not in question. But I still don't think it will trigger a machine check exception. That's not the type of error those are designed to alert about. Anders
How does the CPU respond to a fault that occurs when it's processing a fault? Isn't that one of the things that machine-checks report?
Randall Schulz
I don't think so. http://en.wikipedia.org/wiki/Machine_Check_Exception We're talking exclusively about real hardware problems. Anders
On Friday 04 April 2008 10:01, Anders Johansson wrote:
I don't think so
It seems to me that what it says in the "Causes" section is what I'm talking about.
We're talking about exclusively real hardware problems
Not according to the article you referenced.
Randall Schulz
On Fri, Apr 4, 2008 at 9:20 AM, Anders Johansson <ajh@rydsbo.net> wrote:
You seem to be misunderstanding what "mce" is. A machine check exception is the hardware itself telling you that something has gone badly wrong. There is no interpretation involved in the software. The software just logs the message
If the mce says it is a hardware problem, you can count on its being a hardware problem
Anders
No, you can't count on that, Anders. Do some research on MCE errors and you will find these errors are often reported when there is absolutely nothing wrong with the machine. In fact Dell had a huge thread on their internal blog about the reporting of mce errors from linux users upon the arrival of core 2 duo machines. They were more than a little miffed at getting calls because some developer of the mce package with a swollen head put in language insisting it was hardware when others clearly demonstrated you could get to that part of the code with no hardware error at all.

It's quite possible for software bugs to hose things so badly that the mce modules think there was an error. Further, part of the mce software's job is to filter out the bogus MCE errors (or so says someone who shall remain nameless but whose email address is ak@suse.de). Now if the software's job is to filter out bogus mc events, that is a de facto assertion that lots of these events are bogus.

I've seen these in the past as well. Mine had to do with runaway keys, and the clue was the bit about TSC. Dual cores can get their timers to disagree to the point that it forces a failure. You would often see this with speed-step or power-now enabled, but simply locking the machine at the high-power setting would avoid the problem. For me the nohpet kernel command line parameter was required under suse 10.1. That solved all my instances. But that was on a core-2-duo.

-- ----------JSA---------
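For anyone trying the nohpet workaround described above, it is worth confirming that the running kernel really was booted with it. The following is a small Python sketch, not something from the thread; the function names are invented for the example, and it handles only simple space-separated parameters (no quoted values containing spaces).

```python
def parse_cmdline(text):
    """Split a kernel command line into bare flags and key=value options."""
    flags, options = set(), {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            options[key] = value  # e.g. root=/dev/sda2
        else:
            flags.add(key)        # e.g. nohpet
    return flags, options


def read_text(path):
    """Return the stripped contents of a small text file such as /proc/cmdline."""
    with open(path) as handle:
        return handle.read().strip()
```

On a Linux system, `parse_cmdline(read_text("/proc/cmdline"))` gives the parameters of the running kernel, so `"nohpet" in flags` shows whether the workaround is in effect. On current kernels the active time source can also be read with `read_text("/sys/devices/system/clocksource/clocksource0/current_clocksource")`.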
Anders Johansson wrote:
You seem to be misunderstanding what "mce" is. A machine check exception is the hardware itself telling you that something has gone badly wrong. There is no interpretation involved in the software. The software just logs the message
If the mce says it is a hardware problem, you can count on its being a hardware problem
Anders
Anders, There is some very interesting reading about this _hardware_ error I'm experiencing. It seems the exact _hardware_ error I'm experiencing was actually a _software_ error in the x86_64 code base that was "fixed" a while back. It sure seems suspicious that I am now getting these same errors. I've seen it before, and it's not too uncommon for a subsequent change to another part of the code to "unfix" an earlier solution. How do I go about finding out if this is one of those potential situations? This is what I really thought the developers would do when I filed the bug. How does a mere mortal go about such a task? For reference see: https://www.x86-64.org/pipermail/patches/2004-June.txt.gz -- David C. Rankin
Anders Johansson wrote:
You seem to be misunderstanding what "mce" is. A machine check exception is the hardware itself telling you that something has gone badly wrong. There is no interpretation involved in the software. The software just logs the message
If the mce says it is a hardware problem, you can count on its being a hardware problem
Anders
I also found this interesting:

Machine check handling on Linux, Andi Kleen, SUSE Labs, ak@suse.de, Aug 2004

Sources of machine checks can be the CPU, PCI IO, memory, caches, internal busses. The errors can be corrected errors (only logged to registers, no exception) or uncorrected errors (exception happens, software must react). When PCI IO errors are enabled "machine checks could be also caused by software bugs in drivers" (however, normally not the case).

See: http://www.halobates.de/mce.pdf -- David C. Rankin
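The corrected/uncorrected distinction quoted above can be read directly from the 64-bit STATUS value that accompanies each logged machine check. The Python sketch below is an illustration only, not part of the thread: it decodes the top flag bits of an MCi_STATUS register as they are defined in the AMD and Intel architecture manuals. The dictionary keys and function names are invented here, and the value has to be copied by hand from the STATUS field of a log entry.

```python
# Flag bits in the upper part of a 64-bit MCi_STATUS value.
STATUS_BITS = {
    "valid": 63,            # VAL: the register holds a valid error record
    "overflow": 62,         # OVER: an earlier error record was overwritten
    "uncorrected": 61,      # UC: the hardware could not correct the error
    "enabled": 60,          # EN: reporting was enabled for this error
    "misc_valid": 59,       # MISCV: the MISC register holds extra data
    "addr_valid": 58,       # ADDRV: the ADDR register holds an address
    "context_corrupt": 57,  # PCC: processor context may be corrupt
}


def decode_status(status):
    """Map each flag name to True or False for the given STATUS value."""
    return {name: bool((status >> bit) & 1)
            for name, bit in STATUS_BITS.items()}


def is_corrected(status):
    """True for a valid record whose error was corrected by the hardware."""
    flags = decode_status(status)
    return flags["valid"] and not flags["uncorrected"]
```

A record for which `is_corrected` returns True is the "only logged to registers, no exception" case from the paper, which would be consistent with events that show up in the log while the machine keeps running normally.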
Anders Johansson wrote:
You seem to be misunderstanding what "mce" is. A machine check exception is the hardware itself telling you that something has gone badly wrong. There is no interpretation involved in the software. The software just logs the message
If the mce says it is a hardware problem, you can count on its being a hardware problem
Anders
Jan, Anders, List: The more I read, and the more I test, the more I am concerned that there may be a simmering issue with the x86_64 code. I installed a plain-Jane pci-e ATI card running with the open source driver. Just as with the nvidia 8600GT card (using the open source "nv" driver), the system still gives occasional MCEs. Just as with the 8600, the MCEs do not have any effect on the system. If I wasn't logging them with mcelog, I would never know they were occurring.

Reading the tech-docs, it is readily apparent that MCE doesn't necessarily mean hardware. Software is more than capable of causing them:

AMD64 Architecture Programmer’s Manual Volume 2: System Programming, 2.6.6 New Exception Conditions: "The AMD64 architecture defines a number of new conditions that can cause an exception to occur when the processor is running in long mode. Many of the conditions occur when software attempts to use an address that is not in canonical form. See “Vectors” on page 208 for information on the new exception conditions that can occur in long mode."

See: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/2459...
See also: AMD64 - http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_7044,00...
Opteron specific - http://www.amd.com/us-en/Processors/TechnicalResources/0,,30_182_739_9003,00...

My question is, "What type of additional logging or data capture should I be doing in hopes of catching or narrowing down what the real cause of the MCE is?" I'm capturing the MCEs with mcelog running every minute under cron to ensure the buffers never get filled. But beyond that, I'm not doing any other special logging. The only hardware I haven't changed is the motherboard, and that tests fine. What else could I run/log/set that would give me the best chance of finding the real culprit? Any help is much appreciated.

-- David C. Rankin
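One cheap addition to the per-minute mcelog run asked about above would be to record, at the same interval, which kernel modules are loaded, so that each logged MCE can later be matched against the driver state at that time. The Python sketch below is only a suggestion and not something proposed in the thread. The function names, the log line layout and the default watch list are invented for the example; the only outside assumption is that the first field of each /proc/modules line is the module name.

```python
import time


def parse_modules(text):
    """Return module names from text in /proc/modules format."""
    return [line.split()[0] for line in text.splitlines() if line.strip()]


def snapshot_line(modules_text, watched=("nvidia",), now=None):
    """Build one log line: timestamp, watched modules loaded, module count."""
    names = parse_modules(modules_text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    loaded = [name for name in watched if name in names]
    return "%s loaded=%s total=%d" % (stamp, ",".join(loaded) or "-", len(names))


def append_snapshot(log_path, modules_path="/proc/modules"):
    """Append one snapshot line to log_path (intended to be run from cron)."""
    with open(modules_path) as src:
        line = snapshot_line(src.read())
    with open(log_path, "a") as dst:
        dst.write(line + "\n")
```

Calling `append_snapshot` with a log file of your choice from the same cron job that runs mcelog gives two logs with comparable timestamps. This only helps to correlate events; it does not say anything about the cause.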
It sounds like you now have a means of reproducing the problem with two flavours of hardware and with only open-source software loaded. If that's so, I'd suggest opening a new bugzilla and only mentioning those configurations and logs. I would expect Novell to investigate that, but if it doesn't affect your system I don't suppose it will be high priority! Cheers, Dave
Well Dave, While not conceding any aspect of the point regarding the "hardware v. software" source of the problem, I have shrunk from the immediate software battle and resorted to the brute force approach. With only 2 possible remaining candidates as the source of any "hardware" error, the mb is on its way back to Tyan. After the replacement is received, if errors continue (I wager they will), the Opteron is going back to AMD, and then if the problems continue, we will remount the software attack by reopening the bug report. At that point it will either be a true hardware incompatibility or a true bug in the x86-64 code in dealing with that hardware combination. Thanks to all for the great discussion on this thread. I will start a new thread as a continuation after all the new hardware is received. By the way, much of the content at 3111skyline is off-line until the hardware returns, just in case anyone finds a broken link from the prior posts. -- David C. Rankin
On Friday 04 April 2008 08:20:02 David C. Rankin wrote:
List,
How many of you have ever gotten the feeling that some of the developers with Novell bugzilla work harder trying to punt bugs than they do actually trying to understand whether a valid problem exists or not? I have been frustrated a number of times when gone the extra mile to provided a detailed documented bug submission, only to have someone at Novell try to punt the bug when it is apparent that not one iota of investigation has been done sufficient to either rule in or rule out a problem.
This is very much not true. If there is a software problem in a core package, especially the kernel, it will get attention, and lots of it. In this case, however, it just looks like accessing the graphics card in any form more advanced than a frame buffer triggers a hardware error. As I mentioned in the other mail, the software doesn't analyse anything; this message is coming from the hardware itself. If the message is wrong, the hardware vendor is still to blame, for sending bad machine check exceptions. Anders
participants (9): Anders Johansson, Benji Weber, Carlos E. R., Dave Howorth, David C. Rankin, Jan Engelhardt, John Andersen, Randall R Schulz, Richard Creighton