Re: [opensuse] Unstable system - culprit identified
Per Jessen wrote:
David C. Rankin wrote:
Per Jessen wrote:
Hi David
I think I did reply (a little late though) on-list, but it seems to me that the key thing is that you're not doing anything to provoke the lockups.
Per,
I have been doing a little more investigation under the guidance of, and with the help of, master kernel builder, Sir Engelhardt. (I hope I remembered that right ;-)
Nope, I'm sure he's German so he can't be a Sir - maybe a Herr? - sounds very similar :-)
One issue that looks promising as the culprit is the nvidia module. Were you by chance also loading the nvidia module on your Gigabyte system? What video card were you using?
No, I've got an ATI Radeon card and I'm using the AMD drivers.
Also, what is a good torture test to run to see if I can make the system lock? IIRC you were using mprime. Any other simple ones you know of? Thanks.
mprime is the best stress test I know. It just seems to be able to get into all the corners where you'd normally never go.
/Per
Well, we had plenty of mcelog errors before removing the nvidia driver and switching to the stock "nv" driver; we have not seen any since. That's running mprime while running XP with it downloading and installing updates as well. The combination of removing the nvidia driver and passing "acpi_use_timer_override" seems to have taken care of 99% of the problem. However, the mce errors are hardware errors, so it looks like the nvidia 8600GT card causes real problems when the full proprietary nvidia kernel module is loaded. Hmm, no more compiz until this is resolved.
Thanks for your response
--
David C. Rankin, J.D., P.E.
Rankin Law Firm, PLLC
510 Ochiltree Street
Nacogdoches, Texas 75961
Telephone: (936) 715-9333
Facsimile: (936) 715-9339
www.rankinlawfirm.com
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
David C. Rankin wrote:
For those of you that recall the thread, I thought I would provide the list with a closing chapter in the MCE hell I went through with the latest Tyan S2865ANRF and Opteron 180 box I built. After struggling for weeks with "machine check events" and replacing virtually everything in the box, trying both nVidia and ATI video cards (with and without the proprietary drivers), and rma'ing the ram back to OCZ, I finally rma'ed the motherboard back to Tyan on 4/12.
I received a replacement (not new) board back last Friday and rebuilt the system. So far it has been running without a single mce through all manner of torture (mprime -t, etc.). The primary torture that would cause mce's before replacement was accessing a vnc session and starting virtual box with a copy of winXP running across the remote vnc session. That works just fine now without a single mce.
So I guess, case closed. It was a faulty motherboard. Thanks again to all those that helped with the diagnosis.
--
David C. Rankin
On Tue, May 6, 2008 at 11:47 AM, David C. Rankin <drankinatty@suddenlinkmail.com> wrote:
So I guess, case closed. It was a faulty motherboard. Thanks again for all those that helped with the diagnosis.
That is what was told to you when the bug was closed, you should have listened then and saved everyone time, including valuable time kernel developers like GKH wasted on that bug[1]
Logs don't lie and devs closing bugs know what they are doing ;)
Cheers
-J
[1] https://bugzilla.novell.com/show_bug.cgi?id=376165#c10
CyberOrg wrote:
On Tue, May 6, 2008 at 11:47 AM, David C. Rankin <drankinatty@suddenlinkmail.com> wrote:
So I guess, case closed. It was a faulty motherboard. Thanks again for all those that helped with the diagnosis.
That is what was told to you when the bug was closed, you should have listened then and saved everyone time, including valuable time kernel developers like GKH wasted on that bug[1]
Logs don't lie and devs closing bugs know what they are doing ;)
Cheers
-J
That's precisely what I did. I'm sure the last time you had mce problems, you knew right off the bat that it was the motherboard, not the memory, not the memory controller, not the software that CAN incorrectly flag mces, not the video hardware or driver, etc....
I tore every piece of hardware apart on this machine. Surely you're not saying I ignored the possibility it was hardware, are you?
The devs were right this time and I graciously told them so. See: https://bugzilla.novell.com/show_bug.cgi?id=376165. However, that is not always the case, as we learned in spades pulling back the ugly cover on the mce error reporting scheme.
--
David C. Rankin
CyberOrg wrote:
On Tue, May 6, 2008 at 11:47 AM, David C. Rankin <drankinatty@suddenlinkmail.com> wrote:
So I guess, case closed. It was a faulty motherboard. Thanks again for all those that helped with the diagnosis.
That is what was told to you when the bug was closed, you should have listened then and saved everyone time, including valuable time kernel developers like GKH wasted on that bug[1]
Logs don't lie and devs closing bugs know what they are doing ;)
In general, yes. But not always. I've seen tickets closed (in both development and support contexts) not because a problem is solved, but because someone is sick of working on the problem, and hoping that it has gone away (i.e. they'll work on it if someone opens up a new ticket). One factor which makes this happen usually in support situations is when some pointy-heads have arbitrarily designated some sort of contractual obligation to get X% of problems solved in Y amount of time...and a ticket open for several days "screws up our performance metrics", and therefore the contract.
On Tue, 06 May 2008 12:33:40 -0400, Sam Clemens wrote:
when some pointy-heads have arbitrarily designated some sort of contractual obligation to get X% of problems solved in Y amount of time...and a ticket open for several days "screws up our performance metrics", and therefore the contract.
In Germany the analysis of tools data for the purpose of evaluating employee performance can only happen if the works council agrees. Even the use of any tool that can possibly be used to analyze employee performance in a company has to be negotiated with the works council. But you'll typically find such pointy-head stuff in support and not in R&D.
Philipp
On Thursday 2008-05-08 at 11:56 +0200, Philipp Thomas wrote:
On Tue, 06 May 2008 12:33:40 -0400, Sam Clemens wrote:
when some pointy-heads have arbitrarily designated some sort of contractual obligation to get X% of problems solved in Y amount of time...and a ticket open for several days "screws up our performance metrics", and therefore the contract.
In Germany the analysis of tools data for the purpose of evaluating employee performance can only happen if the works council agrees. Even the use of any tool that can possibly be used to analyze employee performance in a company has to be negotiated with the works council.
But you'll typically find such pointy-head stuff in support and not in R&D.
You do not need to evaluate employee performance. You simply look at the ticket length statistics, and if it is longer than what the support contract specifies, you go and search for the long tickets, those that have been open for three months, have a brief look at them, and close them directly. Or, less brutally, you hold a team brainstorming session to have a go at those tickets and try to close them that day.
There is no need to even try to put blame on somebody. It is assumed that they haven't been closed because it has been impossible. No matter. Closed now.
Say a ticket is left open because you are waiting for an answer from somebody in some other country who hasn't said a word in three months... OK, close as "insufficient data". Or close as "probably solved" and tell the call center to ask the client to try again. If it doesn't work, it will be another ticket.
Tricks! ;-)
--
Cheers,
Carlos E. R.
On Thu, 8 May 2008 12:46:44 +0200 (CEST), Carlos E. R. wrote:
You do not need to evaluate employee performance. You simply look at the ticket length statistics, and if it is longer than what the support contract specifies, you go and search for the long tickets, those that have been open for three months, have a brief look at them, and close them directly.
Strictly speaking, even this would require negotiation with the works council.
Philipp
Philipp Thomas wrote:
On Thu, 8 May 2008 12:46:44 +0200 (CEST), Carlos E. R. wrote:
You do not need to evaluate employee performance. You simply look at the ticket length statistics, and if it is longer than what the support contract specifies, you go and search for the long tickets, those that have been open for three months, have a brief look at them, and close them directly.
Strictly speaking, even this would require negotiation with the works council.
The scenario I presented has nothing to do with that. It's merely company A contractually agreeing to meet certain performance levels in the service they provide to company B. Company A can meet that however they choose (throwing lots of admins at the problem...or paying for highly experienced admins...or both...depending on the situation). It has nothing to do with socialist "works councils," as it has no impact on the worker's environment or how he is treated -- at least it shouldn't be. It's all about management policy (and whatever stupid agreements they make among themselves).
On Thu, 08 May 2008 18:37:25 -0400, Sam Clemens wrote:
It has nothing to do with socialist "works councils,"
Since when are unions or works councils socialist? It's the law in Germany that grants employees certain rights and it's the job of elected council members to see that those rights are respected.
it has no impact on the worker's environment or how he is treated -- at least it shouldn't be.
It always will impact an employee's rating. It's like managers rating your work as a programmer purely based on lines of code.
It's all about management policy (and whatever stupid agreements they make among themselves).
And then it ends up being part of a pointy-head's objectives and thus has influence on employees.
But I'm getting far too carried away and this is way off topic, so this will be my last mail on the subject here.
Philipp
Philipp Thomas wrote:
On Thu, 08 May 2008 18:37:25 -0400, Sam Clemens wrote:
It has nothing to do with socialist "works councils,"
Since when are unions or works councils socialist? It's the law in Germany that grants employees certain rights and it's the job of elected council members to see that those rights are respected.
If it's in Germany, it's socialist. Why has no free market economy ever needed such things?
Closing tickets has nothing to do with workers' rights. Get a clue.
On Friday 2008-05-09 at 00:17 +0200, Philipp Thomas wrote:
On Thu, 8 May 2008 12:46:44 +0200 (CEST), Carlos E. R. wrote:
You do not need to evaluate employee performance. You simply look at the ticket length statistics, and if it is longer than what the support contract specifies, you go and search for the long tickets, those that have been open for three months, have a brief look at them, and close them directly.
Strictly speaking, even this would require negotiation with the works council.
I suppose it depends on your legislation. Here, I have done it more than once: we didn't evaluate employees, we simply removed tickets fast, as a team. It's a team job, after all: the one responsible is the team boss, if any.
--
Cheers,
Carlos E. R.
On Fri, 9 May 2008 02:18:58 +0200 (CEST), Carlos E. R. wrote:
we didn't evaluate employees, we simply removed tickets fast, as a team. It's a team job, after all: the one responsible is the team boss, if any.
As I said: it's normal for support to act so. And of course developers are reminded to act fast on bug reports. But I've not encountered any contractual agreements regarding bug reports.
Philipp
On Friday 2008-05-09 at 10:46 +0200, Philipp Thomas wrote:
On Fri, 9 May 2008 02:18:58 +0200 (CEST), Carlos E. R. wrote:
we didn't evaluate employees, we simply removed tickets fast, as a team. It's a team job, after all: the one responsible is the team boss, if any.
As I said: it's normal for support to act so. And of course developers are reminded to act fast on bug reports. But I've not encountered any contractual agreements regarding bug reports.
Sorry, I got too carried away; I didn't mean to say that this is the case with suse. I spoke of the general case, something that happens when tasks are divided between companies with contracts and such, especially if there is not a "good spirit" or "friendship", I don't know the exact word. Maybe it depends on the business culture of each country. I have lived that situation from all ends: as plain employee, team chief, and client.
I.e., company A hires company B to handle the ticketing, and in order to specify the degree of work involved, they stipulate that tickets have to be handled, say, before 24 hours pass. If there is little money, 'B' will be tempted to cheat and stick to the letter of the contract, or close tickets without real cause.
But as you say, we are going too offtopic O:-)
--
Cheers,
Carlos E. R.
Philipp Thomas wrote:
On Thu, 8 May 2008 12:46:44 +0200 (CEST), Carlos E. R. wrote:
You do not need to evaluate employee performance. You simply look at the ticket length statistics, and if it is longer than what the support contract specifies, you go and search for the long tickets, those that have been open for three months, have a brief look at them, and close them directly.
Strictly speaking, even this would require negotiation with the works council.
I think that must be _very_ strictly speaking - when I worked in Germany, it was certainly not an uncommon process, and the works council did care.
/Per Jessen, Zürich
Per Jessen wrote:
I think that must be _very_ strictly speaking - when I worked in Germany, it was certainly not an uncommon process, and the works council did care.
That should obviously have been a "didn't".
/Per Jessen, Zürich
* Per Jessen (per@computer.org) [20080516 11:30]:
I think that must be _very_ strictly speaking
It was.
when I worked in Germany, it was certainly not an uncommon process, and the works council did care.
Of course, you won't make a fuss over such a small matter.
Philipp
Sam Clemens wrote:
One factor which makes this happen usually in support situations is when some pointy-heads have arbitrarily designated some sort of contractual obligation to get X% of problems solved in Y amount of time...and a ticket open for several days "screws up our performance metrics", and therefore the contract.
This, AFAICS, does not happen in R&D at all, and I think that managers will **not** be happy if they find this behaviour... (aka, they will start "kicking asses" :-P )
You cannot make a company sustainable in any way if you measure the progress "per report closed" as your main indicator; that is like measuring software progress by lines of code...
--
"Progress is possible only if we train ourselves to think about programs without thinking of them as pieces of executable code." - Edsger W. Dijkstra
Cristian Rodríguez R.
Platform/OpenSUSE - Core Services
SUSE LINUX Products GmbH
Research & Development
http://www.opensuse.org/
Is it possible this thread has wandered off topic far enough to be taken to the offtopic list?
--
----------JSA---------
John Andersen wrote:
Is it possible this thread has wandered off topic far enough to be taken to the offtopic list?
RMA'ed Tyan Tomcat K8E S2865ANRF Socket939 with Opteron 180 now up 7 days, not a single MCE ;-)
--
David C. Rankin
Cristian Rodríguez wrote:
Sam Clemens wrote:
One factor which makes this happen usually in support situations is when some pointy-heads have arbitrarily designated some sort of contractual obligation to get X% of problems solved in Y amount of time...and a ticket open for several days "screws up our performance metrics", and therefore the contract.
This, AFAICS, does not happen in R&D at all, and I think that managers will **not** be happy if they find this behaviour... (aka, they will start "kicking asses" :-P )
Who said anything about R&D?
You cannot make a company sustainable in any way if you measure the progress "per report closed" as your main indicator; that is like measuring software progress by lines of code...
Nevertheless, EDS and HP both cook up crazy contracts like that with the various automotive companies.
Maybe if I had said "pointy-haired bosses" it would have been clearer.
On Thursday 2008-05-08 at 22:25 -0400, Cristian Rodríguez wrote:
You cannot make a company sustainable in any way if you measure the progress "per report closed" as your main indicator; that is like measuring software progress by lines of code...
You are right, but it happens. And yes, my progress as a programmer was measured in terms of "this should have been working months ago", said by somebody who is not a programmer, nor even close to one, and who cannot evaluate the amount of work a certain programming job really involves (try to maintain deliberately obfuscated code by a previous disgruntled programmer...). I'm sure many have been in similar situations.
In the end, it is not sustainable. Till that end, they subsist... causing all kinds of problems and grief.
So... I can't really help evaluating others by my own experience. Maybe not fair, but unavoidable :-) Meaning that I have suffered the "average time per report closed" as a "normal" evaluation method, so I'm biased to think everybody does the same.
--
Cheers,
Carlos E. R.
David C. Rankin wrote:
So I guess, case closed. It was a faulty motherboard. Thanks again for all those that helped with the diagnosis.
Glad to hear that
Dave P
participants (10)
- Carlos E. R.
- Cristian Rodríguez
- CyberOrg
- Dave Plater
- David C. Rankin
- John Andersen
- Per Jessen
- Philipp Thomas
- Philipp Thomas
- Sam Clemens