False "lost ticks" on dual-Opteron system (=> timer twice as fast)
Hi, I've recently set up a dual Opteron RAID server (AMD-8000-based Tyan Thunder K8S Pro SCSI board, 2 246 Opterons, stepping 10). Kernel is a modified 2.6.11.4-20a from SuSE 9.3 (SMP version, sure). The Opterons are capable of changing the CPU frequency (between 1GHz and 2GHz). The system clock runs (on average) about twice as fast as it should be. A closer observation revealed that the clock jumps forward by about 10-30 seconds every 10-30 seconds (plus other oddities, including backward clock jumps). The timer interrupts are distributed roughly evenly among the two CPUs, but looking at the timer interrupt number (grep timer /proc/interrupts) revealed that for about 10-30 seconds, one CPU gets the interrupt, and then the other CPU gets them; the transition causes the system clock to advance. A quick look at timer_interrupt shows what I suspect is the culprit: Each CPU keeps track of the last TSC at a timer interrupt, and adds the "lost" ticks to jiffies when perceived necessary. If there's only a single jiffies, but two vxtime.last_tsc, it can't work. A quick workaround would be to ditch the handling of the "lost" jiffies. I still anticipate to have annoying time skews by do_gettimeoffset() (that's what explains the other oddities - if I do gettimeofday() on the CPU that isn't getting interrupts, I'll going to add the "lost" jiffies, too). A proposed fix would be to *also* store the last jiffies value in the vxtime variable, and verify if it's really *this* CPU that did miss the timer interrupts. This local "last-stored-jiffies" can help do_gettimeoffset() to calculate the local time good enough on both CPUs. What I can't believe is that I'm the only one who has this problem. <rant>I know the timer system on an Intel or AMD system is broken by design, because there should be a single constant-clocked atomically read-only system-wide timer. But this is no excuse for that ;-).</rant> -- Bernd Paysan "If you want it done right, you have to do it yourself" http://www.jwdt.com/~paysan/
On Sun, May 08, 2005 at 02:45:20PM +0200, Bernd Paysan wrote:
Hi,
I've recently set up a dual Opteron RAID server (AMD-8000-based Tyan Thunder K8S Pro SCSI board, 2 246 Opterons, stepping 10). Kernel is a modified 2.6.11.4-20a from SuSE 9.3 (SMP version, sure). The Opterons are capable of changing the CPU frequency (between 1GHz and 2GHz).
Your system should be using the HPET timer to work exactly around this. AMD 8000 has HPET. Can you post a boot.log?
The system clock runs (on average) about twice as fast as it should be. A closer observation revealed that the clock jumps forward by about 10-30 seconds every 10-30 seconds (plus other oddities, including backward clock jumps). The timer interrupts are distributed roughly evenly among the two CPUs, but looking at the timer interrupt number (grep timer /proc/interrupts) revealed that for about 10-30 seconds, one CPU gets the interrupt, and then the other CPU gets them; the transition causes the system clock to advance.
A quick look at timer_interrupt shows what I suspect is the culprit: Each CPU keeps track of the last TSC at a timer interrupt, and adds the
No, it doesn't. TSC is kept only globally right now. Obviously that is problem if the TSCs run at different frequencies (it actually is a problem even without powernow, just a much smaller one), but that is why HPET is used instead. There are some plans to change that in the future, but it hasn't happened yet.
"lost" ticks to jiffies when perceived necessary. If there's only a single jiffies, but two vxtime.last_tsc, it can't work.
A quick workaround would be to ditch the handling of the "lost" jiffies. I still anticipate to have annoying time skews by do_gettimeoffset() (that's what explains the other oddities - if I do gettimeofday() on the CPU that isn't getting interrupts, I'll going to add the "lost" jiffies, too). A proposed fix would be to *also* store the last jiffies value in the vxtime variable, and verify if it's really *this* CPU that did miss the timer interrupts. This local "last-stored-jiffies" can help do_gettimeoffset() to calculate the local time good enough on both CPUs.
The current design is that only the BP runs the main timer, and the other CPUs use the APIC timer and don't do any own time keeping. I think you misread the code quite a bit. And lost jiffie handling can't be dropped no. A common problem however is that the irq 0 is misrouted somehow, and gets broadcasted and processed on multiple CPUs. That results in the time running far too fast. You can check that by looking at /proc/interrupts. -Andi
On Sunday 08 May 2005 15:40, Andi Kleen wrote:
Your system should be using the HPET timer to work exactly around this. AMD 8000 has HPET. Can you post a boot.log?
Will come tomorrow - I don't sit right at the machine, and while trying to figure out what happens, I accidentally shut it down or caused it to crash (I can't log in remotely ATM).
The current design is that only the BP runs the main timer, and the other CPUs use the APIC timer and don't do any own time keeping. I think you misread the code quite a bit.
And lost jiffie handling can't be dropped no.
A common problem however is that the irq 0 is misrouted somehow, and gets broadcasted and processed on multiple CPUs. That results in the time running far too fast. You can check that by looking at /proc/interrupts.
Yes, that's sort of what's happening. /proc/interrupts shows that all CPUs overall get an even share of IRQ 0 - but each IRQ0 is processed by just one CPU. How can I examine and set the interrupt routing? -- Bernd Paysan "If you want it done right, you have to do it yourself" http://www.jwdt.com/~paysan/
On Sunday 08 May 2005 15:40, Andi Kleen wrote:
Your system should be using the HPET timer to work exactly around this. AMD 8000 has HPET. Can you post a boot.log?
Ok, boot.log attached. The only entry with hpet seems to indicate some problems.
A common problem however is that the irq 0 is misrouted somehow, and gets broadcasted and processed on multiple CPUs. That results in the time running far too fast. You can check that by looking at /proc/interrupts.
After rebooting today (and doing basically nothing), things look like that: # while [ .T. ]; do sleep 1; echo -n $(date); grep timer /proc/interrupts; done Mo Mai 9 12:47:37 CEST 2005 0: 4156834 4466062 IO-APIC-edge timer Mo Mai 9 12:47:38 CEST 2005 0: 4157847 4466062 IO-APIC-edge timer Mo Mai 9 12:47:39 CEST 2005 0: 4158861 4466062 IO-APIC-edge timer Mo Mai 9 12:47:40 CEST 2005 0: 4159874 4466062 IO-APIC-edge timer Mo Mai 9 12:47:41 CEST 2005 0: 4160886 4466062 IO-APIC-edge timer Mo Mai 9 12:47:42 CEST 2005 0: 4161899 4466062 IO-APIC-edge timer Mo Mai 9 12:47:43 CEST 2005 0: 4162913 4466062 IO-APIC-edge timer Mo Mai 9 12:47:44 CEST 2005 0: 4163926 4466062 IO-APIC-edge timer Mo Mai 9 12:47:45 CEST 2005 0: 4164938 4466062 IO-APIC-edge timer Mo Mai 9 12:47:46 CEST 2005 0: 4165951 4466062 IO-APIC-edge timer Mo Mai 9 12:47:47 CEST 2005 0: 4166396 4466631 IO-APIC-edge timer Mo Mai 9 12:47:48 CEST 2005 0: 4166396 4467644 IO-APIC-edge timer Mo Mai 9 12:47:49 CEST 2005 0: 4166396 4468656 IO-APIC-edge timer Mo Mai 9 12:47:50 CEST 2005 0: 4166396 4469668 IO-APIC-edge timer Mo Mai 9 12:47:51 CEST 2005 0: 4166396 4470681 IO-APIC-edge timer Mo Mai 9 12:47:52 CEST 2005 0: 4166396 4471694 IO-APIC-edge timer Mo Mai 9 12:47:53 CEST 2005 0: 4166396 4472708 IO-APIC-edge timer Mo Mai 9 12:47:54 CEST 2005 0: 4166396 4473720 IO-APIC-edge timer Mo Mai 9 12:47:55 CEST 2005 0: 4166396 4474733 IO-APIC-edge timer Adding load to one CPU changes things: # cat /dev/zero >/dev/null & # speed 1000000 2000000 # while [ .T. ]; do sleep 1; echo -n $(date); grep timer /proc/interrupts; done Mo Mai 9 12:48:52 CEST 2005 0: 4195741 4500873 IO-APIC-edge timer Mo Mai 9 12:48:53 CEST 2005 0: 4195741 4501882 IO-APIC-edge timer Mo Mai 9 12:48:54 CEST 2005 0: 4195741 4502893 IO-APIC-edge timer Mo Mai 9 12:48:55 CEST 2005 0: 4195741 4503902 IO-APIC-edge timer Mo Mai 9 12:48:56 CEST 2005 0: 4195741 4504913 IO-APIC-edge timer Mo Mai 9 12:49:01 CEST 2005 0: 4195958 4505706 IO-APIC-edge timer Mo Mai 9 12:49:03 CEST 2005 0: 4196968 4505706 IO-APIC-edge timer Mo Mai 9 12:49:04 CEST 2005 0: 4197977 4505706 IO-APIC-edge timer Mo Mai 9 12:49:06 CEST 2005 0: 4198986 4505706 IO-APIC-edge timer Mo Mai 9 12:49:07 CEST 2005 0: 4199997 4505706 IO-APIC-edge timer Mo Mai 9 12:49:09 CEST 2005 0: 4201006 4505706 IO-APIC-edge timer Mo Mai 9 12:49:10 CEST 2005 0: 4202015 4505706 IO-APIC-edge timer Mo Mai 9 12:49:04 CEST 2005 0: 4202868 4505706 IO-APIC-edge timer Mo Mai 9 12:49:12 CEST 2005 0: 4203675 4505706 IO-APIC-edge timer Mo Mai 9 12:49:14 CEST 2005 0: 4204685 4505706 IO-APIC-edge timer Mo Mai 9 12:49:15 CEST 2005 0: 4205376 4505713 IO-APIC-edge timer Mo Mai 9 12:49:16 CEST 2005 0: 4205376 4506724 IO-APIC-edge timer Mo Mai 9 12:49:17 CEST 2005 0: 4205376 4507734 IO-APIC-edge timer Mo Mai 9 12:49:18 CEST 2005 0: 4205376 4508743 IO-APIC-edge timer Mo Mai 9 12:49:19 CEST 2005 0: 4205376 4509752 IO-APIC-edge timer Mo Mai 9 12:49:20 CEST 2005 0: 4205376 4510761 IO-APIC-edge timer After stopping the load, the hickups continue: Mo Mai 9 12:56:28 CEST 2005 0: 4312541 4585753 IO-APIC-edge timer Mo Mai 9 12:56:29 CEST 2005 0: 4313554 4585753 IO-APIC-edge timer Mo Mai 9 12:56:30 CEST 2005 0: 4314568 4585753 IO-APIC-edge timer Mo Mai 9 12:57:20 CEST 2005 0: 4315424 4585756 IO-APIC-edge timer Mo Mai 9 12:57:21 CEST 2005 0: 4316437 4585756 IO-APIC-edge timer Mo Mai 9 12:57:22 CEST 2005 0: 4317449 4585756 IO-APIC-edge timer Mo Mai 9 12:57:23 CEST 2005 0: 4318461 4585756 IO-APIC-edge timer Mo Mai 9 12:57:24 CEST 2005 0: 4319474 4585756 IO-APIC-edge timer -- Bernd Paysan "If you want it done right, you have to do it yourself" http://www.jwdt.com/~paysan/
On Monday 09 May 2005 12:53, Bernd Paysan wrote:
On Sunday 08 May 2005 15:40, Andi Kleen wrote:
Your system should be using the HPET timer to work exactly around this. AMD 8000 has HPET. Can you post a boot.log?
Ok, boot.log attached. The only entry with hpet seems to indicate some problems.
I went through the BIOS setup, and found a disabled feature "ACPI 2.0", which I enabled. Now Linux finds the HPET timer. # grep -i hpet boot.log ACPI: HPET (v001 A M I OEMHPET 0x04000518 MSFT 0x00000097) @ 0x00000000e8ff3c30 ACPI: HPET id: 0x102282a0 base: 0xfec01000 time.c: Using 14.318180 MHz HPET timer. time.c: Using HPET based timekeeping. hpet0: at MMIO 0xfec01000, IRQs 2, 8, 0 hpet0: 69ns tick, 3 32-bit timers hpet_acpi_add: no address or irqs in _CRS and everything appears to work (though there's still no designated CPU to handle the timer interrupts). xntpd syncs quickly, I'm happy (so far ;-). So that explains why nobody sees this problem. But the TSC-based fallback timekeeping is still broken on SMP systems with PowerNow and distributed IRQ handling, which both together seem to be rare enough ;-). -- Bernd Paysan "If you want it done right, you have to do it yourself" http://www.jwdt.com/~paysan/
I went through the BIOS setup, and found a disabled feature "ACPI 2.0", which I enabled. Now Linux finds the HPET timer.
Great. The machine came like this? I wish OEMs wouldn't do such things...
# grep -i hpet boot.log ACPI: HPET (v001 A M I OEMHPET 0x04000518 MSFT 0x00000097) @ 0x00000000e8ff3c30 ACPI: HPET id: 0x102282a0 base: 0xfec01000 time.c: Using 14.318180 MHz HPET timer. time.c: Using HPET based timekeeping. hpet0: at MMIO 0xfec01000, IRQs 2, 8, 0 hpet0: 69ns tick, 3 32-bit timers hpet_acpi_add: no address or irqs in _CRS
and everything appears to work (though there's still no designated CPU to handle the timer interrupts). xntpd syncs quickly, I'm happy (so far ;-).
Great.
So that explains why nobody sees this problem. But the TSC-based fallback timekeeping is still broken on SMP systems with PowerNow and distributed IRQ handling, which both together seem to be rare enough ;-).
There is a patch pending for the TSC problem - using the pmtimer instead in this case. But the distributed timer interrupt problem is weird. It should not happen. You sure it was IRQ 0 that was duplicated and not "LOC" ? When you watch -n1 cat /proc/interrupts does the rate roughly match up to 1000Hz? -Andi
On Tuesday 10 May 2005 13:12, Andi Kleen wrote:
So that explains why nobody sees this problem. But the TSC-based fallback timekeeping is still broken on SMP systems with PowerNow and distributed IRQ handling, which both together seem to be rare enough ;-).
There is a patch pending for the TSC problem - using the pmtimer instead in this case.
But the distributed timer interrupt problem is weird. It should not happen. You sure it was IRQ 0 that was duplicated and not "LOC" ?
Yes. Only one CPU actually gets and handles the timer interrupt, but which one is somewhat random (for about 10 seconds, it's the same CPU, then it switches over).
When you watch -n1 cat /proc/interrupts does the rate roughly match up to 1000Hz?
Yes, and this is confirmed over longer time: # grep timer /proc/interrupts; uptime 0: 40347440 40582285 IO-APIC-edge timer 1:26pm an 22:28, 1 user, Durchschnittslast: 0,00, 0,01, 0,04 # echo $[(3600*22+28*60)*1000] $[40347440+40582285] 80880000 80929725 Given that uptime is only accurate to the minute, this sounds very reasonable. The distribution also is close to 50:50. That's (almost) true for all interrupt sources: # cat /proc/interrupts CPU0 CPU1 0: 40523846 40753939 IO-APIC-edge timer 1: 3 189 IO-APIC-edge i8042 8: 261 280 IO-APIC-edge rtc 9: 0 0 IO-APIC-level acpi 15: 364369 364479 IO-APIC-edge ide1 169: 59195 55498 IO-APIC-level 3w-9xxx 177: 618198 604643 IO-APIC-level 3w-9xxx 185: 8195891 8147619 IO-APIC-level aic79xx, eth1 193: 0 30 IO-APIC-level aic79xx 201: 0 0 IO-APIC-level ohci_hcd, ohci_hcd NMI: 1184 1013 LOC: 81273966 81271958 ERR: 0 MIS: 0 -- Bernd Paysan "If you want it done right, you have to do it yourself" http://www.jwdt.com/~paysan/
On Tue, May 10, 2005 at 01:54:53PM +0200, Bernd Paysan wrote:
On Tuesday 10 May 2005 13:12, Andi Kleen wrote:
So that explains why nobody sees this problem. But the TSC-based fallback timekeeping is still broken on SMP systems with PowerNow and distributed IRQ handling, which both together seem to be rare enough ;-).
There is a patch pending for the TSC problem - using the pmtimer instead in this case.
But the distributed timer interrupt problem is weird. It should not happen. You sure it was IRQ 0 that was duplicated and not "LOC" ?
Yes. Only one CPU actually gets and handles the timer interrupt, but which one is somewhat random (for about 10 seconds, it's the same CPU, then it switches over).
That could be irqbalance doing its thing. Does it go away when you stop it? -Andi
On Tuesday 10 May 2005 15:07, Andi Kleen wrote:
That could be irqbalance doing its thing. Does it go away when you stop it?
Yes, it seems to go away. -- Bernd Paysan "If you want it done right, you have to do it yourself" http://www.jwdt.com/~paysan/
On Tue, May 10, 2005 at 03:15:44PM +0200, Bernd Paysan wrote:
On Tuesday 10 May 2005 15:07, Andi Kleen wrote:
That could be irqbalance doing its thing. Does it go away when you stop it?
Yes, it seems to go away.
Ok, it is fine then. A bit unexpected, but fine. -Andi
participants (2)
-
Andi Kleen
-
Bernd Paysan