On Wed, Apr 21, 2004 at 11:45:33AM -0500, Kevin_Gassiot@veritasdgc.com wrote:
We have seen a problem with copying a large volume between machines and the resulting volume having a different checksum than the source volume. We found a bad DIMM on the target machine using memtest, but we now have questions that need answers for our users.
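To narrow a whole-volume checksum mismatch down to a region, one option is to checksum fixed-size chunks on each machine and compare the two lists. The following is only a minimal sketch: the helper names `chunk_digests` and `differing_chunks` and the 1 MiB chunk size are made up for illustration, and it assumes the volume can be read as an ordinary file or block device.

```python
import hashlib

def chunk_digests(path, chunk_size=1 << 20):
    """Return one SHA-256 hex digest per chunk_size bytes of the file at path.

    Hypothetical helper for illustration; run it on the source and the
    target machine, then compare the two lists.
    """
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests

def differing_chunks(digests_a, digests_b):
    """Return indices of chunks that differ or exist in only one list."""
    longest = max(len(digests_a), len(digests_b))
    return [
        i for i in range(longest)
        if i >= len(digests_a)
        or i >= len(digests_b)
        or digests_a[i] != digests_b[i]
    ]
```

Multiplying a reported index by the chunk size gives the byte offset where the copies start to disagree.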
Since the memory controller is built into the Opteron, and the Opteron uses ECC memory, is the ECC correction done at the hardware level via the memory controller? Is this dependent on the motherboard chipset, or are the options the same regardless of chipset because the controller is built into the processor?
The complete memory controller, including ECC and chipkill handling, is in the CPU. This means the basics are the same regardless of the chipset. Whether it works, however, depends on how the BIOS programs it (the BIOS has to program ECC and various other modes) and possibly on how the motherboard is laid out. The kernel has nothing to do with memory controller programming.
I seem to remember seeing some messages about whether or not to turn on background scrubbing on the ECC system at the BIOS level. Are there problems with the Linux kernel doing this, or is it just a memory latency issue due to overhead?
The Linux kernel has no problems with background scrubbing, but I think there were errata in this area in some early steppings of the CPU. The BIOS should take care of that, though.
If the hardware controller cannot correct the error, does it raise an exception to the OS? If so, does the Linux kernel catch the signal? Does it log the error, or crash the system? Is this configurable?
The Opteron unfortunately cannot handle 2-bit errors; it will force a reboot. The BIOS should log this event in its BIOS event log after the reboot, so you can diagnose it. This is a hardware limitation; there is nothing the OS can do about it. If there is a corrected one-bit error, the kernel will detect it and log a message to the kernel log.
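If you want to check your logs for those corrected-error messages, something like the following rough sketch could help. Note the assumptions: the search pattern is a guess at what machine check reports contain (the exact wording differs between kernel versions), and `find_mce_lines` is a name invented here, not an existing tool.

```python
import re

# Assumption: machine check reports mention "machine check" or "MCE".
# The exact message text varies by kernel version; adjust the pattern
# to match what your kernel actually prints.
MCE_PATTERN = re.compile(r"machine check|\bmce\b", re.IGNORECASE)

def find_mce_lines(lines):
    """Return the log lines that look like machine check reports."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]
```

You could feed it the output of `dmesg` or an open syslog file; the log file location depends on the distribution.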
We have a mix of the UnitedLinux kernel 2.4.19-smp, SuSE 9.0 2.4.21-193-smp, and the 2.4.22 kernel from kernel.org. Do these kernels need to be ECC aware to catch any signals, and if so, are they?
They are all ECC aware; however, some older kernels had bugs in the way the MCE is decoded, so not every detail the kernel prints may be correct. The basic address and the fact that an error occurred can be trusted, though. -Andi