Andi Kleen <ak@suse.de> wrote:
Thanks for the useful description, Erich. If there was a FAQ for this mailing list I would propose to put this in because it's really asked a lot.
Some comments:
On Mon, Jul 11, 2005 at 09:06:06AM -0700, Erich Boleyn wrote:
I have an S2885, and with both BIOS versions where the "memory hole remapping: software" option was available, memory copy bandwidth in the bottom 4 GB (including the remapped portion) was considerably decreased, from about 1.7GB/sec in the non-remapped case to about 1.2GB/sec. YMMV, but it seems there is either a performance issue with the software remapping in general or a bug in the implementation. I've informed Tyan of the issue. (My server has 8GB physically installed, but only 7170-ish MB is available. With remapping, it puts the memory between 3 & 4GB up at the end of the address space, so physical memory appears to go up to the 9GB mark.)
Are you sure it works that way in the old pre-E-stepping world?
My understanding was that just the next DIMM is mapped above 4GB. e.g. if you have 2 2GB DIMMs, then first DIMM would be from 0 to 2GB and the next one from 4GB to 6GB with remapping on, leaving a 2GB hole.
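A minimal sketch of the address arithmetic in the two cases above, assuming a simplified model in which all RAM that would sit at or above the start of the hole is moved above 4GB. The function name and the model are illustrative only; they do not describe what any particular BIOS actually does.

```python
GB = 1 << 30

def remapped_ranges(installed, hole_start, hole_end=4 * GB):
    """Return the physical RAM ranges as (start, end) pairs.

    Simplified assumption: RAM below hole_start stays in place and
    everything else is placed contiguously starting at hole_end.
    """
    if installed <= hole_start:
        return [(0, installed)]  # nothing collides with the hole
    above = installed - hole_start
    return [(0, hole_start), (hole_end, hole_end + above)]

# 8GB installed, hole from 3GB to 4GB: RAM ends at the 9GB mark.
print([(s // GB, e // GB) for s, e in remapped_ranges(8 * GB, 3 * GB)])
# prints [(0, 3), (4, 9)]

# Two 2GB DIMMs, second DIMM moved above 4GB: a 2GB hole is left.
print([(s // GB, e // GB) for s, e in remapped_ranges(4 * GB, 2 * GB)])
# prints [(0, 2), (4, 6)]
```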
As the saying goes, we're both right. I said between 3 & 4GB, but I forgot to mention my system is loaded with 1GB DIMMs. Though, I *thought* it was a bit different. My understanding of the way it was supposed to work was:

-- Take the last *bank* of the DIMMs, and map that high, but across the full width of the controller.
-- Most large DIMMs are dual-banked, and in the case I have, I have 4 1GB DIMMs which are each dual-banked. So, for each DIMM, the last bank is 0.5GB. And remapping between 3 & 4GB would make sense.

For the Rev E world, my understanding is that it remaps just the addresses you need in hardware without this memory controller hack, so I think it should fix any (perhaps inadvertent) slowdown.
One theory for the slowdown was broken MTRR entries, but I don't think they can explain your case. If the MTRR is not set to write back for memory, you get either very slow or normal performance, not a slight slowdown like you see.
If I misunderstood or there is a bug in what Tyan did, that would explain the slowdown. If they can only remap one DIMM, then you have to ungang the link and not run it in interleaved fashion, so you lose maximum memory bandwidth on all the DIMMs on the Opteron Northbridge involved. I'm going to see if we can fix that with Tyan. Worst case, you could probably get them to remap both of the 2 whole DIMMs up if you had 1GB DIMMs fully populated... (i.e. if 4 1GB DIMMs, it would remap 2 of them high and therefore preserve the interleaved state of the memory controller).
So, a "hole" at 4GB which grows downward depending on the total size of the addresses used is punched. It consists of (usually more or less in this order growing downward from 4GB):
-- x86 ROM/BIOS (this really is required to be aligned against the top of the 4GB boundary)
-- IO APIC
-- PCI/AGP devices as mapped in PCI config space numerically (but not required to be in that order). NOTE: each of the PCI domains is required to be aligned to its native size.
-- AGP aperture (NOTE: the AGP aperture is required to be aligned to its native size.)
There needs to be also some free space. e.g. with PCI hotplug or Cardbus in laptops the OS needs to assign IO space later after boot. It will also do this if the BIOS "forgot" to assign some IO mappings in devices.
It's tricky because the BIOS doesn't know how much is needed later.
This is a continuous source of problems on Linux too. I tried to map into unused space in this case (space not reported as used in the e820 map), but that also leads to mysterious failures on some boards.
Agreed.
The point of the "native size alignment" comments above is that if you have, say, a video card with a 256MB video buffer, then you'll have at least a 512MB "PCI hole", since the BIOS needs to be the last part before 4GB and the video buffer must be aligned to a 256MB (i.e. native size) boundary. If it was a 512MB video buffer, then you'd be guaranteed to lose at least 1GB.
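The arithmetic behind those two figures can be sketched as follows. This is only an illustration: the 1MB BIOS region is an assumed placeholder size, the helper names are made up, and a real hole also contains the IO APIC and other devices, so it can only be larger.

```python
MB = 1 << 20
GB = 1 << 30

def align_down(addr, size):
    """Round addr down to a multiple of size (size must be a power of 2)."""
    return addr & ~(size - 1)

def min_hole_size(region_sizes, top=4 * GB):
    """Stack regions downward from `top`, each aligned to its own size,
    and return how much address space below `top` they consume."""
    cursor = top
    for size in region_sizes:
        if size <= 0 or size & (size - 1):
            raise ValueError("region size must be a power of 2")
        cursor = align_down(cursor - size, size)
    return top - cursor

BIOS = 1 * MB  # assumed size, for illustration only

print(min_hole_size([BIOS, 256 * MB]) // MB)  # prints 512
print(min_hole_size([BIOS, 512 * MB]) // MB)  # prints 1024
```

Because the BIOS occupies the very top, the buffer cannot use the last naturally aligned slot below 4GB and has to drop down to the next one, which doubles the loss.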
The aperture must be aligned this way, but I am not sure about the video buffer. It might be a secondary requirement to make MTRR assignment easier ("MTRRs - the bane of the x86 world")
Native size alignment is a requirement of both the AGP aperture AND all PCI BARs (Base Address Registers, the mechanism whereby any RAM or Memory-mapped I/O is mapped into the physical address space). In the case of PCI, it is not possible to map them in any way but aligning to the size of the mapped area. All address bits below the power-of-2 size of the item being mapped are guaranteed to be masked out in the PCI specification. In PCI, the official way to determine the size of what a BAR can map is in fact by writing all 1's into the register and seeing which bits were masked out on the bottom, with the requirement that it must be a power-of-2. I've always thought that was a bit wacky. For AGP, the specification explicitly states that the aperture must be aligned, but the address register does happen to be separate from the way to set/determine the size of the window.

--
Erich Stefan Boleyn <erich@uruk.org> http://www.uruk.org/
"Reality is truly stranger than fiction; Probably why fiction is so popular"
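For reference, the BAR sizing procedure described above can be sketched like this for a 32-bit BAR. No hardware is touched: `simulated_readback` is a hypothetical stand-in for a device, and the function names are illustrative. The flag masks reflect the low 4 bits of a memory BAR and the low 2 bits of an I/O BAR being type flags rather than address bits.

```python
MEM_FLAG_MASK = 0xF  # low 4 bits of a memory BAR are type/prefetch flags
IO_FLAG_MASK = 0x3   # low 2 bits of an I/O BAR are flags

def bar_size_from_readback(readback, is_io=False):
    """Decode the mapped size from the value read back after writing
    all 1s to a 32-bit BAR."""
    flag_mask = IO_FLAG_MASK if is_io else MEM_FLAG_MASK
    address_bits = readback & 0xFFFFFFFF & ~flag_mask
    if address_bits == 0:
        return 0  # BAR not implemented
    # The lowest writable address bit gives the size: invert and add 1.
    return (~address_bits + 1) & 0xFFFFFFFF

def simulated_readback(size, flags=0x8):
    """Hypothetical device model: address bits below `size` read as 0."""
    return (0xFFFFFFFF & ~(size - 1)) | flags

value = simulated_readback(256 << 20)         # 256MB prefetchable memory BAR
print(hex(value))                             # prints 0xf0000008
print(bar_size_from_readback(value) >> 20)    # prints 256
```

Since the device ignores every address bit below the size, whatever base is written is forced onto a size-aligned boundary, which is why no other placement is possible.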