RE: [suse-programming-e] Hardware access to RAM

I will explain what happens on i386.
I understand that if I have a pointer, ptr, which points to an array of some structure then ptr++ will point to the next element. But that is all done in software. I want to optimise my software to match what happens in hardware.
The compiler translates ++ptr into adding the element size to the logical address stored in ptr. If ptr was declared as a pointer to int, then ++ptr would add 4 (sizeof(int) on i386) to ptr.
If I have a pointer pointing to address 256 and then look at address 257, by how many BYTES have I jumped in physical memory?
2. With "physical memory" I assume you mean the following: through hardware/software mechanisms, logical addresses are translated to linear addresses (segmentation) and then to physical addresses (paging). With Linux, both addresses 256 and 257 lie in the same page (each page is 4096 bytes). All the bytes within a single page map to a single physical page frame of the same size. The kernel assigns (almost) arbitrary page frames to pages, so if the two addresses were on opposite sides of a page boundary, you could not know the difference between the physical addresses.
In old-style memory of 32-bit words it would jump one word, that is 4 BYTES, but in a machine with 16-bit words it would jump one word of 2 BYTES. That is, always one machine word. Also, if your machine was up to date, it always had a matching data bus to access a full word (machine word, not a Bill Gates 'word') in one go. (Cheap machines, with a cheap data bus, would take two mouthfuls.)
That does not make sense to me. Are you talking about the size of the memory fetches the CPU caches use when transferring data to/from memory?
An equivalent question would be: if I have a string, Str(16), of bytes Str(0) to Str(15) containing "ABCDEFGHIJKLMNOn" (n = NUL) in a 32-bit machine, does the request for Str(9), i.e. "J", electronically load that character directly into the least significant byte of the register (I doubt it)? Or does assembler code (generated by the compiler) load the characters "IJKL" into the register, shift right ("00IJ") and mask ("000J")? That is: does the electronic system access one byte directly? Or perhaps the load/shift/mask is done in hardware (I doubt it)?
One byte (assuming you are using 1 byte characters).
For me the important issue is, for a 64-bit machine: I suspect that access to IntArray(23) is faster if defined as int64_t IntArray(128), i.e. as 64-bit words with the top bits generally being zero, than if defined as int32_t IntArray(128), i.e. as 32-bit words, because, for a 64-bit machine, the latter will store two values into each physical address.
The CPU caches would rather see your array in the smallest possible memory region. When you want access to Str(9), the CPU fetches, say, 128 bits around that value into the cache. Then when you ask for Str(10), it is already in the cache. But some operations may be faster for 64 bits than for 32 bits. I guess multiplication and division/remainder might be faster (ask AMD). So whether int32_t or int64_t is fastest is probably very dependent on your application.
Regards,
Håkon Hallingstad
Software Developer, EDB
+47 2252 8218
hakon.hallingstad@edb.com
www.edb.com
"IT er ikke alt - men det hjelper" ("IT isn't everything - but it helps")

On Sunday 13 March 2005 20:21, Hallingstad Håkon wrote:
I will explain what happens on i386.
I understand that if I have a pointer, ptr, which points to an array of some structure then ptr++ will point to the next element. But that is all done in software. I want to optimise my software to match what happens in hardware.
The compiler translates ++ptr into adding the element size to the logical address stored in ptr. If ptr was declared as a pointer to int, then ++ptr would add 4 (sizeof(int) on i386) to ptr.
Yes, as I mentioned, I understand this.
If I have a pointer pointing to address 256 and then look at address 257, by how many BYTES have I jumped in physical memory?
2
2? Are you sure? It would be two in an old-fashioned 16-bit machine with a 16-bit data bus; but this is a 64-bit machine, and the original definition of a word was a hardware word (not a Bill Gates 16-bit thingy).
With "physical memory" I assume you mean the following: through hardware/software mechanisms, logical addresses are translated to linear addresses (segmentation) and then to physical addresses (paging).
Segmentation was an old-fashioned idea forced by the limited address size of a 16-bit machine. Do we still need to reference it?
With Linux, both addresses 256 and 257 lie in the same page (each page is 4096 bytes). All the bytes within a single page map to a single physical page frame of the same size.
Ah! This is one of the ideas I seek: does address 256 point to byte number 256 - 1 = 255, and 257 to the very next character in the string? Or does 256 point to byte number 8 x 256 - 1 = 2048 - 1, and 257 to 8 x 257 - 1, i.e. the last byte plus 8 bytes, because the memory is 64-bit (8 bytes)?
The kernel assigns (almost) arbitrary page frames to pages. So if the two addresses were on opposite sides of a page boundary, you could not know the difference between the physical addresses.
In old-style memory of 32-bit words it would jump one word, that is 4 BYTES, but in a machine with 16-bit words it would jump one word of 2 BYTES. That is, always one machine word. Also, if your machine was up to date, it always had a matching data bus to access a full word (machine word, not a Bill Gates 'word') in one go. (Cheap machines, with a cheap data bus, would take two mouthfuls.)
That does not make sense to me. Are you talking about the size of the memory fetches the CPU caches use when transferring data to/from memory?
Yes - see the note above regarding your answer of '2'.
An equivalent question would be: if I have a string, Str(16), of bytes Str(0) to Str(15) containing "ABCDEFGHIJKLMNOn" (n = NUL) in a 32-bit machine, does the request for Str(9), i.e. "J", electronically load that character directly into the least significant byte of the register (I doubt it)? Or does assembler code (generated by the compiler) load the characters "IJKL" into the register, shift right ("00IJ") and mask ("000J")? That is: does the electronic system access one byte directly? Or perhaps the load/shift/mask is done in hardware (I doubt it)?
One byte (assuming you are using 1 byte characters).
If you say this, you are implying that the wires connected to bits 9-16 (1-based, MS storage mode) can be directly connected to bits 1-8 of the register. Are you sure you meant this?
For me the important issue is, for a 64-bit machine: I suspect that access to IntArray(23) is faster if defined as int64_t IntArray(128), i.e. as 64-bit words with the top bits generally being zero, than if defined as int32_t IntArray(128), i.e. as 32-bit words, because, for a 64-bit machine, the latter will store two values into each physical address.
The CPU caches would rather see your array in the smallest possible memory region. When you want access to Str(9), the CPU fetches, say, 128 bits around that value into the cache. Then when you ask for Str(10), it is already in the cache.
Okay - this agrees with what you said earlier. I can accept this as a good idea. In turn it means that loading a 'packed array' would be faster. But is getting from cache to register as fast as if it were defined as 64-bit? Your idea here would mean that 32-bit versus 64-bit access would gain or lose depending on whether the integer is an individual value or part of an array.
But some operations may be faster for 64 bits than for 32 bits. I guess multiplication and division/remainder might be faster (ask AMD).
Yes, I agree. Except that the maths co-processor often has more bits than 64.
So whether int32_t or int64_t is fastest is probably very dependent on your application.
Thanks Hakon for your discussion. Regards, Colin
participants (2)
- Colin Carter
- Hallingstad Håkon