[opensuse] Performance penalty for mis-aligned partitions on a 4K physical sector drive
Felix,

You asked about the subject, but your thread went on in other directions.

The performance penalty varies based on access pattern.

For reads, there is basically no penalty. A modern rotating disk reads an entire track at a time and puts it in the drive's internal cache. A modern track is roughly 1MB, so when you see a disk drive with an 8MB cache, that means it can hold 8 tracks. It does NOT mean it will hold 8MB of random i/o data scattered around the disk. It is far better to think of that cache as 8 track buffers.

So if you issue an unaligned page read, the drive will read in the entire track the page resides on and then transfer that to the kernel. No meaningful penalty at all if the unaligned page resides on a single track.

If the unaligned page resides on 2 tracks due to the misalignment, it will trigger 2 track reads and possibly a disk seek at the internal drive level. If the 2 tracks are on the same cylinder, a disk seek is still not needed, so the penalty is another rotation of the disk. If the 2 tracks are on different cylinders, then a disk seek between 2 contiguous cylinders is needed, but that is also pretty quick (a couple of msecs, I think, with modern drives).

If a track is 1MB, then there are about 250 4KB pages per track, so only one time in 250 do you have a performance penalty. (Drives have a variable number of pages per track based on the diameter of the specific track. Tracks near the center of the drive have a much smaller diameter than a track on the outer edge, so the likelihood of incurring the penalty is about twice as high at the "end" of the disk.)

The trouble is with writes. For long sequential writes, there is minimal penalty again because the misalignment only matters at the start and end of the transfer. I.e. dd bs=1MB will trigger 1MB i/o's to the drive and only the first and last page get hit with a performance penalty.

If the workload is small 1-page writes, then the penalty is huge. For every write, a read/modify/write cycle has to be implemented.
At the drive level that means every write requires an entire disk rotation to implement the read. The modify is free, but then a second rotation is needed to do the write.

A 7200 RPM drive takes roughly 8.3 milliseconds to make a rotation, so every 1-page unaligned write takes an extra 8.3 milliseconds to complete.

You would need to benchmark your typical load, but if you do lots of small i/o's, I think you will find that 8.3 msecs is a pretty major penalty.

== specific to 1 KB pages ==

Alignment won't help much in the general case. If you write a 1KB random i/o to a 4KB page, the drive is still forced to do:

read track, modify data, write track

If the 1KB page is unaligned, it is the same, except in the rare case that the 1KB is split over 2 tracks, where it becomes:

read track1, modify data, write track1
seek disk if needed
read track2, modify data, write track2

But a 1KB page being split between 2 tracks should happen much less than 1% of the time.

If you are going to use drives with 4KB physical sectors you need to avoid filesystems with 1KB blocks.

Greg
--
Greg Freemyer
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
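The rotational numbers above can be sanity-checked with a little arithmetic. A minimal sketch, using the thread's ballpark figures rather than any measured drive data:

```python
# A sketch: the extra latency of a read/modify/write cycle on a
# rotating drive, using the thread's ballpark figures (not measured).

def rotation_ms(rpm: int) -> float:
    """One full platter rotation, in milliseconds."""
    return 60_000 / rpm

# A misaligned small write forces read, then roughly one full rotation,
# then write -- so the extra cost per write is about one rotation.
print(f"7200 RPM: one rotation = {rotation_ms(7200):.1f} ms")  # ~8.3 ms
print(f"5400 RPM: one rotation = {rotation_ms(5400):.1f} ms")
```

The same formula shows why slower consumer drives pay an even larger per-write penalty.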
On 03/16/2015 09:32 AM, Greg Freemyer wrote:
If you are going to use drives with 4KB physical sectors you need to avoid filesystems with 1KB blocks.
I'm sure there are many in the silent majority here who don't have the detailed knowledge, or, like me, have let it pass them by since they deal with other matters (such as users and applications), and wonder about one or another implication in that statement.

We've seen the example with mkfs or extFS for various block sizes, but what about other file systems?

And more to the point for most of us: How can we tell about these things?

* What block size the disks are
  - I suppose all late model disks are 4K :-)
* What size the file system blocks are for the file systems in use
  - Not just extFS but XFS, reiserFS, BtrFS
  - If they are not 4K, what can we do about it?
* Some might ask about the other file systems in /proc/filesystems.
  - Does it matter with tmpfs? What about when it 'overflows' to /tmp?
* How can we tell if they are aligned?
  - If they are not, what can we do about it?

Either it matters, and these are the questions that emerge, or it doesn't. Speaking for myself, if this issue had never come up, it wouldn't have mattered to me[1], but I'm sure there are people out there who do care even if they are not sysadmins of a large pool of servers. I have friends who are gamers and fall into that category.

[1] Regular readers will recall that since the equipment I play with from the Closet of Anxieties is never leading edge, ultimate performance isn't a burning concern. And professionally, the big IBM, HP etc. machines are a separate issue.
--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
Anton Aylward wrote:
On 03/16/2015 09:32 AM, Greg Freemyer wrote:
If you are going to use drives with 4KB physical sectors you need to avoid filesystems with 1KB blocks.
I'm sure there are many in the silent majority here who don't have the detailed knowledge, or, like me, have let it pass them by since they deal with other matters (such as users and applications), and wonder about one or another implication in that statement.
We've seen the example with mkfs or extFS for various block sizes, but what about other file systems?
And more to the point for most of us:
How can we tell about these things?
* What block size the disks are are
I suppose all late model disks are 4K :-)
Probably the larger ones, but I have a bunch of 2TB drives; they're all 512 bytes.
* How can we tell if they are aligned?
Check the partition boundaries. If a partition's start offset in bytes is divisible by the physical sector size (4096 bytes here), the partition is aligned. With 512-byte logical sectors, that means the starting sector number must be divisible by 8.
- if they are not, what can we do about it?
Repartition. -- Per Jessen, Zürich (12.8°C) http://www.dns24.ch/ - your free DNS host, made in Switzerland.
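A quick way to apply that boundary check is a divisibility test on the partition start that fdisk or parted reports. A sketch (the example start sectors are hypothetical, but match common tool defaults):

```python
# Check whether a partition start is aligned to the drive's physical
# sector size. Starts are what fdisk/parted report: a sector number
# counted in logical (usually 512-byte) sectors.

def is_aligned(start_sector: int,
               logical_bytes: int = 512,
               physical_bytes: int = 4096) -> bool:
    """True if the partition's byte offset is a multiple of the
    physical sector size."""
    return (start_sector * logical_bytes) % physical_bytes == 0

# Modern partitioners default to sector 2048, which is aligned;
# old DOS-style tools started at sector 63, which is not.
print(is_aligned(2048))  # True
print(is_aligned(63))    # False
```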
Per Jessen composed on 2015-03-16 16:00 (UTC+0100):
Anton Aylward wrote:
I suppose all late model disks are 4K :-)
Probably the larger ones, but I have a bunch of 2Tb drives, they're all 512bytes.
If the actual date of manufacture (often missing from labels of refurbs) is post-2010, it is almost certainly 4k, regardless of size. The only 2TB devices I have were manufactured before 2011. IOW, Anton was on track, everything made in the past 4+ years is 4k. -- "The wise are known for their understanding, and pleasant words are persuasive." Proverbs 16:21 (New Living Translation) Team OS/2 ** Reg. Linux User #211409 ** a11y rocks! Felix Miata *** http://fm.no-ip.com/
On Mon, Mar 16, 2015 at 11:49 AM, Felix Miata
Per Jessen composed on 2015-03-16 16:00 (UTC+0100):
Anton Aylward wrote:
I suppose all late model disks are 4K :-)
Probably the larger ones, but I have a bunch of 2Tb drives, they're all 512bytes.
If the actual date of manufacture (often missing from labels of refurbs) is post-2010, it is almost certainly 4k, regardless of size. The only 2TB devices I have were manufactured before 2011. IOW, Anton was on track, everything made in the past 4+ years is 4k.
HGST lists the Ultrastar 7K4000's as a 2013 product. They come in 512n versions up to 4TB. http://www.hgst.com/tech/techlib.nsf/techdocs/FD3F376DC2ECCE68882579D40082C3... http://www.hgst.com/tech/techlib.nsf/techdocs/9E4E119077AD1D8B86256DD0005A2F... All of the SAS versions are 512n also. -- Chris Murphy
Chris Murphy wrote:
On Mon, Mar 16, 2015 at 11:49 AM, Felix Miata
wrote: IOW, Anton was on track, everything made in the past 4+ years is 4k.
HGST lists the Ultrastar 7K4000's as a 2013 product. They come in 512n versions up to 4TB.
The 512n = it is compatible with 512-byte OS's, because the *disk* firmware does the buffering transparently for drivers and controllers that don't know about its 4k sector size. 512n = 512new-format.

I have multiples of this... and my old RAID controller, an LSI8080-8e, would only read them as 512-byte sectors -- even though the linux kernel would show their physical size as 4096 bytes. My 'sda' is a raid of these (showing pwd because my prompt is trimmed to show only the first and last parts of long paths):

Ishtar:/sys/devices/pci0000:00/../sda/queue> pwd
/sys/devices/pci0000:00/0000:00:09.0/0000:07:00.0/host0/target0:2:0/0:2:0:0/block/sda/queue
Ishtar:/sys/devices/pci0000:00/../sda/queue> cat logical_block_size
512
Ishtar:/sys/devices/pci0000:00/../sda/queue> cat physical_block_size
4096
On Mon, Mar 16, 2015 at 3:12 PM, Linda Walsh
Chris Murphy wrote:
On Mon, Mar 16, 2015 at 11:49 AM, Felix Miata
wrote: IOW, Anton was on track, everything made in the past 4+ years is 4k.
HGST lists the Ultrastar 7K4000's as a 2013 product. They come in 512n versions up to 4TB.
---- The 512n = it is compatible with 512-byte OS's, because the *disk* firmware does the buffering transparently for drivers and controllers that don't know about its 4k sector size.
512n = 512new-format.
No. The n means native. The spec sheet I included specifically distinguishes between these drives at a physical level, their areal densities are different, it's not just a matter of a different kind of abstraction.
I have multiples of this... and my old RAID controller, an LSI8080-8e, would only read them as 512-byte sectors -- even though the linux kernel would show their physical size as 4096 bytes. My 'sda' is a raid of these (showing pwd because my prompt is trimmed to show only the first and last parts of long paths):

Ishtar:/sys/devices/pci0000:00/../sda/queue> pwd
/sys/devices/pci0000:00/0000:00:09.0/0000:07:00.0/host0/target0:2:0/0:2:0:0/block/sda/queue
Ishtar:/sys/devices/pci0000:00/../sda/queue> cat logical_block_size
512
Ishtar:/sys/devices/pci0000:00/../sda/queue> cat physical_block_size
4096
This is clearly a 512e drive, not 512n. -- Chris Murphy
Chris Murphy wrote:

Ishtar:/sys/devices/pci0000:00/../sda/queue> pwd
/sys/devices/pci0000:00/0000:00:09.0/0000:07:00.0/host0/target0:2:0/0:2:0:0/block/sda/queue
Ishtar:/sys/devices/pci0000:00/../sda/queue> cat logical_block_size
512
Ishtar:/sys/devices/pci0000:00/../sda/queue> cat physical_block_size
4096

This is clearly a 512e drive, not 512n.

=== Ok.. mea culpa -- when I read the specs, the native drive didn't have a suffix... but I'm sure that highlighted the difference too well. Though, I'm not sure I agree with your adjective "clearly"... *ahem*... That's what I get for not rechecking current terminology before posting. :-( <---*bang*
On Tue, Mar 17, 2015 at 4:20 PM, Linda Walsh
Chris Murphy wrote:
Ishtar:/sys/devices/pci0000:00/../sda/queue> pwd
/sys/devices/pci0000:00/0000:00:09.0/0000:07:00.0/host0/target0:2:0/0:2:0:0/block/sda/queue
Ishtar:/sys/devices/pci0000:00/../sda/queue> cat logical_block_size
512
Ishtar:/sys/devices/pci0000:00/../sda/queue> cat physical_block_size
4096
This is clearly a 512e drive, not 512n.
=== Ok.. mea culpa -- when I read the specs, the native drive didn't have a suffix... but i'm sure that highlighted the difference too well. Though, I'm not sure I agree with your adjective "clearly"... *ahem*...
Yeah, I can definitely appreciate that. The terms are non-obvious. Presumably 512n as a term only came about after 512e, despite being first. And it's also true that kernel reporting of logical/physical sizes is only consistent in a direct-attached (no enclosure) context. Thus far I've seen enclosures misreport 512e drives as

logical_block_size 512
physical_block_size 512

and also

logical_block_size 4096
physical_block_size 4096

I haven't yet seen an enclosure either correctly pass through, or obfuscate, a 512n drive such that

logical_block_size 512
physical_block_size 4096

which is the admittedly not-so-clear way I arrived at "clearly" it's 512e. Passthrough would be good, but obfuscating a 512n drive as having 4096-byte physical sectors would be sabotage.
That's what I get for not rechecking current terminology before posting.
:-( <---*bang*
Oh no, wait for the base 2 / IEC / kibi,mebi,gibi vs base 10 / SI / kilo,mega,giga thread. -- Chris Murphy
Felix Miata wrote:
Per Jessen composed on 2015-03-16 16:00 (UTC+0100):
Anton Aylward wrote:
I suppose all late model disks are 4K :-)
Probably the larger ones, but I have a bunch of 2Tb drives, they're all 512bytes.
If the actual date of manufacture (often missing from labels of refurbs) is post-2010, it is almost certainly 4k, regardless of size. The only 2TB devices I have were manufactured before 2011. IOW, Anton was on track, everything made in the past 4+ years is 4k.
Empirical evidence says otherwise. I have for instance 24 SATA drives currently active, all 2TB. They're running 24x7; I'm judging their approximate manufacture date by power_on_hours. The oldest is from Jan 2012, the youngest from March 2014. They're all 512-byte block size. And yes, they were all deployed within days of purchase. Manufacture date could vary by a couple of months, I suppose. I bought some WDC RE4's only just recently, but they're spares; I don't know their block sizes. -- Per Jessen, Zürich (9.9°C) http://www.hostsuisse.com/ - dedicated server rental in Switzerland.
Per Jessen composed on 2015-03-16 20:35 (UTC+0100):
Felix Miata wrote:
If the actual date of manufacture (often missing from labels of refurbs) is post-2010, it is almost certainly 4k, regardless of size. The only 2TB devices I have were manufactured before 2011. IOW, Anton was on track, everything made in the past 4+ years is 4k.
Empirical evidence says otherwise. I have for instance 24 SATA drives currently active, all 2Tb. They're running 24x7, I'm judging their approx. manufacture date by power_on_hours. The oldest is from Jan2012, the youngest from March 2014.
Can't read the labels on any of them?
They're all 512bytes block size. And
Determined how? What do the manufacturer's specifications report? As I replied to Chris earlier today, not every utility ostensibly able to report a device's block size will report it correctly. Going through a RAID controller would raise my level of suspicion.
yes, they were all deployed within days of purchase. Manufacture date could vary by a couple of months I suppose.
More than a couple. I don't have 24 of any one model, but I do have models with manufacture dates more than 6 months apart. I suppose anyone trying to match a big RAID's installed models with new acquisitions just might find them separated by 2 years or more.
I bought some WDC RE4's only just recently, but they're spares, I don't know their blocksizes.
Their labels you surely could read to get the dates. :-) -- "The wise are known for their understanding, and pleasant words are persuasive." Proverbs 16:21 (New Living Translation) Team OS/2 ** Reg. Linux User #211409 ** a11y rocks! Felix Miata *** http://fm.no-ip.com/
Felix Miata wrote:
Per Jessen composed on 2015-03-16 20:35 (UTC+0100):
Felix Miata wrote:
If the actual date of manufacture (often missing from labels of refurbs) is post-2010, it is almost certainly 4k, regardless of size. The only 2TB devices I have were manufactured before 2011. IOW, Anton was on track, everything made in the past 4+ years is 4k.
Empirical evidence says otherwise. I have for instance 24 SATA drives currently active, all 2Tb. They're running 24x7, I'm judging their approx. manufacture date by power_on_hours. The oldest is from Jan2012, the youngest from March 2014.
Can't read the labels on any of them?
Nope, they're all mounted in trays, I'd have to pull them to read the labels.
They're all 512bytes block size. And
Determined how?
Just with fdisk.
What do the manufacturer's specifications report?
Okay, you've tricked me into it :-) WDC RE4 - 512 bytes. HGST Ultrastar 7K3000 - 512 bytes. HGST Ultrastar A7K2000 - 512 bytes.

My feeling is that 4K block sizes are more often seen on disk sizes of 3TB and up. I have some WDC 3TB and some Seagate 4TB drives for my mythtv setup; I'm certain they're all 4K. Here's one in-store even being advertised as a 512-byte-sector drive: https://www.pcp.ch/Hitachi-Ultrastar-7K4000-2TB-3.5-Sector-size-512e-1a17471...
As I replied to Chris earlier today, not every utility ostensibly able to report a device's block size will report it correctly. Going through a RAID controller would raise my level of suspicion.
They are connected to a RAID controller, although configured as JBOD.
yes, they were all deployed within days of purchase. Manufacture date could vary by a couple of months I suppose.
More than a couple. I don't have 24 of any one model, but I do have models with manufacture dates more than 6 months apart.
I bought some WDC RE4's only just recently, but they're spares, I don't know their blocksizes.
Their labels you surely could read to get the dates. :-)
Hehe, true :-) Well, currently we have 10 spares, all WDC RE4 2TB, manufactured in April or August 2014. AFAICT, bought in batches in March, June and September 2014. (Sep 2014 = "recently"...). -- Per Jessen, Zürich (6.2°C) http://www.dns24.ch/ - free dynamic DNS, made in Switzerland.
Per Jessen wrote:
My feeling is that 4K blocksize are more often seen on disk-sizes 3Tb and up. I have some WDC 3Tb and some Seagate 4Tb drives for my mythtv setup, I'm certain they're all 4K.
Here's one in-store even being advertised as 512bytes sector:
https://www.pcp.ch/Hitachi-Ultrastar-7K4000-2TB-3.5-Sector-size-512e-1a17471...
Hmm, I was too fast there. I guess 512e means something else - this is the datasheet; those drives seem to come in two versions - 512e(mulation) and 512n(ative): https://www.pcp.ch/Ultrastar-7K4000-2TB-3.5-Sector-size-512e-td17471118.htm -- Per Jessen, Zürich (6.4°C) http://www.hostsuisse.com/ - virtual servers, made in Switzerland.
On Tue, Mar 17, 2015 at 10:51 AM, Per Jessen
Per Jessen wrote:
My feeling is that 4K blocksize are more often seen on disk-sizes 3Tb and up. I have some WDC 3Tb and some Seagate 4Tb drives for my mythtv setup, I'm certain they're all 4K.
Here's one in-store even being advertised as 512bytes sector:
https://www.pcp.ch/Hitachi-Ultrastar-7K4000-2TB-3.5-Sector-size-512e-1a17471...
Hmm, I was too fast there. I guess 512e means something else - this is the datasheet, those drives seems to come in two versions - 512e(mulation) and 512n(ative) :
https://www.pcp.ch/Ultrastar-7K4000-2TB-3.5-Sector-size-512e-td17471118.htm
512e means emulation mode - the underlying physical sector size is different (probably 4K) but the drive pretends to have 512-byte sectors to the outside world. 512n is native 512 bytes per sector.
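That distinction maps directly onto the two sizes the kernel exposes in /sys/block/<dev>/queue/. A small sketch of the mapping; the classify() helper name is illustrative, not any real API:

```python
# Classify a drive from the logical/physical block sizes the kernel
# reports in /sys/block/<dev>/queue/{logical,physical}_block_size.
# classify() is a hypothetical helper for illustration only.

def classify(logical: int, physical: int) -> str:
    if logical == 512 and physical == 512:
        return "512n (native 512-byte sectors)"
    if logical == 512 and physical == 4096:
        return "512e (4K physical, 512-byte emulation)"
    if logical == 4096 and physical == 4096:
        return "4Kn (native 4K sectors)"
    return "unknown/misreported"

# The 512/4096 combination seen earlier in the thread:
print(classify(512, 4096))
```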
On Mon 16 Mar 2015 09:32:09 AM CDT, Greg Freemyer wrote:
Felix,
You asked about the subject, but your thread went on in other directions.
The performance penalty varies based on access pattern.
For reads, there is basically no penalty. A modern rotating disk reads an entire track at a time and puts it in the drive's internal cache. A modern track is roughly 1MB, so when you see a disk drive with an 8MB cache, that means it can hold 8 tracks.
It does NOT mean it will hold 8MB of random i/o data scattered around the disk. It is far better to think about that cache as 8 track buffers.
So if you issue an unaligned page read, the drive will read in the entire track the page resides on and then will transfer that to the kernel. No meaningful penalty at all if the unaligned page resides on a single track.
If the unaligned page resides on 2 tracks due to the misalignment it will trigger 2 track reads and possibly a disk seek at the internal drive level. If the 2 tracks are on the same cylinder, a disk seek is still not needed, so the penalty is another rotation of the disk. If the 2 tracks are on different cylinders, then a disk seek between 2 contiguous cylinders is needed, but that is also pretty quick (a couple msecs I think with modern drives).
If a track is 1MB, then there are about 250 4KB pages per track, so only one time in 250 do you have a performance penalty. (Drives have a variable number of pages per track based on the diameter of the specific track. Tracks near the center of the drive have a much smaller diameter than a track on the outer edge, so the likelihood of incurring the penalty is about twice as high at the "end" of the disk.)
The trouble is with writes. For long sequential writes, there is minimal penalty again because the misalignment only happens at the start and end of the transfer.
I.e. dd bs=1MB will trigger 1MB i/o's to the drive and only the first and last page get hit with a performance penalty.
If the workload is small 1 page writes, then the penalty is huge. For every write a read/modify/write cycle has to be implemented.
At the drive level that means every write requires an entire disk rotation to implement the read. The modify is free, but then a second rotation is needed to do the write.
A 7200 RPM drive takes roughly 8.3 milliseconds to make a rotation, so every 1 page unaligned write takes an extra 8.3 milliseconds to complete.
You would need to benchmark your typical load, but if you do lots of small i/o's, I think you will find that 8.3 msecs is a pretty major penalty.
== specific to 1 KB pages ==
Alignment won't help much in the general case. If you write a 1KB random i/o to a 4KB page, the drive is still forced to do:
read track, modify data, write track.
If the 1KB page is unaligned, it is the same except in the rare case that the 1 KB is split over 2 tracks where it becomes:
read track1, modify data, write track1 seek disk if needed read track2, modify data, write track2
But a 1KB page being split between 2 tracks should happen much less than 1% of the time.
If you are going to use drives with 4KB physical sectors you need to avoid filesystems with 1KB blocks.
Greg
--
Greg Freemyer

Hi
I have found the fio tool and this script (fio.bash): http://www.ansatt.hig.no/erikh/sysadm/fio.bash
To at least give me a benchmark of what disk performance is like (I used it for setting up bcache). You can tweak the tests for the blocksize. -- Cheers Malcolm °¿° LFCS, SUSE Knowledge Partner (Linux Counter #276890) SUSE Linux Enterprise Desktop 12 GNOME 3.10.1 Kernel 3.12.36-38-default up 19:38, 3 users, load average: 0.43, 0.39, 0.39 CPU AMD A4-5150M APU @ 3.3GHz | GPU Richland Radeon HD 8350G
Greg Freemyer composed on 2015-03-16 09:32 (UTC-0400):
If the workload is small 1 page writes, then the penalty is huge. For every write a read/modify/write cycle has to be implemented.
Is it really? What kind of use case produces many writes of different small files in sequence or short order? About the only one I can think of is package management extracting config files from rpms, but there the configs would typically be interspersed among much larger binaries, losing the impact of the penalty within a much larger overall operation.
You would need to benchmark your typical load, but if you do lots of small i/o's, I think you will find that 8.3 msecs a pretty major penalty.
This highlights why I brought up the subject of not having located benchmarking done by others in the earlier thread. Other than the brief mention by Chris of linux-raid, I've not noticed anyone mention exposure to or experience with such benchmarking. I'm not a strong believer in benchmarking always being fairly representative of real-life operation either.

I'm thinking that on older systems that are slower to start with, having been using disks much slower than what evolution has since provided, the replaced disks would have been significantly slower than anything with 4k on the platters, with net post-upgrade performance simply less improved by using newer disks without aligning, not slower than with the older ones. IOW, my suspicion is there would generally be little if any observable penalty in an upgrade context, compared to benchmarking using all recent hardware.

I do have one 3.0GHz P4 HT testing machine with 2G RAM and only 64-bit installations that I wonder about. hdparm reports 74MB/sec. It rather routinely seems slower than Socket A systems running 2GHz or slower, and slower-GHz 32-bit P4s without HT. Its legacy-partitioned HD was purchased used for cheap, manufactured by Seagate, but model HP432337004 GB0500C4413, firmware HPG1, manufactured September 2007. Maybe it's one of the earliest 512e devices out the door, claiming to have 512-byte sectors, as reported by blockdev --getpbsz and --getss, hdparm -I and parted -l, but in fact with 4k on the platters and emulation in disguise, being suboptimally handled by open source kernels and drivers on non-HP hardware? -- "The wise are known for their understanding, and pleasant words are persuasive." Proverbs 16:21 (New Living Translation) Team OS/2 ** Reg. Linux User #211409 ** a11y rocks! Felix Miata *** http://fm.no-ip.com/
Felix Miata wrote:
Greg Freemyer composed on 2015-03-16 09:32 (UTC-0400):
If the workload is small 1 page writes, then the penalty is huge. For every write a read/modify/write cycle has to be implemented.
Is it really? What kind of use case produces many writes of different small files in sequence or short order?
Perhaps a busy email server. -- Per Jessen, Zürich (11.4°C) http://www.dns24.ch/ - your free DNS host, made in Switzerland.
On Thu, Mar 19, 2015 at 5:53 AM, Per Jessen
Felix Miata wrote:
Greg Freemyer composed on 2015-03-16 09:32 (UTC-0400):
If the workload is small 1 page writes, then the penalty is huge. For every write a read/modify/write cycle has to be implemented.
Is it really? What kind of use case produces many writes of different small files in sequence or short order?
Almost any database server. Untarring the kernel source tarball?

More about untarring a tarball: even with files an average size of 1MB, I believe there would be significant impact. Remember inodes are less than 4KB, so every file create involves inode updates. If the filesystem knows you have 4KB physical sectors, it tries hard to only send 4KB writes. If you have 1KB pages set up, it will send 1KB of inode updates at a time. Every one of those will take an extra platter rotation.

Basically, any workload where the average write is less than a full track will see a major penalty for sure if the writes are not properly sized and aligned to the physical sectors. (Often 1MB/track is a reasonable guesstimate. Again, it varies by where on the drive you are writing.)
Perhaps a busy email server.
I can tell you parsing 50GB of PST files on rotating rust can take days whereas the same workload goes to hours with SSD. (A few million seeks really adds up, and most rotating drives can only do hundreds of random i/o's per second.) I would expect doing the same thing on a rotating drive with 1KB pages but 4KB sectors to take almost twice as long (i.e. a full week?). (Clearly, I do this work on SSDs when I can.)

With my tool of choice, every email has to be read out of the PST and dumped into an EML; then a follow-on process reads every EML and adds the metadata to a database. Lots of the EML files are small, and all of the database updates are small.

Take a 5KB EML as a reasonable example of the poor situation: a 5KB write is a full 4KB sector and a partial sector. The full sector is not a problem; it is a pure write. The problem is the 1KB page at the end of the file. A 4KB-sector drive will NOT allow that to go straight to disk. Instead it has to read the current contents of the 4KB physical sector, modify the first KB of it, then write the full sector back out.

The reason for the read/modify/write cycle is the ECC information in the header/footer of the physical sector. If the drive allowed a partial physical sector write, the ECC data would be immediately out of date. Therefore, the drive only ever writes full physical sectors. So for that 1KB at the end of the EML file, the drive has to: read the physical sector; wait 8.3 msecs for the platter to rotate around; write the updated physical sector.

For ease of calculation, let's round 8.3 msecs to 10 msecs. If you have a million 5KB emails to create on disk, that's 10,000 seconds of time wasted waiting for the disk to rotate around. Bad enough, but I forgot to say you need to update a million inodes as well. Those will also not be full physical sector updates, so double that to 20,000 wasted seconds. That's roughly 6 wasted hours for creating a million files.
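Greg's back-of-the-envelope figure can be reproduced mechanically. A sketch under his stated assumptions (the 10 ms rounding and the million-email workload are his hypotheticals, not measurements):

```python
# Reproduce the thread's estimate of time lost to read/modify/write
# cycles when creating many small files whose tails only partially
# fill a 4KB physical sector.

PENALTY_S = 0.010     # one RMW cycle, rounded up from ~8.3 ms
FILES = 1_000_000     # hypothetical million-email workload

data_rmw = FILES      # one partial-sector write per 5KB file tail
inode_rmw = FILES     # plus one partial-sector inode update per file

wasted_s = (data_rmw + inode_rmw) * PENALTY_S
print(f"{wasted_s:.0f} s wasted, about {wasted_s / 3600:.1f} hours")
```

This lands at 20,000 seconds, about 5.6 hours, matching the "roughly 6 wasted hours" above.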
In my business life, I often kick off processes that create a 1-million-write workload of relatively small writes. Greg
Greg Freemyer wrote:
On Thu, Mar 19, 2015 at 5:53 AM, Per Jessen
wrote: Felix Miata wrote:
Greg Freemyer composed on 2015-03-16 09:32 (UTC-0400):
If the workload is small 1 page writes, then the penalty is huge. For every write a read/modify/write cycle has to be implemented.
Is it really? What kind of use case produces many writes of different small files in sequence or short order?
Almost any database server.
I have a couple of mariadb/mysql installations; I don't think they do "many writes of different small files in sequence or short order". I mean, some, but the tables don't tend to be small files and the access is more random than sequential.
Untarring the kernel source tarball?
Yup.
Perhaps a busy email server.
I can tell you parsing 50GB of PST files on rotating rust can take days whereas the same workload goes to hours with SSD. (a few million seeks really adds up and most rotating drives can only do hundreds of random i/o's per second.)
I was thinking more of e.g. a postfix installation and the queue files. Or a dovecot ditto. -- Per Jessen, Zürich (15.1°C) http://www.dns24.ch/ - free dynamic DNS, made in Switzerland.
participants (8)
- Andrei Borzenkov
- Anton Aylward
- Chris Murphy
- Felix Miata
- Greg Freemyer
- Linda Walsh
- Malcolm
- Per Jessen