[opensuse] smartctl: is this HD really OK?
I have a multiboot system that has long delays at selecting Grub menu items. The Linux-bzImage messages appear, and should be shortly followed by the initrd message, but that's where the delay occurs. It can be as long as 3 minutes. It only happens with some installations and not others, all of which are normally being started from the same Grub on sda3, but can be chainloaded to, which doesn't eliminate the delays. Even after redoing Grub setup for 13.2, chainloading to its Grub triggers a reboot.

This HD spent a bit over 3 years in a STB, after which I replaced it, which I did because the STB was stalling inexplicably. After the replacement, I used the WD utility to evaluate the device. It found bad sectors, and claimed it corrected them, and that the device passed.

I umounted all partitions except / while booted to TW, and ran e2fsck -f on each installation's /. All passed, except the last tried (Kubuntu), which resulted in "error reading block 34307 (attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 10863. Ignore error<y>?" This repeated for blocks 34308, 34309, 590029, 590030, 688204 and several more, followed by directory corruption for inodes 68166, 98879 & yada. Most of these partitions are ext4. 11.4, 12.1 & 12.2 are ext3.

Since it "passed" it has spent at least 20 hours in a PC, after I cloned from its original HD to the "passed" HD and switched them. NAICT, smartctl thinks this HD is OK, but is it really? I can't think of any other explanation for boot delays like this except for trouble reading sectors containing kernels or initrds - installing a new kernel/initrd can switch an installation from suffering the delays to one not, and vice versa.

I asked about this on the linux-ide mailing list 2 weeks ago, but got no feedback there: http://marc.info/?l=linux-ide&m=142517892627960&w=2

The problem has me frustrated in that I cannot remember whether it started before or after the HD switch. If it started before, obviously trouble is not with the "passed" HD.

This is a test system, so it doesn't really have anything "important" on it, other than the time invested to reconfigure on account of device IDs being different (e.g. /boot/grub/device.map).

# smartctl -t long /dev/sda
(more than 2 hours later)

# smartctl -H /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

# smartctl -l selftest /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     27071         252915937
# 2  Extended offline    Completed: read failure       90%     27070         252915955
# 3  Conveyance offline  Completed without error       00%     27051         -
# 4  Extended offline    Completed: read failure       90%     27046         460886821
# 5  Short offline       Completed without error       00%         3         -

# smartctl -A /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3210
  3 Spin_Up_Time            0x0027   156   153   021    Pre-fail  Always       -       3191
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       869
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27074
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       862
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       869
194 Temperature_Celsius     0x0022   112   073   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   195   173   000    Old_age   Always       -       306
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

smartctl -x data: http://fm.no-ip.com/Tmp/Hardware/Disk/smart-wdAVGP3200AVVS.txt

--
"The wise are known for their understanding, and pleasant words are persuasive." Proverbs 16:21 (New Living Translation)
Team OS/2 ** Reg. Linux User #211409 ** a11y rocks!
Felix Miata *** http://fm.no-ip.com/

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
On March 15, 2015 5:50:58 AM EDT, Felix Miata <mrmazda@earthlink.net> wrote:
I have a multiboot system that has long delays at selecting Grub menu items. The Linux-bzImage messages appear, and should be shortly followed by the initrd message, but that's where the delay occurs. It can be as long as 3 minutes. It only happens with some installations and not others, all of which are normally being started from the same Grub on sda3, but can be chainloaded to, which doesn't eliminate the delays. Even after redoing Grub setup for 13.2, chainloading to its Grub triggers a reboot.
This HD spent a bit over 3 years in a STB, after which I replaced it, which I did because the STB was stalling inexplicably. After the replacement, I used the WD utility to evaluate the device. It found bad sectors, and claimed it corrected them, that the device passed.
I umounted all partitions except / while booted to TW, and ran e2fsck -f on each installations /. All passed, except the last tried (Kubuntu), which resulted in "error reading block 34307 (attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 10863. Ignore error<y>?" This repeated for blocks 34308, 34309, 590029, 590030, 688204 and several more, followed by directory corruption for inodes 68166, 98879 & yada. Most of these partitions are ext4. 11.4, 12.1 & 12.2 are ext3.
Since it "passed" it has spent at least 20 hours in a PC, after I cloned from its original HD to the "passed" HD and switched them. NAICT, smartctl thinks this HD is OK, but is it really? I can't think of any other explanation for boot delays like this except for trouble reading sectors containing kernels or initrds - installing a new kernel/initrd can switch an installation from suffering the delays to one not, and vice versa.
I asked about this on the linux-ide mailing list 2 weeks ago, but got no feedback there: http://marc.info/?l=linux-ide&m=142517892627960&w=2
The problem has me frustrated in that I cannot remember whether it started before or after the HD switch. If it started before, obviously trouble is not with the "passed" HD.
This is a test system, so it doesn't really have anything "important" on it, other than the time invested to reconfigure on account of device IDs being different (e.g. /boot/grub/device.map).
# smartctl -t long /dev/sda
(more than 2 hours later)
# smartctl -H /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

# smartctl -l selftest /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     27071         252915937
# 2  Extended offline    Completed: read failure       90%     27070         252915955
# 3  Conveyance offline  Completed without error       00%     27051         -
# 4  Extended offline    Completed: read failure       90%     27046         460886821
# 5  Short offline       Completed without error       00%         3         -
Read failures at roughly 125 GB and 230 GB.
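As a rough check on those positions, the LBAs in the self-test log can be converted to byte offsets. This is a minimal sketch, assuming 512-byte logical sectors (which the fdisk listing later in the thread reports); the function name is made up for illustration.

```python
SECTOR_BYTES = 512  # logical sector size; assumption taken from the fdisk output in this thread


def lba_to_offsets(lba, sector_bytes=SECTOR_BYTES):
    """Return (byte offset, decimal GB, binary GiB) for the start of an LBA."""
    offset = lba * sector_bytes
    return offset, offset / 10**9, offset / 2**30


# LBA_of_first_error values from the self-test log quoted above.
for lba in (252915937, 252915955, 460886821):
    offset, gb, gib = lba_to_offsets(lba)
    print(f"LBA {lba}: byte {offset}, {gb:.1f} GB, {gib:.1f} GiB")
```

Whether the result is read in decimal GB or binary GiB shifts the figures by several percent, which is well within "roughly".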
# smartctl -A /dev/sda
smartctl 6.3 2015-02-08 r4039 [x86_64-linux-3.19.1-1-desktop] (SUSE RPM)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3210
  3 Spin_Up_Time            0x0027   156   153   021    Pre-fail  Always       -       3191
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       869
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27074
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       862
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       869
194 Temperature_Celsius     0x0022   112   073   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   195   173   000    Old_age   Always       -       306
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0
smartctl -x data: http://fm.no-ip.com/Tmp/Hardware/Disk/smart-wdAVGP3200AVVS.txt
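The attribute table can look healthy while still carrying a warning, because the overall-health verdict is generally tied to normalized values crossing their thresholds, and Current_Pending_Sector here has a threshold of 000. The following is a minimal sketch of a reviewer for `smartctl -A` text; the choice of attributes 5, 197 and 198 as "watched" is a common rule of thumb and an assumption of this example, not something smartctl itself enforces, and the function names are hypothetical.

```python
# Four rows copied from the smartctl -A output quoted above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3210
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   195   173   000    Old_age   Always       -       306
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

# Assumption: attributes whose raw count should normally be zero.
WATCHED = {5, 197, 198}


def parse_attributes(text):
    """Parse smartctl -A attribute rows into {id: fields}; other lines are skipped."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],
        }
    return rows


def review(rows):
    """Return (id, name, raw, below_threshold) for rows that deserve a closer look."""
    findings = []
    for attr_id, row in sorted(rows.items()):
        below = row["thresh"] > 0 and row["value"] <= row["thresh"]
        raw_nonzero = (
            attr_id in WATCHED and row["raw"].isdigit() and int(row["raw"]) > 0
        )
        if below or raw_nonzero:
            findings.append((attr_id, row["name"], row["raw"], below))
    return findings


for finding in review(parse_attributes(SAMPLE)):
    print(finding)
```

On the sample rows only attribute 197 is flagged, and it is flagged on its raw count rather than on a threshold crossing, which is consistent with a PASSED verdict alongside 306 pending sectors.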
The tests are finding media errors. The pending bad sector count is 306. On a lot of drives, actual sector reallocation only happens on write, so those bad sectors will stay bad until you write to them.

Personally I'd toss that drive.

If you want to salvage it, boot a live CD / DVD and run a data-destroying sequence like:

  dd if=/dev/zero of=/dev/sda conv=noerror bs=4k

That should force all pending sector reallocations to happen (of course, it wipes your drive in the process).

  dd if=/dev/sda of=/dev/null conv=noerror bs=4k

Consider that a diagnostic; it will report the number of failed sectors in its output. If the second dd reports any errors, think very hard about tossing the drive.

If you are still thinking about keeping it, then use shred to exercise the disk even more.

  shred --verbose -n1 /dev/sda

That will write a single pass of pseudo-random data to the drive. Then repeat the above dd read of the entire drive. If the dd reports any read errors, repeat the shred/dd pair until you get at least one totally clean run.

Greg
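The read-only diagnostic pass can also be sketched in Python, which makes it easy to record where the failures are rather than just counting them. This is a hedged illustration and not a replacement for dd: `read_scan` is a hypothetical helper, it needs read permission on the device, and reads served from the page cache may hide errors the platters would show.

```python
import os


def read_scan(path, block_size=4096):
    """Read path block by block; return (blocks attempted, failed byte offsets).

    A block counts as failed if the read raises OSError or comes back short.
    """
    failed = []
    blocks = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # also reports the size of a Linux block device
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                failed.append(offset)
            else:
                if len(data) < want:
                    failed.append(offset)
            blocks += 1
            offset += block_size  # skip past the block, as dd conv=noerror does
    finally:
        os.close(fd)
    return blocks, failed


# Example use (as root, on an unmounted disk):
#   blocks, failed = read_scan("/dev/sda")
#   print(f"{len(failed)} of {blocks} blocks failed")
```

The failed offsets divided by 512 give LBAs that can be compared with the LBA_of_first_error column in the self-test log.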
On 03/15/2015 08:46 AM, greg.freemyer@gmail.com wrote:
Personally I'd toss that drive.
What's the incentive to keep it?

Ho-Hum 1T drives are around US$50 on the high street outlets in the computer district (College Street, Toronto) and even 4T drives are a LOT less than a new unlocked iPhone. Let's not even talk about this winter's utility bills.

I'm a great one for salvaging old equipment from the Closet of Anxieties but recognise when the error rates get too high, the capacitors start popping or things overheat or make unseemly noises.

A hobby is great; parsimony is great. But don't sweat the hard stuff. Sometimes that's not sweat, it's blood.

--
A: Yes.
> Q: Are you sure?
>> A: Because it reverses the logical flow of conversation.
>>> Q: Why is top posting frowned upon?
Anton Aylward composed on 2015-03-15 09:09 (UTC-0400):
What's the incentive to keep it?
It doesn't have 4k sectors; it's the last of the breed. I've yet to find a real-world evaluation of the performance penalty imposed by not aligning to 4k. The original has lots of 255/63 partitions. Most of my filesystems use 1k blocks. I usually have only one large-blocksize partition, used for truly large files like CD & DVD ISOs. The backup system includes/assumes 255/63 too. I don't want to conform to 4k before I know what the nuts & bolts cost of non-conformance is. Maybe I'd be happiest by ignoring alignment in test systems, conforming only in 24/7 systems, if even then.
Ho-Hum 1T drives are around US$50 on the high street outlets in the computer district (College Street, Toronto) and even 4T drives are a LOT less than a new unlocked iPhone. Let's not even talk about this winter's utility bills.
There are costs other than purchase price, especially if buying used.
I'm a great one for salvaging old equipment from the Closet of Anxieties but recognise when the error rates get to high, the capacitors start popping or things overheat or make unseemly noises. A hobby is great; parsimony is great. But don't sweat the hard stuff. Sometimes that's not sweat, its blood.
Junk accumulates according to the amount of space available for it to fill. HDs >.5TB in test systems are mega-overkill. Most can get by on well under 200MB.
On 03/15/2015 11:21 AM, Felix Miata wrote:
It doesn't have 4k sectors, last of breed. I've yet to find a real world evaluation of the performance penalty imposed by not aligning to 4k. The original has lots of 255/63 partitions. Most of my filesystems use 1k blocks. I usually have only one large blocksize partition use for truly large files, like CD & dvd isos. The backup system includes/assumes 255/63 too. I don't want to conform to 4k before I know what the nuts & bolts cost of non-conformance is. Maybe I'd be happiest by ignoring alignment in test systems, conforming only in 24/7 systems, if even then.
Have you figures/experience on the benefits of 4K?
Anton Aylward composed on 2015-03-15 11:53 (UTC-0400):
Felix Miata wrote:
It doesn't have 4k sectors, last of breed. I've yet to find a real world evaluation of the performance penalty imposed by not aligning to 4k. The original has lots of 255/63 partitions. Most of my filesystems use 1k blocks. I usually have only one large blocksize partition use for truly large files, like CD & dvd isos. The backup system includes/assumes 255/63 too. I don't want to conform to 4k before I know what the nuts & bolts cost of non-conformance is. Maybe I'd be happiest by ignoring alignment in test systems, conforming only in 24/7 systems, if even then.
Have you figures/experience on the benefits of 4K?
No. That's something I was hoping would materialize out of this thread. I know little other than what Seagate reported long ago here: http://www.seagate.com/tech-insights/advanced-format-4k-sector-hard-drives-m...
On Sun, Mar 15, 2015 at 9:21 AM, Felix Miata <mrmazda@earthlink.net> wrote:
I've yet to find a real world evaluation of the performance penalty imposed by not aligning to 4k.
Umm? There is no good reason to not align, so why does this matter? And second, there are piles of evaluations of this on linux-raid@ and they're all over the map, because the penalty depends highly on the workload and the firmware revision. Some drives are just really slow with the internal read-modify-write, and others have optimized this so it's a minimal effect. The only way to evaluate this is to test it, but why bother? Just align properly. All the tools used these days do this correctly: parted, fdisk, gdisk, and so on.
The original has lots of 255/63 partitions. Most of my filesystems use 1k blocks.
I seriously doubt that. And I don't think it's even possible because the normal kernel pagesize on x86 is 4KiB since forever, and the file system block size has to be a multiple of that which makes 4KiB the minimum fs block size on x86. For tools at least 5 years old, ext4, XFS, and Btrfs all default to 4KiB block sizes. Btrfs still uses a 4KiB block size but now uses a 16KiB node/leaf size by default for storing metadata and inline data.
I usually have only one large blocksize partition use for truly large files, like CD & dvd isos. The backup system includes/assumes 255/63 too.
Are you referring to CHS addressing? This hasn't been used in Linux with any drive for probably 15 years or more.
I don't want to conform to 4k before I know what the nuts & bolts cost of non-conformance is. Maybe I'd be happiest by ignoring alignment in test systems, conforming only in 24/7 systems, if even then.
You are needlessly making a big deal over a long ago solved problem. It occasionally remains an issue with dual boot installs with Windows XP and newish drives, because XP's partitioner starts the first partition at LBA 63, which is not aligned. It sometimes comes up with older enterprise Linux distributions that for some reason dragged their feet applying patches to start the first partition at LBA 2048. But I'd be surprised if you're using any version of openSUSE released in the past ~8 years that does this incorrectly.

But all you have to do is look at the start LBA for each partition. Is it divisible by 8 (or 2048 if you want 1MiB alignment - it's easier)? If yes, you're done. If no, start over.

--
Chris Murphy
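The divisibility test described above is easy to script. A minimal sketch, assuming 512-byte logical sectors; `is_aligned` is a name invented for this example, and the sample start LBAs are 63 (the XP-era default), 2048 (the modern default), and two starts that appear in the fdisk listing elsewhere in this thread.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def is_aligned(start_lba, boundary_bytes=4096, sector_bytes=SECTOR_BYTES):
    """True if the partition's first byte falls on a multiple of boundary_bytes."""
    return (start_lba * sector_bytes) % boundary_bytes == 0


for start in (63, 2048, 578340, 1397718):
    print(
        start,
        "4KiB-aligned" if is_aligned(start) else "not 4KiB-aligned",
        "1MiB-aligned" if is_aligned(start, 1024 * 1024) else "not 1MiB-aligned",
    )
```

With 512-byte sectors the 4 KiB check reduces to "start LBA divisible by 8" and the 1 MiB check to "divisible by 2048", as stated in the message.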
On Sun, Mar 15, 2015 at 10:03 AM, Chris Murphy <lists@colorremedies.com> wrote:
For tools at least 5 years old, ext4, XFS, and Btrfs all default to 4KiB block sizes.
OK so I went back to RHEL 2.1 circa 2003 with kernel 2.4.9 and e2fsprogs feb 2002. By default the installer uses ext3 with a 4KiB block size. 'mke2fs /dev/hdb' results in ext2 with 4KiB block size, so that's the default for ~13 years. However, I was allowed to format with blocksize 1KiB and I could read and write to it. But I can't think of any reason why you'd do this on purpose.

--
Chris Murphy
On Sun, 15 Mar 2015 11:14:38 -0600, Chris Murphy <lists@colorremedies.com> wrote:
'mke2fs /dev/hdb' results in ext2 with 4KiB block size, so that's the default for ~13 years. However, I was allowed to format with blocksize 1KiB and I could read and write to it. But I can't think of any reason why you'd do this on purpose.
To reduce wasted space with small files?
On Sun, Mar 15, 2015 at 11:29 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
On Sun, 15 Mar 2015 11:14:38 -0600, Chris Murphy <lists@colorremedies.com> wrote:
'mke2fs /dev/hdb' results in ext2 with 4KiB block size, so that's the default for ~13 years. However, I was allowed to format with blocksize 1KiB and I could read and write to it. But I can't think of any reason why you'd do this on purpose.
To reduce wasted space with small files?
Fair enough.

For a 102MB partition:

# mkfs.ext3 /dev/sdb1
mke2fs 1.42.11 (09-Jul-2014)
Creating filesystem with 104388 1k blocks and 26104 inodes

# mkfs.ext4 /dev/sdb1
mke2fs 1.42.11 (09-Jul-2014)
Creating filesystem with 104388 1k blocks and 26104 inodes

For a 5.94g LV:

# mkfs.ext4 /dev/mapper/VolGroup00-LogVol00
mke2fs 1.42.11 (09-Jul-2014)
Creating filesystem with 1556480 4k blocks and 389376 inodes

# mkfs.ext4 -b 1024 /dev/mapper/VolGroup00-LogVol00
mke2fs 1.42.11 (09-Jul-2014)
Creating filesystem with 6225920 1k blocks and 389120 inodes

# mke2fs -b 1024 /dev/mapper/VolGroup00-LogVol00
mke2fs 1.42.11 (09-Jul-2014)
Creating filesystem with 6225920 1k blocks and 389120 inodes

Weird, I don't know why -b 1024 gave me an error with a much older e2fsprogs from CentOS 4.0, but that's ancient times so I'm not going to try and find out what's going on. Anyway, clearly the default for a long time now is 4KiB on anything other than a small partition.

--
Chris Murphy
On Sun, Mar 15, 2015 at 10:14 AM, Chris Murphy <lists@colorremedies.com> wrote:
'mke2fs /dev/hdb' results in ext2 with 4KiB block size, so that's the default for ~13 years. However, I was allowed to format with blocksize 1KiB and I could read and write to it. But I can't think of any reason why you'd do this on purpose.
The point of having a 4 KiB filesystem block size is to ensure that data is read and written in contiguous disk blocks. When Kirk McKusick was developing UFS, he noticed a performance hit as large as 50% when parts of a file were not forced to be stored in adjacent blocks. Reading or writing contiguous blocks is the fastest way to access mechanical storage.

On the flip side, making the filesystem block size larger creates the issue of wasted disk space, so 4 KiB can be considered a reasonable tradeoff.

Brandon Vincent
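The wasted-space side of that tradeoff can be put in numbers. A minimal sketch under simplifying assumptions: every file occupies a whole number of blocks, and metadata, inline data and any tail packing are ignored. The file sizes are hypothetical, chosen only to show the effect on small files.

```python
def allocated_bytes(file_size, block_size):
    """Bytes occupied on disk when a file is stored in whole blocks."""
    if file_size == 0:
        return 0
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def slack(sizes, block_size):
    """Total bytes allocated but unused across a set of files."""
    return sum(allocated_bytes(size, block_size) - size for size in sizes)


# Hypothetical file sizes in bytes, mostly small files.
SIZES = [200, 900, 1500, 5000, 70000]

for block_size in (1024, 4096):
    print(f"{block_size}-byte blocks: {slack(SIZES, block_size)} bytes of slack")
```

On average a file wastes about half a block, so the difference only matters when the filesystem holds a very large number of small files.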
On 03/15/2015 01:41 PM, Brandon Vincent wrote:
The point of having a 4 KiB filesystem block size is to ensure that data is read and write in contiguous disk blocks. When Kirk McKusick was developing UFS, he noticed a performance hit of as large as 50% when parts of a file were not forced to be stored in adjacent blocks. Reading or writing from contiguous blocks from a disk is the fastest way to access mechanical storage.
On the flip side, making the filesystem block size larger creates the issue of wasted disk space, so 4 KiB can be
LOL! History repeats itself. I recall this argument being used back in the 1980s when the Berkeley Fast File System replaced the V7 FS. The old V7 had 512 byte blocks, the FFS had 1K blocks. Of course the argument might simply be 'inflation' as we've moved from 16-bit systems to 64-bit systems ...
On Sun, Mar 15, 2015 at 11:14 AM, Chris Murphy <lists@colorremedies.com> wrote:
On Sun, Mar 15, 2015 at 10:03 AM, Chris Murphy <lists@colorremedies.com> wrote:
For tools at least 5 years old, ext4, XFS, and Btrfs all default to 4KiB block sizes.
OK so I went back to RHEL 2.1 circa 2003 with kernel 2.4.9 and e2fsprogs feb 2002. By default the installer uses ext3 with a 4KiB block size.
'mke2fs /dev/hdb' results in ext2 with 4KiB block size, so that's the default for ~13 years. However, I was allowed to format with blocksize 1KiB and I could read and write to it. But I can't think of any reason why you'd do this on purpose.
CentOS 4.0 4, February 2005. Uses Linux kernel 2.6.9-5. It will not let me mke2fs -b 1024; I get a "bad block size" error. And in 4.0 through 4.8, parted starts partition 1 at LBA 63, so I'm going to hope/guess this was fixed by CentOS 5, released in 2007. I know it's fixed by CentOS 6 in 2010.

--
Chris Murphy
Chris Murphy composed on 2015-03-15 10:03 (UTC-0600):
Felix Miata wrote:
The original has lots of 255/63 partitions. Most of my filesystems use 1k blocks.
I seriously doubt that. And I don't think it's even possible because the normal kernel pagesize on x86 is 4KiB since forever, and the file system block size has to be a multiple of that which makes 4KiB the minimum fs block size on x86. For tools at least 5 years old, ext4, XFS, and Btrfs all default to 4KiB block sizes. Btrfs still uses a 4KiB block size but now uses a 16KiB node/leaf size by default for storing metadata and inline data.
From the HD resulting in my OP:
(excerpt from my partitioner's log)
Disk 1 L-Geo from: LVM info at PSN 0x0000003e ACCEPTED as matching DLAT sector
Disk 1 forcing   : cylinders from 38913 to 38914
L-Geo Disk 1 Cyl : 38914 H:255 S:63 Bps:512 Size : 0x25431582 = 305250 MiB
S-Geo Disk 1 Cyl : 38913 H:255 S:63 Bps:512 Size : 0x2542D6C1 = 305242 MiB

# fdisk -l /dev/sda
Disk /dev/sda: 320.1 GB, 320072933376 bytes, 625142448 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0xdf5ee123

Device      Boot      Start        End     Blocks  Id  System
/dev/sda1                63      80324      40131   6  FAT16
/dev/sda2             80325     578339    249007+  1c  Hidden W95 FAT32 (LBA)
/dev/sda3   *        578340    1397654    409657+  83  Linux
/dev/sda4           1397655  625137344  311869845   5  Extended
/dev/sda5           1397718    4466069    1534176  82  Linux swap / Solaris
/dev/sda6           4466133    4482134       8001   1  FAT12
/dev/sda7           4482198    4996214    257008+   6  FAT16
/dev/sda8           4996278   19743884   7373803+  83  Linux
/dev/sda9          19743948   28756349    4506201  83  Linux
/dev/sda10         28756413   40226759   5735173+  83  Linux
/dev/sda11         40226823   78959474   19366326  83  Linux
/dev/sda12         78959538   81417419    1228941  83  Linux
/dev/sda13         81417483   91249199   4915858+  83  Linux
/dev/sda14         91249263  101080979   4915858+  83  Linux
/dev/sda15        101081043  110912759   4915858+  83  Linux
/dev/sda16        110912823  120744539   4915858+  83  Linux
/dev/sda17        120744603  130576319   4915858+  83  Linux
/dev/sda18        130576383  140408099   4915858+  83  Linux
/dev/sda19        140408163  150239879   4915858+  83  Linux
/dev/sda20        150239943  160071659   4915858+  83  Linux
/dev/sda21        160071723  171542069   5735173+  83  Linux
/dev/sda22        171542133  181373849   4915858+  83  Linux
/dev/sda23        252525798  262357514   4915858+  83  Linux
/dev/sda24        262357578  272189294   4915858+  83  Linux
/dev/sda25        272189358  282021074   4915858+  83  Linux
/dev/sda26        282021138  291852854   4915858+  83  Linux
/dev/sda27        291852918  301684634   4915858+  83  Linux
/dev/sda28        301684698  311516414   4915858+  83  Linux
/dev/sda29        311516478  624093119  156288321  83  Linux
/dev/sda30        624093183  625137344     522081   e  W95 FAT16 (LBA)

# tune2fs -l /dev/sda3 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32boot03
Last mounted on: /disks/boot
Block size: 1024
Filesystem created: Mon Sep 10 18:21:41 2012

# tune2fs -l /dev/sda8 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32os122
Last mounted on: /disks/suse122
Block size: 1024
Filesystem created: Mon Sep 10 18:23:39 2012

# tune2fs -l /dev/sda9 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32home09
Last mounted on: /home
Block size: 1024
Filesystem created: Sun Oct 7 09:04:29 2012

# tune2fs -l /dev/sda10 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32os121
Last mounted on: /disks/suse121
Block size: 1024
Filesystem created: Sun Oct 7 09:05:22 2012

# tune2fs -l /dev/sda11 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32pub11
Last mounted on: /pub
Block size: 1024
Filesystem created: Sat Dec 1 08:08:54 2012

# tune2fs -l /dev/sda12 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32usrlcl
Last mounted on: /usr/local
Block size: 1024
Filesystem created: Wed Feb 13 17:19:15 2013

# tune2fs -l /dev/sda13 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32f21
Last mounted on: /
Block size: 1024
Filesystem created: Fri Jul 12 01:34:41 2013

# tune2fs -l /dev/sda14 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32f22
Last mounted on: /
Block size: 1024
Filesystem created: Fri Jul 12 01:34:41 2013

# tune2fs -l /dev/sda15 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32mag5
Last mounted on: /sysroot
Block size: 1024
Filesystem created: Sun Dec 2 21:41:54 2012

# tune2fs -l /dev/sda16 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32os123
Last mounted on: /disks/suse123
Block size: 1024
Filesystem created: Wed Feb 13 17:24:23 2013

# tune2fs -l /dev/sda17 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32os131
Last mounted on: /root
Block size: 1024
Filesystem created: Wed Feb 13 17:24:23 2013

# tune2fs -l /dev/sda18 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32mag4
Last mounted on: /sysroot
Block size: 1024
Filesystem created: Sun Dec 2 21:41:54 2012

# tune2fs -l /dev/sda19 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32os133
Last mounted on: /
Block size: 1024
Filesystem created: Wed Feb 13 17:24:23 2013

# tune2fs -l /dev/sda20 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32f23
Last mounted on: /
Block size: 1024
Filesystem created: Fri Jul 12 01:34:41 2013

# tune2fs -l /dev/sda21 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32os114
Last mounted on: /disks/suse114
Block size: 1024
Filesystem created: Sat Mar 29 02:53:15 2014

# tune2fs -l /dev/sda22 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32os132
Last mounted on: /
Block size: 1024
Filesystem created: Wed Feb 13 17:24:23 2013

# tune2fs -l /dev/sda23 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32buntu
Last mounted on: /
Block size: 4096
Filesystem created: Sat Oct 25 04:06:48 2014

# tune2fs -l /dev/sda29 | egrep 'ume|mounte|ck si|de si'
Filesystem volume name: wv32isos
Last mounted on: /mnt
Block size: 4096
Filesystem created: Mon Mar 2 01:08:37 2015
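With the partition table and block sizes listed, the failing LBAs from the self-test log earlier in the thread can be placed on a partition and converted to a filesystem block number. This is a minimal sketch using the usual (LBA - partition start) * 512 / block size arithmetic; only three partitions are copied in, and `locate` is a hypothetical helper.

```python
SECTOR_BYTES = 512  # from the fdisk output: 512-byte logical sectors

# (name, start LBA, end LBA, filesystem block size), copied from the listings above.
PARTITIONS = [
    ("sda22", 171542133, 181373849, 1024),
    ("sda23", 252525798, 262357514, 4096),
    ("sda29", 311516478, 624093119, 4096),
]


def locate(lba):
    """Return (partition, filesystem block) holding lba, or None if unlisted."""
    for name, start, end, block_size in PARTITIONS:
        if start <= lba <= end:
            fs_block = (lba - start) * SECTOR_BYTES // block_size
            return name, fs_block
    return None


# LBA_of_first_error values from the smartctl self-test log.
for lba in (252915937, 252915955, 460886821):
    print(lba, locate(lba))
```

Two of the three failing LBAs land on sda23 (volume wv32buntu), which appears to be the Kubuntu installation whose e2fsck run reported read errors; the third lands on sda29 (wv32isos).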
On Sun, Mar 15, 2015 at 12:17 PM, Felix Miata <mrmazda@earthlink.net> wrote:
I seriously doubt that. And I don't think it's even possible because the normal kernel pagesize on x86 is 4KiB since forever, and the file system block size has to be a multiple of that which makes 4KiB the minimum fs block size on x86. For tools at least 5 years old, ext4, XFS, and Btrfs all default to 4KiB block sizes. Btrfs still uses a 4KiB block size but now uses a 16KiB node/leaf size by default for storing metadata and inline data.
From the HD resulting in my OP:
Per e2fsprogs, if you are creating a filesystem under 3 MiB, the filesystem gets a block size of 1 KiB, an inode size of 128 bytes, and a bytes-per-inode ratio of 8192. Anything else under 512 MiB is created with a block size of 1 KiB, an inode size of 128 bytes, and a bytes-per-inode ratio of 4096.

For most filesystems (under 4 TiB) the defaults are a block size of 4 KiB, an inode size of 256 bytes, and a bytes-per-inode ratio of 16384. For larger filesystems (under 16 TiB), the only change is a bytes-per-inode ratio of 32768. Finally, if you somehow have a block device greater than 16 TiB, the ratio is set to the largest permissible value for a short integer, 65536.

IIRC, for ext4, the inode size is set to 256 bytes regardless of the size of the filesystem.

Brandon Vincent
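The size tiers described above can be written down as a small lookup. This is a sketch of the description in this message, not of mke2fs itself: the real defaults come from mke2fs.conf and can differ by distribution and e2fsprogs version, the exact handling of sizes that fall on a boundary is an assumption, and the 8192 ratio for the smallest tier is the value in the stock mke2fs.conf to the best of my knowledge.

```python
MIB = 1024**2
TIB = 1024**4


def default_profile(size_bytes):
    """Return the assumed mke2fs defaults for a filesystem of size_bytes."""
    if size_bytes < 3 * MIB:
        return {"type": "floppy", "blocksize": 1024, "inode_size": 128, "inode_ratio": 8192}
    if size_bytes < 512 * MIB:
        return {"type": "small", "blocksize": 1024, "inode_size": 128, "inode_ratio": 4096}
    profile = {"type": "default", "blocksize": 4096, "inode_size": 256, "inode_ratio": 16384}
    if size_bytes >= 16 * TIB:
        profile.update(type="huge", inode_ratio=65536)
    elif size_bytes >= 4 * TIB:
        profile.update(type="big", inode_ratio=32768)
    return profile


# Sizes taken from the mkfs runs earlier in the thread:
# 104388 1k blocks (the 102MB partition) and 1556480 4k blocks (the 5.94g LV).
for size in (104388 * 1024, 1556480 * 4096):
    profile = default_profile(size)
    print(size, profile["type"], profile["blocksize"], size // profile["inode_ratio"])
```

The last column is only an estimate of the inode count; mke2fs rounds per block group, so the real figures reported earlier in the thread (26104 and 389376) are close to it but not identical.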
On Sun, Mar 15, 2015 at 1:17 PM, Felix Miata <mrmazda@earthlink.net> wrote:
Chris Murphy composed on 2015-03-15 10:03 (UTC-0600):
Felix Miata wrote:
The original has lots of 255/63 partitions. Most of my filesystems use 1k blocks.
I seriously doubt that.
# tune2fs -l /dev/sda3 | egrep 'ume|mounte|ck si|de si' Filesystem volume name: wv32boot03 Last mounted on: /disks/boot Block size: 1024 Filesystem created: Mon Sep 10 18:21:41 2012
That's awesome, I love it when I'm wrong. But I'll bet dollars to donuts that mkfs will automatically use a 4096 byte block size when physical sector size is 4096 bytes due to this commit from 2010: http://git.whamcloud.com/gitweb?p=tools/e2fsprogs.git;a=commitdiff_plain;h=9... For a while now the logical/physical sector information has correctly propagated through LVM to mkfs. I did stumble on a blockdev bug with a 512e drive in a USB enclosure. smartctl and hdparm see it as 512 byte logical, 4096 byte physical. But parted, which uses blockdev, sees it as 512 byte physical, and so does mkfs.xfs. The consequence is somewhat minor because the XFS block size is 4096 anyway, but the minimum unit for the journal is set to physical sector size, so it's possible to get undesirable RMW (read-modify-write) by the drive firmware when doing journal writes. That thread is here: http://oss.sgi.com/archives/xfs/2015-01/msg00076.html The workaround is to explicitly set the physical sector size at mkfs time (for XFS). -- Chris Murphy
On Sun, Mar 15, 2015 at 1:09 PM, Chris Murphy <lists@colorremedies.com> wrote:
That's awesome, I love it when I'm wrong. But I'll bet dollars to donuts that mkfs will automatically use a 4096 byte block size when physical sector size is 4096 bytes due to this commit from 2010: http://git.whamcloud.com/gitweb?p=tools/e2fsprogs.git;a=commitdiff_plain;h=9...
That git commit deals with overriding the defaults in e2fsprogs for debugging purposes by setting an environment variable. For block devices under 512 MiB, the default ext2/3/4 block size is still set at 1 KiB. Brandon Vincent
On Sun, Mar 15, 2015 at 6:55 PM, Brandon Vincent <Brandon.Vincent@asu.edu> wrote:
On Sun, Mar 15, 2015 at 1:09 PM, Chris Murphy <lists@colorremedies.com> wrote:
That's awesome, I love it when I'm wrong. But I'll bet dollars to donuts that mkfs will automatically use a 4096 byte block size when physical sector size is 4096 bytes due to this commit from 2010: http://git.whamcloud.com/gitweb?p=tools/e2fsprogs.git;a=commitdiff_plain;h=9...
That git commit deals with overriding the defaults in e2fsprogs for debugging purposes by setting an environment variable. For block devices under 512 MiB, the default ext2/3/4 block size is still set at 1 KiB.
I got the wrong commit, it's this one from e2fsprogs 1.41.12: "When mke2fs is ratcheting down the blocksize for small filesystems, or when a blocksize is specified on the commandline, we should not willingly go below the physical sector size of the device." http://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/commit/?id=bb1158b92ed8a12a... Also, the release notes for e2fsprogs 1.41.10 (February 10, 2010) say it will warn at mkfs time if there's misalignment. http://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/commit/?id=ecced2c3586fed83... -- Chris Murphy
On Sun, Mar 15, 2015 at 7:34 PM, Chris Murphy <lists@colorremedies.com> wrote:
I got the wrong commit, it's this one from e2fsprogs 1.41.12 "When mke2fs is ratcheting down the blocksize for small filesystems, or when a blocksize is specified on the commandline, we should not willingly go below the physical sector size of the device." http://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/commit/?id=bb1158b92ed8a12a...
I tried this on a 1 TiB external drive and e2fsprogs wrongly detected the drive as having a physical sector size of 512 bytes. By using the commit you accidentally referred to earlier, forcing e2fsprogs to see the drive as having a 4 KiB physical sector size did indeed force a 100 MiB partition to be created with a 4 KiB block size. So this commit does indeed seem to work as intended.
Also, the release notes for e2fsprogs 1.41.10 (February 10, 2010) say it will warn at mkfs time if there's misalignment. http://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/commit/?id=ecced2c3586fed83...
With the 4 KiB physical sector size being forced by the MKE2FS_DEVICE_SECTSIZE environment variable, no warning was received when a filesystem was created on a misaligned partition. I have a feeling though that this check might not be respecting the environment variable, so it would probably be useful if someone could verify this on an internal Advanced Format drive that is properly reported by e2fsprogs. These tests were conducted on a RHEL 6 machine with a version of e2fsprogs with the above mentioned commits. Brandon Vincent
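The behaviour tested above (a small filesystem ending up with 4 KiB blocks once the sector size is forced) can be modelled in a few lines. This is a simplified model under stated assumptions, not mke2fs source: `pick_block_size` is a hypothetical name, and the only environment variable it honours is the MKE2FS_DEVICE_SECTSIZE one named in the post.

```python
def pick_block_size(size_default, detected_sector_size, env=None):
    """Model of 'do not go below the device sector size'.

    size_default: block size the size-based defaults would choose.
    detected_sector_size: sector size reported for the device, which a
        USB bridge may get wrong, as discussed in this thread.
    env: optional mapping standing in for the process environment.
    """
    env = env or {}
    # An explicit override wins over whatever the device reported.
    sector = int(env.get("MKE2FS_DEVICE_SECTSIZE", detected_sector_size))
    # Never choose a block smaller than the sector size in effect.
    return max(size_default, sector)
```

With a 100 MiB partition (size default 1024) on a drive reported as 512 bytes the model keeps 1024, and with the override set to 4096 it returns 4096, matching the result reported for the external drive.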
On Sun, 15 Mar 2015 20:39:10 -0700 Brandon Vincent <Brandon.Vincent@asu.edu> wrote:
On Sun, Mar 15, 2015 at 7:34 PM, Chris Murphy <lists@colorremedies.com> wrote:
I got the wrong commit, it's this one from e2fsprogs 1.41.12 "When mke2fs is ratcheting down the blocksize for small filesystems, or when a blocksize is specified on the commandline, we should not willingly go below the physical sector size of the device." http://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/commit/?id=bb1158b92ed8a12a...
I tried this on a 1 TiB external drive and e2fsprogs wrongly detected the drive as having a physical sector size of 512 bytes.
Do you mean that kernel information was correct but e2fsprogs' not?
On Sun, Mar 15, 2015 at 8:46 PM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
Do you mean that kernel information was correct but e2fsprogs' not?
The I/O topology of the kernel reports the physical sector size as 512 bytes (which is incorrect). e2fsprogs seems to use this value. hdparm reports a correct value (via S.M.A.R.T.). So the problem is with the hard drive enclosure, not e2fsprogs (I should have made that clear). On many occasions I've seen USB-to-SATA bridges mess with how the sectors are reported to the system. Brandon Vincent
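For anyone wanting to see what the kernel's I/O topology reports, the values live under /sys/block/DEVICE/queue/. The sketch below reads them and labels the combination; the function names are made up for illustration, and the `sysfs_root` parameter exists only so the code can be pointed at a test directory. Comparing the result with what hdparm or smartctl says about the same drive is how a mismatch like the one described above would show up.

```python
from pathlib import Path


def kernel_sector_sizes(device, sysfs_root="/sys"):
    """Return (logical, physical) sector sizes as the kernel reports them."""
    queue = Path(sysfs_root) / "block" / device / "queue"
    logical = int((queue / "logical_block_size").read_text())
    physical = int((queue / "physical_block_size").read_text())
    return logical, physical


def describe(logical, physical):
    """Label a logical/physical pair using the terms from this thread."""
    if (logical, physical) == (512, 512):
        return "512n, or a bridge hiding a 512e drive"
    if (logical, physical) == (512, 4096):
        return "512e"
    if (logical, physical) == (4096, 4096):
        return "4Kn, or a bridge presenting 4 KiB sectors"
    return "unusual combination"
```

A call such as `describe(*kernel_sector_sizes("sda"))` would print the kernel's view for the first disk, assuming a device of that name exists.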
On Sun, Mar 15, 2015 at 10:03 PM, Brandon Vincent <Brandon.Vincent@asu.edu> wrote:
On Sun, Mar 15, 2015 at 8:46 PM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
Do you mean that kernel information was correct but e2fsprogs' not?
The I/O topology of the kernel reports the physical sector size as 512 bytes (which is incorrect). e2fsprogs seems to use this value.
That's this thread I referred to earlier: http://oss.sgi.com/archives/xfs/2015-01/msg00076.html Your 512e drive is in a USB enclosure I take it? I'm going to guess that there's something about the bridge chipset that's confusing the kernel, whereas hdparm and smartctl get this information directly (by using SCT commands to query the drive). -- Chris Murphy
On Sun, Mar 15, 2015 at 9:48 PM, Chris Murphy <lists@colorremedies.com> wrote:
Your 512e drive is in a USB enclosure I take it? I'm going to guess that there's something about the bridge chipset that's confusing the kernel, whereas hdparm and smartctl get this information directly (by using SCT commands to query the drive).
Yes. Those USB enclosures can be frustrating. Usually I see 512e drives show up with a physical/logical sector size of 512 bytes. Some of them actually report a 4 KiB physical/logical sector size when you connect a large-capacity drive, to permit using MBR partitioning to address a drive over 2 TiB on certain proprietary operating systems. Brandon Vincent
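The reason a bridge might present 4 KiB logical sectors comes down to arithmetic: MBR partition entries store start and length as 32-bit sector counts, so the addressable size scales with the logical sector size. A minimal sketch of that calculation (the names are illustrative only):

```python
MBR_MAX_SECTORS = 2 ** 32  # 32-bit sector fields in an MBR partition entry
TIB = 1024 ** 4


def mbr_limit_bytes(logical_sector_size):
    """Largest span an MBR entry can describe at a given logical sector size."""
    return MBR_MAX_SECTORS * logical_sector_size


def mbr_limit_tib(logical_sector_size):
    """Same limit expressed in TiB."""
    return mbr_limit_bytes(logical_sector_size) / TIB
```

At 512 bytes per sector the limit works out to 2 TiB; at 4096 bytes it is 16 TiB. It also follows that a partition table written through such a bridge counts in 4 KiB units, so it is misread when the drive is attached directly and counts in 512-byte units.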
On Sun, Mar 15, 2015 at 11:10 PM, Brandon Vincent <Brandon.Vincent@asu.edu> wrote:
On Sun, Mar 15, 2015 at 9:48 PM, Chris Murphy <lists@colorremedies.com> wrote:
Your 512e drive is in a USB enclosure I take it? I'm going to guess that there's something about the bridge chipset that's confusing the kernel, whereas hdparm and smartctl get this information directly (by using SCT commands to query the drive).
Yes. Those USB enclosures can be frustrating.
Mine is a cheapie $5 PUREX. It reports 4096 byte physical sector as 512 bytes. But smartctl works as does hdparm, which is an improvement over a previous $40 enclosure which got all of this stuff wrong (but it could maybe take a bullet, or at least a car could run over it).
Usually I see 512e drives show up with a physical/logical sector size of 512 bytes. Some of them actually report a 4 KiB physical/logical sector size when you connect a large-capacity drive, to permit using MBR partitioning to address a drive over 2 TiB on certain proprietary operating systems.
Yeah, what a bad hack. If the drive is removed and directly attached, it's no longer readable. They should just report what the drive reports. -- Chris Murphy
On Sun, Mar 15, 2015 at 10:17 PM, Chris Murphy <lists@colorremedies.com> wrote:
Yeah, what a bad hack. If the drive is removed and directly attached, it's no longer readable. They should just report what the drive reports.
That is the worst problem. Have fun trying to recover data when the cheap bridge adapter board fails, and the only adapters you can find properly report sector size. You'd probably need to use dd and a loop mount to pull data off. Brandon Vincent
Chris Murphy composed on 2015-03-15 10:03 (UTC-0600):
Felix Miata wrote:
I've yet to find a real world evaluation of the performance penalty imposed by not aligning to 4k.
Umm? There is no good reason to not align,
Maybe for you.
so why does this matter?
Because to me backwards compat across many multiboot installations is a plus. I see a partition size in inventory, and have a good idea what it means. Identical sizes are easily cloned as testing various hardware combinations dictates. Switching from legacy to aligned means two incompatible inventory sets.
And for two, there's piles of evaluation of this on linux-raid@ and it's all over the map because the penalty depends highly on workload, and the firmware revision. Some drives are just really slow with the internal read-modify-write, and others have optimized this so it's a minimal effect. The only way to evaluate this is to test it,
I'd like to see whatever data exists. Things get in the way of looking for it. Penalty for sustaining status quo isn't high.
but why bother? Just align properly. All the tools used these days do this correctly, parted, fdisk, gdisk, and so on.
I don't use "tools". I use one tool. If I want aligned, I get aligned, but it isn't forced upon me as if the only true way.
I usually have only one large-blocksize partition, used for truly large files, like CD & DVD ISOs. The backup system includes/assumes 255/63 too.
Are you referring to CHS addressing? This hasn't been used in Linux with any drive for probably 15 years or more.
It produced an understood alignment convention heavily implemented here long before HD makers decided to complicate life by changing physical sector sizes.
I don't want to conform to 4k before I know what the nuts & bolts cost of non-conformance is. Maybe I'd be happiest by ignoring alignment in test systems, conforming only in 24/7 systems, if even then.
You are needlessly making a big deal over a long ago solved problem.
On the contrary, the new solution created foreseeable and unforeseeable fallout, as so often happens as development evolves and developers decide backwards compat is too much trouble to keep.
It occasionally remains an issue with dual boot installs with Windows XP and newish drives, because XP's partitioner starts the first partition at LBA 63, which is not aligned.
It's aligned, just not to a 4k multiple. It's aligned to a multiple of 255*63 or 240*63, de facto standards before the HD makers decided everyone needed only huge storage devices, for which 512 byte physical sectors had to dodo.
It sometimes comes up with older enterprise linux that for some reason dragged their feet applying patches to start the 1st sector at LBA 2048. But I'd be surprised if you're using any version of openSUSE released in the past ~8 years that does this incorrectly.
As I don't use but one partitioner, all partitioning is necessarily done before any OS installers are run. How they do what they do is irrelevant here.
But all you have to do is look at the start LBA for each partition. Is it divisible by 8 (or 2048 if you want 1MiB alignment - it's easier)? If yes, you're done. If no, start over.
Starting over costs. I avoid fixing what ain't broke, leaving more time to find what does need fixing. FWIW, thread's HD is out, replacement in. Problems that are obviously a result of a bad HD are gone, but the inexplicable long delays at kernel/initrd load time remain for TW, 13.2, Cauldron and Rawhide. 11.4, 12.1, 12.2, 12.3, 13.1, F21 & F22 load right up normally. Getting this far has been an extra pain due to the initrd married to root's UUID problem we've reported.[1] I hadn't remembered to undo hostonly on the Mageias as I had on the Fedoras, and didn't need to think about it for TW or 13.2. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1187007 https://bugs.mageia.org/show_bug.cgi?id=15500 -- Felix Miata
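The check quoted in this exchange (is the start LBA divisible by 8, or by 2048 for 1 MiB alignment) is easy to script against a partition listing. The sketch below assumes 512-byte logical sectors unless told otherwise; the function names are illustrative and not from any partitioning tool.

```python
def is_4k_aligned(start_lba, logical_sector_size=512):
    """True if the partition's first byte falls on a 4 KiB boundary."""
    return (start_lba * logical_sector_size) % 4096 == 0


def is_1mib_aligned(start_lba, logical_sector_size=512):
    """True if the partition's first byte falls on a 1 MiB boundary."""
    return (start_lba * logical_sector_size) % (1024 * 1024) == 0


def cylinder_start(cylinder, heads=255, sectors_per_track=63):
    """Start LBA of a cylinder under the legacy 255/63 geometry."""
    return cylinder * heads * sectors_per_track
```

Applied to the numbers in this thread: LBA 63 is not 4 KiB aligned, LBA 2048 is, and because 255*63 = 16065 is odd, a partition starting on a 255/63 cylinder boundary is 4 KiB aligned only when the cylinder number is a multiple of 8.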
On Mon, Mar 16, 2015 at 2:37 AM, Felix Miata <mrmazda@earthlink.net> wrote:
Chris Murphy composed on 2015-03-15 10:03 (UTC-0600):
so why does this matter?
Because to me backwards compat across many multiboot installations is a plus. I see a partition size in inventory, and have a good idea what it means.
None of this changes, the logical sector size is still 512 bytes.
but why bother? Just align properly. All the tools used these days do this correctly, parted, fdisk, gdisk, and so on.
I don't use "tools". I use one tool. If I want aligned, I get aligned, but it isn't forced upon me as if the only true way.
What tool? And you're not forced, you can choose to not align, but the firmware will do read-modify-write internally as a consequence. If you're asking for no changes in the world, ever, well, I can't help you with that. Pick another area of interest that doesn't ever change anything? This change was inevitable, and it was necessary.
Are you referring to CHS addressing? This hasn't been used in Linux with any drive for probably 15 years or more.
It produced an understood alignment convention heavily implemented here long before HD makers decided to complicate life by changing physical sector sizes.
You mean complicate life with zone bit recording, which completely divorced any relationship between CHS and the actual on disk location of a physical sector? Because CHS went the way of the dodo bird, about 20 years ago in this respect. It was only kept on in virtual form, in effect it was logical addressing before LBA arrived. The physical sector size changing has absolutely nothing to do with CHS being completely irrelevant.
I don't want to conform to 4k before I know what the nuts & bolts cost of non-conformance is. Maybe I'd be happiest by ignoring alignment in test systems, conforming only in 24/7 systems, if even then.
You are needlessly making a big deal over a long ago solved problem.
On the contrary, the new solution created foreseeable and unforeseeable fallout, as so often happens as development evolves and developers decide backwards compat is too much trouble to keep.
You're confused. The whole point of 512e drives is *expressly* for backwards compatibility. Otherwise the manufacturers would have gone directly to 4Kn drives.
It occasionally remains an issue with dual boot installs with Windows XP and newish drives, because XP's partitioner starts the first partition at LBA 63, which is not aligned.
It's aligned, just not to a 4k multiple. It's aligned to a multiple of 255*63 or 240*63, de facto standards before the HD makers decided everyone needed only huge storage devices, for which 512 byte physical sectors had to dodo.
Your CHS multiples are aligned to nothing at all physically on the drive. It's been a complete abstraction, since at least two major epochs in computing ago.
It sometimes comes up with older enterprise linux that for some reason dragged their feet applying patches to start the 1st sector at LBA 2048. But I'd be surprised if you're using any version of openSUSE released in the past ~8 years that does this incorrectly.
As I don't use but one partitioner, all partitioning is necessarily done before any OS installers are run. How they do what they do is irrelevant here.
But all you have to do is look at the start LBA for each partition. Is it divisible by 8 (or 2048 if you want 1MiB alignment - it's easier)? If yes, you're done. If no, start over.
Starting over costs. I avoid fixing what ain't broke, leaving more time to find what does need fixing.
There are still piles of 512n drives floating around out there, and some of them aren't exactly small. I know they go up to 2TB at least; there might even be some 3TB+ 512n drives. -- Chris Murphy
Chris Murphy composed on 2015-03-16 13:20 (UTC-0600):
Felix Miata wrote:
Chris Murphy composed on 2015-03-15 10:03 (UTC-0600):
so why does this matter?
Because to me backwards compat across many multiboot installations is a plus. I see a partition size in inventory, and have a good idea what it means.
None of this changes, the logical sector size is still 512 bytes.
Only when everything behaves as expected. I don't remember on which mailing list, Google hit or bug tracker, but less than 3 days ago I read about logical vs physical sector size reporting incompatibility among various Linux tools due to interposing a USB case between HD and kernel. Besides thinking it was something Karel Zak wrote, I'm unable to remember enough to find the cite. :-(
but why bother? Just align properly. All the tools used these days do this correctly, parted, fdisk, gdisk, and so on.
I don't use "tools". I use one tool. If I want aligned, I get aligned, but it isn't forced upon me as if the only true way.
What tool?
This I've answered publicly many times, but I try not to look like I'm trying to sell it when posting. I only have a license to it, which I renew for each major upgrade. One of the reasons I use it exclusively is its great automatic logging, which enables me to easily inventory what I have, and find what I need when I need it for repro and triage on appropriate hardware and software combinations. http://fm.no-ip.com/Tmp/Dfsee/big41L19.txt is the excerpt I have from its most recent logs, used as inventory for the HD that replaced the one instigating this thread. http://fm.no-ip.com/Tmp/Dfsee/dfsL011.txt is a complete unedited log from a fleeting session on a 4k aligned HD.
I don't want to conform to 4k before I know what the nuts & bolts cost of non-conformance is. Maybe I'd be happiest by ignoring alignment in test systems, conforming only in 24/7 systems, if even then.
You are needlessly making a big deal over a long ago solved problem.
On the contrary, the new solution created foreseeable and unforeseeable fallout, as so often happens as development evolves and developers decide backwards compat is too much trouble to keep.
You're confused.
About many things, yes, but you're misinterpreting what I wrote.
The whole point of 512e drives is *expressly* for backwards compatibility. Otherwise the manufacturers would have gone directly to 4Kn drives.
Of course, but like everything people design, it isn't perfect. Compromise, mother of mediocrity, isn't always avoidable.
It's aligned, just not to a 4k multiple. It's aligned to a multiple of 255*63 or 240*63, de facto standards before the HD makers decided everyone needed only huge storage devices, for which 512 byte physical sectors had to dodo.
Your CHS multiples are aligned to nothing at all physically on the drive. It's been a complete abstraction, since at least two major epochs in computing ago.
4k, like 512, is also an abstraction. Even though the bytes are advertised as, and universally considered to be, discrete bits, on or off with no state in between, the recording surfaces at the foundational level are analog, a discovery I made only a week or so ago.
Chris Murphy composed on 2015-03-16 13:28 (UTC-0600):
Felix Miata wrote:
Per Jessen composed on 2015-03-16 16:00 (UTC+0100):
Anton Aylward wrote:
I suppose all late model disks are 4K :-)
Probably the larger ones, but I have a bunch of 2TB drives, they're all 512 bytes.
If the actual date of manufacture (often missing from labels of refurbs) is post-2010, it is almost certainly 4k, regardless of size. The only 2TB devices I have were manufactured before 2011. IOW, Anton was on track, everything made in the past 4+ years is 4k.
HGST lists the Ultrastar 7K4000's as a 2013 product. They come in 512n versions up to 4TB. http://www.hgst.com/tech/techlib.nsf/techdocs/FD3F376DC2ECCE68882579D40082C3... http://www.hgst.com/tech/techlib.nsf/techdocs/9E4E119077AD1D8B86256DD0005A2F...
I got snowed on this subject. The very existence of "512n" as a label is news to me as of this thread. How it was enabled:
1-I repeatedly misread http://www.seagate.com/tech-insights/advanced-format-4k-sector-hard-drives-m... confusing the stated 2011 start date agreed to by all HD makers with a full implementation date.
2-Googling 'site:seagate.com 512n sector' produced no hits on seagate.com that I opened that actually included the string "512n".
3-(IIRC) I must never have looked other than 1 & 2 for coverage of the subject. That happened as a result of avoiding WD products, and the WD web site, and a mistaken understanding of the relationship between WD and HGST just resolved in recent hours, leading me to never search the HGST site either. Last new HGST product I purchased was around its first year after it broke ties to IBM.
4-Searching vendor sites when in need of product, missing any mention of the existence of "512n", frustrated by the preponderance of "AF" and "Advanced Format" on product pages.
5-Errant personal assumptions WRT HD production capacity reduction and consequent restoration following 311 earthquake.
My previous thread post is a good example of why mailing list replies belong on list by default[1]. People asking the list for help deserve the auditing that private replies rather well block.
[1] http://marc.merlins.org/netrants/reply-to-useful.html
-- Felix Miata
On 2015-03-17 02:56, Felix Miata wrote:
4k, like 512, is also an abstraction. Even though the bytes are advertised as, and universally considered to be discrete bits, on or off with no state in between, the recording surfaces at the foundational level are analog, a discovery I made only a week or so ago.
All digital electronics are basically analog underneath. Assume a supply voltage of 5 volts. Typically it is said that a #0 is zero volts, and a #1 is 5 volts. Not true. It is rather something like "between 0 and 2 volts it is considered a zero, and above 3 volts it is considered a one". Notice that there is a gray undefined area in the middle. Basically, as speeds rise the voltages may not have time to reach a value that can clearly be interpreted as a one or a zero. Magnetic media is similar... on modern disks with such high densities the device has to "decide" if what it is seeing is a one or a zero. Sometimes it makes mistakes. Apparently, they are so probable that they need error correction methods. You see this by looking at the raw error figures out of smartctl. -- Cheers / Saludos, Carlos E. R. (from 13.1 x86_64 "Bottle" (Minas Tirith))
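The thresholding described above can be shown with a toy function. The 2 V and 3 V limits are the illustrative figures from the post, not values from any particular logic family's datasheet, and `logic_level` is a made-up name.

```python
def logic_level(volts, low_max=2.0, high_min=3.0):
    """Interpret an analog voltage as a bit.

    Returns 0 at or below low_max, 1 at or above high_min, and None in
    the undefined band between them, where a receiver has to guess.
    """
    if volts <= low_max:
        return 0
    if volts >= high_min:
        return 1
    return None


def decode(samples):
    """Decode a list of sampled voltages into bits (None = ambiguous)."""
    return [logic_level(v) for v in samples]
```

A signal that has not finished rising when it is sampled, say 2.5 V, lands in the undefined band, which is the situation error correction exists to recover from.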
On Tue, Mar 17, 2015 at 3:43 PM, Carlos E. R. <robin.listas@telefonica.net> wrote:
On 2015-03-17 02:56, Felix Miata wrote:
4k, like 512, is also an abstraction. Even though the bytes are advertised as, and universally considered to be discrete bits, on or off with no state in between, the recording surfaces at the foundational level are analog, a discovery I made only a week or so ago.
All digital electronics are basically analog underneath. Assume a supply voltage of 5 volts. Typically it is said that a #0 is zero volts, and a #1 is 5 volts. Not true. It is rather something like "between 0 and 2 volts it is considered a zero, and above 3 volts it is considered a one". Notice that there is a gray undefined area in the middle.
Basically, as speeds rise the voltages may not have time to reach a value that can clearly be interpreted as a one or a zero.
Magnetic media is similar... on modern disks with such high densities the device has to "decide" if what it is seeing is a one or a zero. Sometimes it makes mistakes. Apparently, they are so probable that they need error correction methods. You see this by looking at the raw error figures out of smartctl.
You're both leaving out part of the picture. Just like a packet on Ethernet has a header and a footer, a sector on a hard drive does as well. It has basic data to let the drive know exactly what sector on the drive it is, but it also has ECC info that allows multi-bit errors to be corrected. The move from 512 byte sectors to 4KB sectors revolves around the wasted space taken up by ECC info. With a 512 byte sector the ECC data was getting to be too high a percentage of the total raw sector size. By going to 4KB sectors, the ECC data is reduced as a percentage of the total. If you want to actually read in a raw sector with the header and footer intact, I believe you can do that with hdparm's -long_read command. I've never tried to actually parse what that brings back, but you should be able to get at least some basic understanding of it without too much trouble. Greg -- Greg Freemyer
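The overhead argument can be put into numbers. The byte counts below are not from this thread; they are the figures commonly quoted in Advanced Format overviews (about 50 bytes of ECC per 512-byte sector, about 100 bytes per 4 KiB sector, and roughly 15 bytes of gap/sync/address mark per sector) and should be treated as illustrative assumptions, since real drives differ.

```python
def format_efficiency(data_bytes, ecc_bytes, overhead_bytes=15):
    """Fraction of the raw on-disk sector that carries user data.

    overhead_bytes stands for gap, sync and address mark; the default
    is an assumed, commonly quoted figure.
    """
    return data_bytes / (data_bytes + ecc_bytes + overhead_bytes)


# Assumed illustrative figures, see the note above.
EFFICIENCY_512 = format_efficiency(512, 50)
EFFICIENCY_4K = format_efficiency(4096, 100)

print(f"512-byte sectors: {EFFICIENCY_512:.1%} user data")
print(f"4 KiB sectors:    {EFFICIENCY_4K:.1%} user data")
```

Under these assumptions eight 512-byte sectors carry 8 x 65 = 520 bytes of overhead for the same 4 KiB of data that a single large sector stores with 115 bytes, which is the saving the post describes.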
On 2015-03-17 21:23, Greg Freemyer wrote:
On Tue, Mar 17, 2015 at 3:43 PM, Carlos E. R. wrote:
You're both leaving out part of the picture. Just like a packet on Ethernet has a header and a footer, a sector on a hard drive does as well. It has basic data to let the drive know exactly what sector on the drive it is, but it also has ECC info that allows multi-bit errors to be corrected.
Yes, true. I was thinking of that when I said "error correction methods" :-)
The move from 512 byte sectors to 4KB sectors revolves around the wasted space taken up by ECC info. With a 512 byte sector the ECC data was getting to be too high a percentage of the total raw sector size. By going to 4KB sectors, the ECC data is reduced as a percentage of the total.
Ah. That's interesting. -- Cheers / Saludos, Carlos E. R. (from 13.1 x86_64 "Bottle" (Minas Tirith))
On 03/17/2015 03:43 PM, Carlos E. R. wrote:
All digital electronics are basically analog underneath. Assume a supply voltage of 5 volts. Typically it is said that a #0 is zero volts, and a #1 is 5 volts. Not true. It is rather something like "between 0 and 2 volts it is considered a zero, and above 3 volts it is considered a one". Notice that there is a gray undefined area in the middle.
Basically, as speeds rise the voltages may not have time to reach a value that can clearly be interpreted as a one or a zero.
Detailed spec sheets such as the classic Texas Instruments TTL handbook had a lot of information on this. I'm sure a bit of googling will show up more modern, more detailed studies. It's surely been the subject of one of the many PhD theses we see each year. Then there are the capacitive effects on rise times and the inductive effects of the PCB traces acting as 'transmission lines'. After all, we are talking about gigahertz frequencies. -- A: Yes. > Q: Are you sure? >> A: Because it reverses the logical flow of conversation. >>> Q: Why is top posting frowned upon?
On 2015-03-17 23:01, Anton Aylward wrote:
On 03/17/2015 03:43 PM, Carlos E. R. wrote:
Detailed spec sheets such as the classic Texas Instruments TTL handbook had a lot of information on this. I'm sure a bit of googling will show
I know :-) And different logic families have different behaviours and values. But the basics stand: digital electronics are analog underneath. We do not notice this unless we push the limits; normally the digital things stay digital :-) -- Cheers / Saludos, Carlos E. R. (from 13.1 x86_64 "Bottle" (Minas Tirith))
On 03/15/2015 11:21 AM, Felix Miata wrote:
Junk accumulates according to the amount of space available for it to fill.
"Junk" - Stuff you throw away "Stuff" - Junk you keep ;-)
On Sun, Mar 15, 2015 at 1:20 PM, James Knott <james.knott@rogers.com> wrote:
On 03/15/2015 11:21 AM, Felix Miata wrote:
Junk accumulates according to the amount of space available for it to fill.
"Junk" - Stuff you throw away "Stuff" - Junk you keep
;-)
Haha. That's precision. -- Chris Murphy
James Knott composed on 2015-03-15 15:20 (UTC-0400):
Felix Miata wrote:
Junk accumulates according to the amount of space available for it to fill.
"Junk" - Stuff you throw away "Stuff" - Junk you keep
;-)
Anybody besides me who looks at my "stuff" sees junk. :-) -- "The wise are known for their understanding, and pleasant words are persuasive." Proverbs 16:21 (New Living Translation) Team OS/2 ** Reg. Linux User #211409 ** a11y rocks! Felix Miata *** http://fm.no-ip.com/
On 03/15/2015 03:50 PM, Felix Miata wrote:
Anybody besides me who looks at my "stuff" sees junk. :-)
People put what they see as "junk" in what they term the "discards closet", pending disposal. I see it as "opportunities". -- A: Yes. > Q: Are you sure? >> A: Because it reverses the logical flow of conversation. >>> Q: Why is top posting frowned upon?
On 15/03/15 21:55, Anton Aylward wrote:
On 03/15/2015 03:50 PM, Felix Miata wrote:
Anybody besides me who looks at my "stuff" sees junk. :-)
People put what they see as "junk" in what they term the "discards closet", pending disposal.
I see it as "opportunities".
Forwarded to my wife. Not expecting a reply. ;-) -- Bob Williams System: Linux 3.16.7-7-desktop Distro: openSUSE 13.2 (x86_64) with KDE Development Platform: 4.14.3 Uptime: 06:00am up 7:55, 3 users, load average: 0.16, 0.05, 0.06
greg.freemyer@gmail.com composed on 2015-03-15 08:46 (UTC-0400):
# smartctl -l selftest /dev/sda
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure   90%        27071            252915937
# 2  Extended offline    Completed: read failure   90%        27070            252915955
# 3  Conveyance offline  Completed without error   00%        27051            -
# 4  Extended offline    Completed: read failure   90%        27046            460886821
# 5  Short offline       Completed without error   00%        3                -
Read failures at 125 gb and 230 gb roughly
LBA@ interpretation completely escaped my notice. :-p The only two things that got my attention were the "read failure"s and "90%"s. :-(
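For anyone else who skipped over the LBA column: the rough gigabyte positions quoted above can be sketched with plain shell arithmetic. This assumes 512-byte logical sectors, which the thread does not state, and `lba_to_gb` is only an illustrative helper name.

```shell
#!/bin/sh
# Convert an LBA to a byte offset expressed in whole decimal gigabytes
# (rounded down).
# ASSUMPTION: 512-byte logical sectors unless a second argument says otherwise.
lba_to_gb() {
    lba=$1
    sector_size=${2:-512}
    echo $(( lba * sector_size / 1000000000 ))
}

lba_to_gb 252915937   # first failing LBA from self-tests 1 and 2
lba_to_gb 460886821   # first failing LBA from self-test 4
```

With those assumptions the two failing regions land near 129 GB and 235 GB, in line with the rough figures quoted.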
ID# ATTRIBUTE_NAME          FLAG    VALUE  WORST  THRESH  TYPE      UPDATED  WHEN_FAILED  RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033  200    200    140     Pre-fail  Always   -            0
  9 Power_On_Hours          0x0032  063    063    000     Old_age   Always   -            27074
197 Current_Pending_Sector  0x0032  195    173    000     Old_age   Always   -            306

smartctl -x data: http://fm.no-ip.com/Tmp/Hardware/Disk/smart-wdAVGP3200AVVS.txt
The tests are finding media errors. The pending bad sector count is 306.
For a lot of drives actual sector reallocation only happens on write, so those bad sectors will stay bad until you write to them.
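A minimal sketch of pulling those two counters out of the attribute table, assuming the ten-column layout shown in this thread with RAW_VALUE in the last column. `smart_raw` is a hypothetical helper; on a live system its input would come from `smartctl -A /dev/sdX` rather than the sample rows copied here.

```shell
#!/bin/sh
# Print the RAW_VALUE of one SMART attribute from smartctl -A style text on stdin.
# ASSUMPTION: ten whitespace-separated columns, RAW_VALUE in column 10.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Sample rows copied from the thread.
sample='  5 Reallocated_Sector_Ct   0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector  0x0032 195 173 000 Old_age  Always - 306'

pending=$(printf '%s\n' "$sample" | smart_raw 197)
realloc=$(printf '%s\n' "$sample" | smart_raw 5)
echo "pending=$pending reallocated=$realloc"

# Pending sectors with zero reallocations fits the "reallocate only on write" behaviour.
if [ "$pending" -gt 0 ] && [ "$realloc" -eq 0 ]; then
    echo "unreadable sectors are waiting for a write before they can be remapped"
fi
```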
So non-zero pending sectors seems to be one of the redder flags, and 306 of them means really bad news. IIRC, this was my first-ever purchase of WD for personal use. Trouble started before 3 years of uptime, but after warranty expired. My first ever personal purchase of a HD was a Seagate, which on a 12 month warranty expired at age 13 months. Next purchase was Quantum. While it existed, I bought virtually nothing else. I tried a Maxtor shortly after the first Quantum, an experience causing me to classify it with WD and Seagate as do not buy. Naturally the expiration and consolidation of brands since then has forced adaptation. When the choice dwindled to only Seagate and WD for 3.5", I stuck with Seagate as the closer adherent to published standards. I did buy Toshiba 3.5" twice last year. One is still installed. The other had to be returned.
Personally I'd toss that drive.
That will happen as soon as I decide what will be replacing it, and how, discussed in the 4k subthread.
If you want to salvage it, boot a live CD / DVD and run a data-destroying sequence like:
dd if=/dev/zero of=/dev/sda conv=noerror bs=4k
That should force all pending sector reallocates to happen. (of course it wipes your drive in the process).
dd if=/dev/sda of=/dev/null conv=noerror bs=4k
Consider that a diagnostic; it will report the number of failed sectors in the output. If that reports any errors on the second dd, think very hard about tossing the drive.
If you are still thinking about keeping it, then use shred to exercise the disk even more.
shred --verbose -n1 /dev/sda
That will write a single pass of pseudo random data to the drive.
Then repeat the above dd read of the entire drive.
If the dd reports any read errors, repeat the shred/dd pair until you get at least one totally clean run.
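The shred/dd cycle above could be wrapped in a loop along these lines. It is only a sketch: it assumes GNU dd, which as far as I know exits non-zero when any read failed even with conv=noerror, and `max_runs` is a hypothetical safety limit that is not part of the original advice. It is just as destructive as the commands it wraps.

```shell
#!/bin/sh
# Read every block of a device (or file) and discard the data.
# Returns dd's exit status.
# ASSUMPTION: GNU dd reports failure if any read error occurred.
read_pass() {
    dd if="$1" of=/dev/null bs=4k conv=noerror 2>/dev/null
}

# Repeat "one pass of pseudo-random writes, then a full read" until a read
# pass completes cleanly. DESTRUCTIVE: shred overwrites the whole target.
scrub_until_clean() {
    target=$1
    max_runs=${2:-5}   # hypothetical cap so the loop cannot run forever
    run=1
    while [ "$run" -le "$max_runs" ]; do
        shred -n1 "$target" || return 2
        if read_pass "$target"; then
            echo "clean read pass on run $run"
            return 0
        fi
        run=$((run + 1))
    done
    echo "still seeing read errors after $max_runs runs"
    return 1
}

# Example (wipes the disk): scrub_until_clean /dev/sdX
```

Trying it on a scratch file first is a cheap way to check the plumbing before pointing it at a real disk.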
Much thanks for your response. -- "The wise are known for their understanding, and pleasant words are persuasive." Proverbs 16:21 (New Living Translation) Team OS/2 ** Reg. Linux User #211409 ** a11y rocks! Felix Miata *** http://fm.no-ip.com/
On Sun, Mar 15, 2015 at 3:50 AM, Felix Miata <mrmazda@earthlink.net> wrote:
NAICT, smartctl thinks this HD is OK, but is it really?
As long as the drive reallocates on writes, and none of the pre-fail attribute values are at threshold, the drive will say it's healthy. There are now quite a few published papers showing that the drive's self-assessment isn't helpful a significant minority of the time. Of all the problems SMART reports, an increasing number of bad sectors is correlated with prefailure. The fact you now have a corrupt file system is consistent with that.

a. Dispose of the drive. If you do that, consider using hdparm to leverage the drive's built-in ATA Security Erase command. This is the only way to erase data on sectors that no longer have LBA mapping. This is also a ton faster than writing zeros with dd.
http://mackonsti.wordpress.com/2011/11/22/ssd-secure-erase-ata-command

b. Use badblocks -svw. This is destructive, and does ~4 passes, write followed by read. I'd let it do at least two write/read passes before canceling it, or just let it complete. If the drive is actually working normally, no errors will be recorded in dmesg or badblocks. The drive will remap those bad sectors internally at write time. Any failures are essentially fatal. A write failure means there are no more reserve sectors for reallocation. A read failure means the drive firmware incorrectly assessed a persistent write failure for a transient one. And a corruption count means some kind of silent data corruption, like a torn write.

So if it comes up clean, honestly I still wouldn't trust it; I'd relegate it to Btrfs use only (which would have self-healed in this instance so long as the default DUP metadata was used). If it has errors, then obliterate it with hdparm and retire it.

-- Chris Murphy
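For option (a), the usual hdparm sequence can be sketched as a dry run that only prints the commands. The device path and the throwaway password "p" are placeholders of mine, and the erase normally only works if the Security section of `hdparm -I` reports the drive as "not frozen"; check the linked write-up before running any of it for real.

```shell
#!/bin/sh
# Print (do not run) the hdparm steps for an ATA Security Erase.
# ASSUMPTIONS: $1 is the target device; the password is a throwaway value
# that the erase itself clears again.
secure_erase_plan() {
    dev=$1
    pass=${2:-p}
    # 1. Inspect the Security section; the drive must not be "frozen".
    echo "hdparm -I $dev"
    # 2. Set a temporary user password, which enables the security feature set.
    echo "hdparm --user-master u --security-set-pass $pass $dev"
    # 3. Issue the erase; this wipes the whole drive.
    echo "hdparm --user-master u --security-erase $pass $dev"
}

secure_erase_plan /dev/sdX
```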