[opensuse] RAID/XFS performance question
Hi,

I think there are some big-data guys around here, so I thought I'd see if someone has a clue on this:

I'm running a large (55TB) RAID5 set for our data acquisition system. It's 16 4TB SSDs (Samsung 850 EVO) connected to two LSI MegaRAID SAS-3 3008 cards sitting in an Asus Z170-Deluxe mainboard. Disks are in JBOD mode, the RAID is formed via mdadm. Filesystem is XFS.

In general it is a very nice system, but there is one puzzling thing: some of our cameras generate data in single files (~700k/file) and collect the files in a single directory, at 36 files/s. So it ends up with a lot of files.

The problem arrives when the data are to be deleted: doing an rm -rf on a 700GB directory tree (several cameras, several runs, so the data is typically split into some 30-40 subdirectories) takes around 40 MINUTES.

Now I know that XFS is not the fastest for this operation, BUT: the computer has an 'emergency RAID set', in case we run out of space. It is a 6x6TB HDD RAID5, connected to the mainboard's SATA ports, also mdadm RAID with XFS. On this (in general much slower) 28TB RAID, the same dataset gets deleted in around 2 minutes.

I tried (on a different computer though) 'faking' a 16-disk RAID on 4 1TB SSDs with 4 partitions each (on MB SATA ports); that one also deleted 'fast' (1.5 min).

So the big question is: what is wrong with the SSD RAID? Is it the number of disks, the LSI card, the size of the volume? Did anyone see similar problems before? Any input is highly welcome :)

Here's some config info:

transport1:~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Apr 28 12:06:22 2017
     Raid Level : raid5
     Array Size : 58603292160 (55888.45 GiB 60009.77 GB)
  Used Dev Size : 3906886144 (3725.90 GiB 4000.65 GB)
   Raid Devices : 16
  Total Devices : 16
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
         Layout : left-symmetric
     Chunk Size : 512K

transport1:~ # xfs_info /dev/md0
meta-data=/dev/md0    isize=256    agcount=55, agsize=268435328 blks
         =            sectsz=512   attr=2, projid32bit=1
         =            crc=0        finobt=0 spinodes=0
data     =            bsize=4096   blocks=14650823040, imaxpct=1
         =            sunit=128    swidth=1920 blks
naming   =version 2   bsize=4096   ascii-ci=0 ftype=1
log      =internal    bsize=4096   blocks=521728, version=2
         =            sectsz=512   sunit=8 blks, lazy-count=1
realtime =none        extsz=4096   blocks=0, rtextents=0

--
Dr. Peter "Pit" Suetterlin          http://www.astro.su.se/~pit
Institute for Solar Physics         Tel.: +34 922 405 590 (Spain)
P.Suetterlin@royac.iac.es                 +46 8 5537 8559 (Sweden)
Peter.Suetterlin@astro.su.se
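As an aside, the sunit/swidth values above are consistent with the mdadm geometry: sunit=128 blocks of 4096 bytes is the 512K chunk, and swidth=1920 blocks is 15 data disks (16 minus one parity) times that chunk. A sketch of how mkfs.xfs could be given this geometry explicitly if the array were ever recreated (illustrative only, not the command actually used here):

  mkfs.xfs -d su=512k,sw=15 /dev/md0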
On Wed, 3 May 2017 15:39:28 +0100 Peter Suetterlin wrote:
I think there are some big-data guys around here, so I thought I look if someone has a clue on this
You may get lucky, but there are far more big-data guys around on the XFS mailing list, and what's more they have unparalleled experience of tweaking XFS configs. Don't forget to read their 'suggestions' for how to report problems.

HTH,
Dave
Hi Peter

On Wednesday 03 May 2017, Peter Suetterlin wrote:
Hi,
I think there are some big-data guys around here, so I thought I look if someone has a clue on this:
I'm running a large (55TB) RAID5 set for our data acquisition system. It's 16 4TB SSDs (Samsung 850 EVO) connected to two LSI MegaRAID SAS-3 3008 cards sitting in an Asus Z170-deluxe mainboard. Disks are in JBOD mode, RAID is formed via mdadm. Filesystem is XFS.
In general it is a very nice system, but there is one ununderstandable thing: some of our cameras generate data in single files (~700k/file), and collects the files in a single directory, at 36files/s. So it is ending up with a lot of files.
Raid5 is usually a bad choice for writing, especially with many small files, and especially with that many disks. You may google for "raid5 write penalty". The fact that you are using such expensive SSDs indicates that you want performance. Maybe Raid10 would be the better choice.

Regarding XFS, I switched from XFS to EXT4 many years ago because EXT4 was dozens of times faster for creating or deleting many small files. I've heard that XFS has been improved since then, but I've never tested it again. Your particular number "36 files/s" on XFS is something which sounds very familiar to me.

Another thing: hardware controllers may disable the write cache of your HDs by default.
The problem arrives when the data are to be deleted: Doing an rm -rf on a 700GB directory tree (several cameras, several runs, so the data is typically split in some 30-40 subdirectories) takes around 40 MINUTES.
Now I know that XFS is not the fastest for this operation, BUT: The computer has an 'emergency RAID set', in case we run out of space. It is a 6x6TB HDD RAID5, connected to the mainboards SATA ports. Also mdadm RAID with XFS. On this (in general performance much slower) 28TB RAID, the same dataset gets deleted in around 2 minutes.
AFAIR such benchmarks are only comparable if both file systems have the same size and content, or even better they are both empty, newly created. I remember that I could never reproduce my old measurements after the file system was in heavy use for some months.
I tried (on a different computer though) 'faking' a 16-disk RAID on 4 1TB SSDs with 4 partitions each (on MB SATA ports), that one also deleted 'fast' (1.5 min)
So the big question is what is wrong with the SSD RAID? Is it the number of disks, the LSI card, the size of the volume? Did anyone see similar problems before? Any input is highly welcome :)
Very interesting. Are your SSDs officially supported by your controller? Professional controllers usually have a list of certified HD models, and on the other hand they usually have more incompatibility issues than mainstream hardware. I would contact the vendor and ask about known issues.

cu,
Rudi
On Wednesday 03 May 2017, Peter Suetterlin wrote:
Hi,
I think there are some big-data guys around here, so I thought I look if someone has a clue on this:
I'm running a large (55TB) RAID5 set for our data acquisition system. It's 16 4TB SSDs (Samsung 850 EVO) connected to two LSI MegaRAID SAS-3 3008 cards sitting in an Asus Z170-deluxe mainboard.
BTW, one more thought. It looks like you've spent 20-25 thousand EUR for disks and controllers but you are using them on a cheap consumer mainboard. Maybe you should have invested 3-5 thousand EUR more to buy a whole server system designed and manufactured by a professional vendor.
On 2017-05-03 16:39, Peter Suetterlin wrote:
So the big question is what is wrong with the SSD RAID? Is it the number of disks, the LSI card, the size of the volume? Did anyone see similar problems before? Any input is highly welcome :)
I don't use big iron myself, but an idea: what mount options do you use? In particular, I'm thinking about the absence of "noatime". Its absence makes each "read" of a file slower, having to write the access time to disk.

Besides that, it seems that you should consider replacing it with "lazytime". The access time is then written in bunches and when feasible, not "now". Less wear and faster, even on SSD.

In any case, just post here the mount lines of all your mentioned arrays, so that people here can take those into consideration.

Then, I would point you as well to the XFS mailing list. They are very nice and helpful. The volume is sometimes higher because they also post PATCH mails. And the subject line lacks a list identifier.

--
Cheers / Saludos,
Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
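For illustration, a minimal sketch of what the two variants being discussed could look like as /etc/fstab entries; the device and mount point are taken from the mount output posted later in the thread, and whether the running kernel's XFS actually honours lazytime would need checking:

  # variant 1: drop access-time updates entirely
  /dev/md0  /data/disk1  xfs  rw,nodev,noatime,inode64   0 0
  # variant 2: keep timestamps but flush them lazily (if supported)
  /dev/md0  /data/disk1  xfs  rw,nodev,lazytime,inode64  0 0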
On 05/03/2017 11:20 AM, Carlos E. R. wrote:
I don't use big iron myself, but an idea: what mount options do you use? In particular, I'm thinking about the absence of "noatime". It's absence makes each "read" of a file slower, having to write the time to disk.
Besides that, it seems that you should consider replacing with "lazytime". The access time is then written on bunches and when feasible, not "now". Less wear and faster, even on SSD.
There's no practical use scenario for any atime recording any more, and there never really was. Nothing depends on it. It can't serve as any audit, because WHO accessed a file is not recorded. It can't serve as any caching hint, because backups change it and modern caching algorithms ignore it.

You should just turn it off completely with noatime and not replace it with ANYTHING. No point in slowing your disk drives, and certainly no point in wearing out your SSDs.

--
After all is said and done, more is said than done.
On 05/03/2017 07:39 AM, Peter Suetterlin wrote:
Hi,
I think there are some big-data guys around here, so I thought I look if someone has a clue on this:
I'm running a large (55TB) RAID5 set for our data acquisition system. It's 16 4TB SSDs (Samsung 850 EVO) connected to two LSI MegaRAID SAS-3 3008 cards sitting in an Asus Z170-deluxe mainboard. Disks are in JBOD mode, RAID is formed via mdadm. Filesystem is XFS.
In general it is a very nice system, but there is one ununderstandable thing: some of our cameras generate data in single files (~700k/file), and collects the files in a single directory, at 36files/s. So it is ending up with a lot of files.
The problem arrives when the data are to be deleted: Doing an rm -rf on a 700GB directory tree (several cameras, several runs, so the data is typically split in some 30-40 subdirectories) takes around 40 MINUTES.
Hi Peter,

FWIW we also have a requirement to write lots of data. We use systems with SuperMicro X10DRH-iT motherboards, AVAGO (LSI) MegaRAID SAS 9361-8i RAID controllers, and two RAID-6 arrays each consisting of eleven 6T Seagate ST6000NM0095 spinning drives configured with two dedicated hot-swap spares, in a 4U SuperMicro chassis. We also use a two-SSD RAID-1 mirror for the operating system, running from the same RAID controller.

We normally write thousands of 4-GB files and get about 1.6-GB/sec write rates, but I just set up a test writing 1-TB worth of 1-MB files and got a rate of about 1.5-GB/sec. I then sorted the files into nine directories and timed a "rm -r" on the lot and got 33.7 seconds.

From previous experience with mdraid we found it gives significantly less performance than using hardware RAID. The controllers we use support hardware RAID-6 directly. Also note that we don't use RAID-5 due to the single-drive-failure-during-rebuild issue. We can't afford to lose any data.

We haven't had a single problem with XFS. IIRC we tested with EXT4 once and found XFS to be just a bit faster. But that was years ago and memory is fading. I do remember a fatal problem with BTRFS though. It would crash, shred, and burn when writing more than 16-TB in a single partition. That also was years ago.

Is there a way for you to test hardware RAID?

Regards,
Lew
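As a rough sketch of how a comparable small-file write/delete test could be reproduced on the SSD array (file counts, sizes and paths here are illustrative, not Lew's exact procedure):

  # write ~1 TB as 1 MiB files spread over nine directories, then time the delete
  mkdir -p /data/test/dir{1..9}
  for d in /data/test/dir{1..9}; do
      for i in $(seq 1 116000); do
          dd if=/dev/zero of=$d/file$i bs=1M count=1 status=none
      done
  done
  time rm -r /data/test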
On 05/03/2017 07:39 AM, Peter Suetterlin wrote:
transport1:~ # xfs_info /dev/md0 meta-data=/dev/md0 isize=256 agcount=55, agsize=268435328 blks = sectsz=512 attr=2, projid32bit=1 = crc=0 finobt=0 spinodes=0 data = bsize=4096 blocks=14650823040, imaxpct=1 = sunit=128 swidth=1920 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal bsize=4096 blocks=521728, version=2 = sectsz=512 sunit=8 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0
It looks like you have atime turned off, but what about fstrim? How is this being handled? If it is done via the discard mount option, this can really slow things down. You might be better off handling that with a scheduled fstrim. But none of that is evident from what you have told us.

--
After all is said and done, more is said than done.
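A hedged sketch of the usual ways to schedule trimming instead of mounting with discard (the mount point is the one from this thread; a weekly fstrim timer ships with current util-linux, and Pit confirms further down that fstrim already runs weekly on his box):

  # systemd timer shipped with util-linux (weekly by default)
  systemctl enable fstrim.timer

  # or an explicit weekly cron job for just the big array
  # /etc/cron.weekly/fstrim-md0
  #!/bin/sh
  /usr/sbin/fstrim -v /data/disk1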
On 05/03/2017 11:42 AM, Lew Wolfgang wrote:
From previous experience with mdraid we found it gives significantly less performance than using hardware RAID.
My experience is the opposite, at least if your machine has adequate cable headers such that mdraid arrays each get a separate controller. The problem is that the underlying computer improves its speed with upgrades, but nobody upgrades the RAID controllers, and they never get any faster. I've actually had better performance by turning off the hardware RAID logic on a RAID controller and just using its drive ports as separate hardware ports for an mdraid setup.

--
After all is said and done, more is said than done.
On 2017-05-03 20:42, John Andersen wrote:
On 05/03/2017 11:20 AM, Carlos E. R. wrote:
I don't use big iron myself, but an idea: what mount options do you use? In particular, I'm thinking about the absence of "noatime". It's absence makes each "read" of a file slower, having to write the time to disk.
Besides that, it seems that you should consider replacing with "lazytime". The access time is then written on bunches and when feasible, not "now". Less wear and faster, even on SSD.
There's no practical use scenario for any Atime recording any more, and there never really was. Nothing depends on it.
Obviously the kernel devs think otherwise. It is going to be the new default. -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
On 05/03/2017 12:10 PM, Carlos E. R. wrote:
On 2017-05-03 20:42, John Andersen wrote:
On 05/03/2017 11:20 AM, Carlos E. R. wrote:
I don't use big iron myself, but an idea: what mount options do you use? In particular, I'm thinking about the absence of "noatime". It's absence makes each "read" of a file slower, having to write the time to disk.
Besides that, it seems that you should consider replacing with "lazytime". The access time is then written on bunches and when feasible, not "now". Less wear and faster, even on SSD.
There's no practical use scenario for any Atime recording any more, and there never really was. Nothing depends on it.
Obviously the kernel devs think otherwise. It is going to be the new default.
Being the old default doesn't make it useful; atime recording has been useless since forever. Either state a valid use case, or drop this petty "appeal to authority".

--
After all is said and done, more is said than done.
Peter,

Something is wrong. If this was 10 years ago, I would say XFS is really slow at metadata handling. But that hasn't been true for years. How old a kernel are you running?

RE: your on-disk log/journal

Delete speed is very much affected by your log/journal optimization. Looks like you have 2GB for the log, so that seems reasonable as long as you are only journalling metadata (and no data). But it is internal, which is bad for performance.

Do you have a different I/O path where you could put an external log? If so, that might free up bandwidth going to the LSI. The log gets hit really hard during heavy deletion activity, so if I were you I'd invest in an NVMe PCI Express adapter card ($25) and an NVMe SSD (under $100 for one way bigger than you need just for the external log).

RE: other than your on-disk log/journal

What mount options are you using ("mount | grep md0")?

Would you be willing to increase your RAM-based log/journal buffer space (mount -o logbufs=8,logbsize=256k ...)? XFS uses the RAM log buffers to stage and sort journal updates before they get sent to the on-disk log/journal. An entire buffer is written to the on-disk log as an atomic action, so the bigger the log buffers, the more efficient. The values listed are the maximum, and that is only 2 MB of RAM for the log/journal staging area. If that is too big, keep the large logbsize and use fewer logbufs.

Greg
--
Greg Freemyer
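For concreteness, a hedged sketch of what an external log would look like; the NVMe partition name is purely illustrative, and as comes up later in the thread this is a mkfs-time decision, so it would mean recreating the filesystem:

  # create the filesystem with its log on a separate NVMe partition
  mkfs.xfs -l logdev=/dev/nvme0n1p3,size=521728b -d su=512k,sw=15 /dev/md0
  # the log device must then also be named at mount time
  mount -o logdev=/dev/nvme0n1p3,noatime,inode64 /dev/md0 /data/disk1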
Hi Rudi,

Ruediger Meier wrote:
Raid5 is usually a bad choice for writing, especially with many small files, and especially with that many disks. You may google for "read5 write penalty".
Write speed by itself is fine, it can definitely write faster than our system can deliver data for it (bonnie++ puts it at 1.5GB/s, normal workload is 500-600MB/s).
The fact that you are using such expensive SSDs indicates that you want performance. Maybe Raid10 should be the better choice.
Mostly for storage space, plus some heat concerns with disks. But RAID10 would waste too much space...
Another thing hardware controllers may disable the write cache of your HDs by default.
Thanks for the hint - I'll investigate that.
Now I know that XFS is not the fastest for this operation, BUT: The computer has an 'emergency RAID set', in case we run out of space. It is a 6x6TB HDD RAID5, connected to the mainboards SATA ports. Also mdadm RAID with XFS. On this (in general performance much slower) 28TB RAID, the same dataset gets deleted in around 2 minutes.
AFAIR such benchmarks are only comparable if both file systems have the same size and content, or even better they are both empty, newly created. I remember that I could never reproduce my old measurements after the file system was in heavy use for some months.
Sure, but a factor of 20 difference is somewhat difficult to explain...
Very interesting. Are your SSDs officially supported by your controller? Professional controllers have usually a list of certified HD models and on the other hand they have usually more incompatibility issues than mainstream hardware. I would contact the vendor and ask about known issues.
I guess that mostly applies if you use the HW RAID of the cards - we only use them as 'SATA port multipliers'....

Cheers,
Pit
Carlos E. R. wrote:
On 2017-05-03 16:39, Peter Suetterlin wrote:
So the big question is what is wrong with the SSD RAID? Is it the number of disks, the LSI card, the size of the volume? Did anyone see similar problems before? Any input is highly welcome :)
I don't use big iron myself, but an idea: what mount options do you use?
Yuck, there's always some info you forget to add.
In particular, I'm thinking about the absence of "noatime". It's absence makes each "read" of a file slower, having to write the time to disk.
But noatime is explicitly specified, yes. And even if not, the default would be relatime which, following the manpage, should be almost the same...
In any case, just post here the mount lines of all your mentioned arrays, so that people here can take those into consideration.
transport1:/data/disk2/ISP # mount|egrep md[01]
/dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota)
/dev/md1 on /data/disk2 type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=5120,noquota)

md1 is the HD one - that was added on the fly and is mounted with 'default'
Then, I would point you as well to the XFS mail list. They are very nice and helpful. The volume is sometimes higher because they post also PATCH mails. And the subject line lacks a list identifier.
Yeah, I wasn't sure (and still am not) whether this is (only) an XFS issue. And I try to keep the number of subscribed lists low :o

Pit
On Wed, May 3, 2017 at 4:59 PM, pit wrote:
Then, I would point you as well to the XFS mail list. They are very nice and helpful. The volume is sometimes higher because they post also PATCH mails. And the subject line lacks a list identifier.
Yeah, I wasn't sure (and still not am) wether this is (only) an XFS issue. And try to keep the number of subscribed lists low :o
They allow non-subscribers to post and follow a reply-all rule, so you can see all the replies. Just send an email with your questions straight to: linux-xfs@vger.kernel.org

If you do that, I'm curious what the solution is, so please post it back here as a solution.

Thanks
Greg
--
Greg Freemyer
On 05/03/2017 01:59 PM, pit wrote:
/dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota)
/dev/md1 on /data/disk2 type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=5120,noquota)
md1 is the HD one - that was added on the fly and is mounted with 'default'
noatime and relatime are not the same. You are forcing the raid to run in such a way that one disk is updated differently than the other, and has different content, not in the data but in the inodes and metadata.

--
After all is said and done, more is said than done.
Hi Lew,

thanks a lot for this quick crosscheck!

Lew Wolfgang wrote:
FWIW we also have a requirement to write lots of data. We use systems with SuperMicro X10DRH-iT motherboards, AVAGO (LSI) MegaRAID SAS 9361-8i RAID controllers, and two RAID-6 arrays consisting of eleven-each 6T Seagate ST6000NM0095 spinning drives configured with two dedicated hot-swap spares, in 4U SuperMicro chassis. We also use a two-SSD RAID-1 mirror for the operating system, running from the same RAID controller.
We normally write thousands of 4-GB files and get about 1.6-GB/sec write rates, but I just set up a test writing 1-TB worth of 1-MB files and got a rate of about 1.5-GB/sec.
1.5GB/s is what bonnie++ reports for my set, too. Might be overoptimistic, but it definitely writes 500-600MB/s over hours.
I then sorted the files into nine directories and timed a "rm -r" on the lot and got 33.7-seconds.
Yes, that's about where I would like to end up. Even the 2min I got on the HDD RAID would be OK...
Is there a way for you to test hardware RAID?
Not easily. The machine is (almost) permanently loaded with data, so I have to wait for good moments if I want to change the configuration. Apart from that, we (well, it was before my time at the institute) got bitten by a HW RAID failure (broken card) where data was inaccessible using 'normal' methods. I'd have to convince my colleagues :)

Pit
John Andersen wrote:
On 05/03/2017 07:39 AM, Peter Suetterlin wrote:
transport1:~ # xfs_info /dev/md0 meta-data=/dev/md0 isize=256 agcount=55, agsize=268435328 blks = sectsz=512 attr=2, projid32bit=1 = crc=0 finobt=0 spinodes=0 data = bsize=4096 blocks=14650823040, imaxpct=1 = sunit=128 swidth=1920 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal bsize=4096 blocks=521728, version=2 = sectsz=512 sunit=8 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0
It looks like you have atime turned off, but what about fstrim? How is this being handled? If it is done via the discard option, this can really slow things down. You might be better off handling that with a scheduled fstrim.
Yes, it's mounted with noatime. No discard.

/dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota)

fstrim is run via a cronjob/systemd weekly, but the current FS/RAID had just been created (Apr 28).
But none of that is evident by what you have told us.
My bad :( I promise to improve...

Pit
Hi Greg,

Greg Freemyer wrote:
Peter,
Something is wrong.
I agree...
If this was 10 years ago, I would say XFS is really slow at metadata handling.
But that hasn't been true for years. How old of a kernel are you running?
transport1:~ # uname -a
Linux transport1 4.10.4-1-default #1 SMP PREEMPT Sat Mar 18 12:29:57 UTC 2017 (e2ef894) x86_64 x86_64 x86_64 GNU/Linux

It's a Leap 42.2 machine, but with a TW kernel.
RE: Your on-disk log/journal
Delete speed is very much affected by your log/journal optimization.
That is good to know!
Looks like you have 2GB for the log, so that seems reasonable as long as you are only journalling metadata (and no data).
I assume so, unless the default would be to journal data (?)
But it is internal, which is bad for performance.
Do you have a different i/o path where you could put an external log? If so, that might free up bandwidth going to the LSI.
Not really. I have 6+2x8 SATA ports, and all are used.
The log gets hit really hard during heavy deletion activity, so if I were you I'd invest in a NVME PCI express card ($25) and a NVME SSD (under $100 for one way bigger than you need just for the external log).
I guess I can save the money, the system disk is already such an NVMe. I'd have to shrink some partition, but maybe for a quick test I could use the swap partition (32GB)? But that is a mkfs-time option, isn't it? So I cannot switch to an external log without losing the data on the disks?
RE: other than your on disk log/journal
What mount options are you using "mount | grep md0"
As mentioned in other posts (sorry for omiting initially): /dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota)
Would you be willing to increase your RAM based log/journal buffer space (mount logbufs=8 logbsize=256k ...).
Sure, that should be easy. 8 logbufs seems to be the default though... I just did that with a remount, but cannot check the effect at the moment. I'll report back later.

Pit
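For reference, a sketch of the remount Pit describes (mount point taken from earlier in the thread; logbufs/logbsize can apparently be changed on a remount, as he reports here):

  mount -o remount,logbufs=8,logbsize=256k /data/disk1
  mount | grep md0   # verify the new options took effect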
John Andersen wrote:
On 05/03/2017 01:59 PM, pit wrote:
/dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota) /dev/md1 on /data/disk2 type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=5120,noquota)
md1 is the HD one - that was added on the fly and is mounted with 'default'
noatime and relatime are not the same.
Manpage says 'similar to noatime'. If anything it should be slightly worse than the noatime, but md1 is the 'performant HDD RAID'...
You are forcing the raid to run in such a way that one disk is updated differently than the other, and has different content, not in the data but in the inodes and metadata.
Not sure if I understand what you say. You refer to the 'relatime' option? Or do you mean when comparing the two sets? I know they have different parameters, so comparing results from them is difficult...

Pit
Greg Freemyer wrote:
They allow non-subscribers to post and follow a reply-all rule so you can see all the replies.
Ah great! I hoped so, seeing it's on kernel.org, but the page did not state that explicitly...
Just send an email with your questions straight to: linux-xfs@vger.kernel.org
If you do that, I'm curious what the solution is, so please post it back to here as a solution.
Will do, definitely!

Cheers,
Pit
On 2017-05-03 21:18, John Andersen wrote:
On 05/03/2017 12:10 PM, Carlos E. R. wrote:
On 2017-05-03 20:42, John Andersen wrote:
On 05/03/2017 11:20 AM, Carlos E. R. wrote:
I don't use big iron myself, but an idea: what mount options do you use? In particular, I'm thinking about the absence of "noatime". It's absence makes each "read" of a file slower, having to write the time to disk.
Besides that, it seems that you should consider replacing with "lazytime". The access time is then written on bunches and when feasible, not "now". Less wear and faster, even on SSD.
There's no practical use scenario for any Atime recording any more, and there never really was. Nothing depends on it.
Obviously the kernel devs think otherwise. It is going to be the new default.
The old default was also the default, and it too was useless, and has been since forever. Either state a valid use case, or drop this petty "appeal to authority".
Your opinion is noted. -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
On 2017-05-03 23:48, pit wrote:
John Andersen wrote:
On 05/03/2017 01:59 PM, pit wrote:
/dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota) /dev/md1 on /data/disk2 type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=5120,noquota)
md1 is the HD one - that was added on the fly and is mounted with 'default'
noatime and relatime are not the same.
Manpage says 'similar to noatime'. If anything it should be slightly worse than the noatime, but md1 is the 'performant HDD RAID'...
You are forcing the raid to run in such a way that one disk is updated differently than the other, and has different content, not in the data but in the inodes and metadata.
Not sure if I understand what you say. You refer to the 'relatime' option? Or do you mean when comparing the two sets? I know they have different parameters, so comparing results from them is difficult...
It doesn't matter. The md1 array gets relatime because that is the current default, and at worst it would perform slower than noatime - which is not the case; md1 performs faster than md0.

--
Cheers / Saludos,
Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
Pit,

Changing from an internal to an external log is indeed a pain. And switching back is a bigger pain.

Can you try changing your I/O scheduler as a first test? I forgot about that before, and you seem to be using defaults for everything. You can do this on the fly.

To see what you're using:

  cat /sys/block/md0/queue/scheduler

(I think you said md0; if not, just look in /sys/block and get the right device.)

If you have [cfq], get rid of that via one of these commands:

  echo noop > /sys/block/md0/queue/scheduler
or
  echo deadline > /sys/block/md0/queue/scheduler

I think both noop and deadline will be fine, but CFQ is a horrible choice for XFS.

Greg
--
Greg Freemyer
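If the per-device scheduler change turns out to help, a commonly used way to make it persist across reboots is a udev rule along these lines (a sketch only, not something from this thread; the match patterns would need adapting to the actual device names):

  # /etc/udev/rules.d/60-ssd-scheduler.rules
  ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"

This sets deadline only for non-rotational (SSD) devices and leaves the HDDs on their default.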
Greg Freemyer wrote:
Pit,
Changing from an internal to external log is indeed a pain. And switching back is a bigger pain.
Like they say with bad-tasting medicine: If it helps.... :)
Can you try changing you i/o scheduler as a first test. I forgot about that before and you seem to be using defaults for everything.
Ah, now that you mention it: Been there, done^wtried that.

transport1:~ # cat /sys/block/md0/queue/scheduler
none
transport1:~ # echo noop > /sys/block/md0/queue/scheduler
transport1:~ # cat /sys/block/md0/queue/scheduler
none

No scheduler for mdraid. The disks themselves do have one though.

SSDs:
transport1:~ # cat /sys/block/sdg/queue/scheduler
noop [deadline] cfq

HDDs:
transport1:~ # cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

For them cfq is probably OK. I might try noop for the SSDs, but first want to check the effect the increase in logbsize had...

Pit
On Wed, May 3, 2017 at 7:02 PM, pit wrote:
HDDs: transport1:~ # cat /sys/block/sda/queue/scheduler noop deadline [cfq]
For them cfq is probably OK. I might try the noop for the SSDs, but first want to check the effect the increase in logbsize had...
Regardless of media type, cfq is a bad choice for xfs. It has to do with the parallel threading logic inside the xfs driver. It may not matter much with your big raid, but I don't know.

Greg
--
Greg Freemyer
On Wednesday 03 May 2017, pit wrote:
Hi Rudi,
Ruediger Meier wrote:
Raid5 is usually a bad choice for writing, especially with many small files, and especially with that many disks. You may google for "read5 write penalty".
Write speed by itself is fine, it can definitely write faster than our system can deliver data for it (bonnie++ puts it at 1.5GB/s, normal workload is 500-600MB/s).
Sequential is fast on raid5; there is no write penalty if you write a whole stripe. BTW, 1.5GB/s is nothing against what you should expect if you sum up 16x SSD speed. I guess SSD was a waste of money in your case.
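For a back-of-the-envelope sense of what Rudi is alluding to (assuming the roughly 500 MB/s sequential write rating typical of an 850 EVO; actual sustained figures vary):

  full stripe       = 15 data disks x 512 KiB chunk = 7.5 MiB
  raw write ceiling ~ 15 x 500 MB/s                 = ~7.5 GB/s

so the measured 1.5 GB/s is well below what the drives alone could stream, which points at the controllers, parity computation, or the filesystem rather than the SSDs themselves.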
The fact that you are using such expensive SSDs indicates that you want performance. Maybe Raid10 should be the better choice.
Mostly for storage space, plus some heat concerns with disks. But RAID10 would waste too much space...
Regarding the costs. You are using 16 fast, large and expensive (but still not enterprise!) 4T SSDs on one of the cheapest possible mainboards. I think even in theory your raid array can't be faster than a raid10 array of cheap rotating (but certified enterprise) disks.
Another thing hardware controllers may disable the write cache of your HDs by default.
Thanks for the hint - I'll investigate that.
Now I know that XFS is not the fastest for this operation, BUT: The computer has an 'emergency RAID set', in case we run out of space. It is a 6x6TB HDD RAID5, connected to the mainboards SATA ports. Also mdadm RAID with XFS. On this (in general performance much slower) 28TB RAID, the same dataset gets deleted in around 2 minutes.
AFAIR such benchmarks are only comparable if both file systems have the same size and content, or even better they are both empty, newly created. I remember that I could never reproduce my old measurements after the file system was in heavy use for some months.
Sure, but a factor of 20 difference is somewhat difficult to explain...
Very interesting. Are your SSDs officially supported by your controller? Professional controllers have usually a list of certified HD models and on the other hand they have usually more incompatibility issues than mainstream hardware. I would contact the vendor and ask about known issues.
I guess that mostly applies if you use the HW RAID of the cards - we only use them as 'SATA port multipliers'....
No, I've seen HDs which did not work at all on a particular controller and, even worse, HDs which worked unstably, regardless of RAID level. I've also had issues with enterprise controllers on consumer mainboards. In my cases I could solve the problems with firmware updates for the controller and the HDs. But this was luck. On the other hand, I've never had incompatibilities with any HD on cheap onboard controllers.

So I've learned my lesson: I don't mix enterprise and consumer hardware. If I were building such expensive storage as you have, I would only combine *certified* combinations of mainboard, controller, HDs and operating system.

cu,
Rudi
On 03/05/17 19:42, Lew Wolfgang wrote:
IIRC we tested with EXT4 once and found XFS to be just a bit faster. But that was years ago and memory is fading.
Not sure whether it was LWN or the linux-raid list ...

There's a new variant of ext4 in the works, called "lazy-ext". And it's specifically aimed at improving raid performance.

Basically, the problem is that even when streaming large data files, there's a fair bit of metadata updating going on in the background. And this triggers a lot of small, random writes. Guaranteed to make a raid controller have heartburn. So there's some tweak they're testing that consolidates all these small writes into one big stream to improve performance. I can't remember the figures, but they do look good.

Cheers,
Wol
On Thu, May 4, 2017 at 7:52 AM, Wols Lists wrote:
On 03/05/17 19:42, Lew Wolfgang wrote:
IIRC we tested with EXT4 once and found XFS to be just a bit faster. But that was years ago and memory is fading.
Not sure whether it was LWN or the linux-raid list ...
There's a new variant of ext4 in the works, called "lazy-ext". And it's specifically aimed at improving raid performance.
Basically, the problem is that even when streaming large data files, there's a fair bit of metadata updating going on in the background. And this triggers a lot of small, random writes. Guaranteed to make a raid controller have heartburn. So there's some tweak they're testing that consolidates all these small writes into one big stream to improve performance. I can't remember the figures, but they do look good.
I'm not familiar with the ext4 work, but that sounds similar to what xfs did about 5 years ago. xfs now accumulates more metadata changes in RAM before writing them to the journal, and leverages that additional buffer space to perform the equivalent of merges and elevator sorts. It had a significant impact on cutting down XFS's metadata overhead.

Greg
participants (9)
- Carlos E. R.
- Dave Howorth
- Greg Freemyer
- John Andersen
- Lew Wolfgang
- Peter Suetterlin
- pit
- Ruediger Meier
- Wols Lists