[opensuse] RAID/XFS performance question
Hi,

I think there are some big-data guys around here, so I thought I'd see if someone has a clue on this:

I'm running a large (55TB) RAID5 set for our data acquisition system. It's 16 4TB SSDs (Samsung 850 EVO) connected to two LSI MegaRAID SAS-3 3008 cards sitting in an Asus Z170-Deluxe mainboard. Disks are in JBOD mode, the RAID is formed via mdadm. Filesystem is XFS.

In general it is a very nice system, but there is one puzzling thing: some of our cameras generate data in single files (~700k/file) and collect the files in a single directory, at 36 files/s. So it ends up with a lot of files.

The problem arrives when the data are to be deleted: doing an rm -rf on a 700GB directory tree (several cameras, several runs, so the data is typically split into some 30-40 subdirectories) takes around 40 MINUTES.

Now I know that XFS is not the fastest for this operation, BUT: the computer has an 'emergency RAID set', in case we run out of space. It is a 6x6TB HDD RAID5, connected to the mainboard's SATA ports, also mdadm RAID with XFS. On this (in general much slower) 28TB RAID, the same dataset gets deleted in around 2 minutes.

I tried (on a different computer though) 'faking' a 16-disk RAID on 4 1TB SSDs with 4 partitions each (on MB SATA ports); that one also deleted 'fast' (1.5 min).

So the big question is: what is wrong with the SSD RAID? Is it the number of disks, the LSI card, the size of the volume? Did anyone see similar problems before? Any input is highly welcome :)

Here's some config info:

transport1:~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Apr 28 12:06:22 2017
     Raid Level : raid5
     Array Size : 58603292160 (55888.45 GiB 60009.77 GB)
  Used Dev Size : 3906886144 (3725.90 GiB 4000.65 GB)
   Raid Devices : 16
  Total Devices : 16
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
         Layout : left-symmetric
     Chunk Size : 512K

transport1:~ # xfs_info /dev/md0
meta-data=/dev/md0    isize=256    agcount=55, agsize=268435328 blks
         =            sectsz=512   attr=2, projid32bit=1
         =            crc=0        finobt=0 spinodes=0
data     =            bsize=4096   blocks=14650823040, imaxpct=1
         =            sunit=128    swidth=1920 blks
naming   =version 2   bsize=4096   ascii-ci=0 ftype=1
log      =internal    bsize=4096   blocks=521728, version=2
         =            sectsz=512   sunit=8 blks, lazy-count=1
realtime =none        extsz=4096   blocks=0, rtextents=0

--
Dr. Peter "Pit" Suetterlin          http://www.astro.su.se/~pit
Institute for Solar Physics         Tel.: +34 922 405 590 (Spain)
P.Suetterlin@royac.iac.es                 +46 8 5537 8559 (Sweden)
Peter.Suetterlin@astro.su.se
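As an aside, the sunit/swidth values above are consistent with the mdadm geometry: sunit=128 blocks of 4096 bytes is the 512K chunk, and swidth=1920 blocks is 15 data disks (16 minus one parity) times that chunk. A sketch of how mkfs.xfs could be given this geometry explicitly if the array were ever recreated (illustrative only, not the command actually used here):

  mkfs.xfs -d su=512k,sw=15 /dev/md0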
On Wed, 3 May 2017 15:39:28 +0100 Peter Suetterlin wrote:
I think there are some big-data guys around here, so I thought I look if someone has a clue on this
You may get lucky, but there are far more big-data guys around on the XFS mailing list, and what's more they have unparalleled experience of tweaking XFS configs. Don't forget to read their 'suggestions' for how to report problems.

HTH,
Dave
Hi Peter

On Wednesday 03 May 2017, Peter Suetterlin wrote:
Hi,
I think there are some big-data guys around here, so I thought I look if someone has a clue on this:
I'm running a large (55TB) RAID5 set for our data acquisition system. It's 16 4TB SSDs (Samsung 850 EVO) connected to two LSI MegaRAID SAS-3 3008 cards sitting in an Asus Z170-deluxe mainboard. Disks are in JBOD mode, RAID is formed via mdadm. Filesystem is XFS.
In general it is a very nice system, but there is one ununderstandable thing: some of our cameras generate data in single files (~700k/file), and collects the files in a single directory, at 36files/s. So it is ending up with a lot of files.
Raid5 is usually a bad choice for writing, especially with many small files, and especially with that many disks. You may google for "raid5 write penalty". The fact that you are using such expensive SSDs indicates that you want performance. Maybe Raid10 would be the better choice.

Regarding XFS, I switched from XFS to EXT4 many years ago because EXT4 was dozens of times faster for creating or deleting many small files. I've heard that XFS has been improved since then, but I've never tested it again. Your particular number "36 files/s" on XFS is something which sounds very familiar to me.

Another thing: hardware controllers may disable the write cache of your HDs by default.
The problem arrives when the data are to be deleted: Doing an rm -rf on a 700GB directory tree (several cameras, several runs, so the data is typically split in some 30-40 subdirectories) takes around 40 MINUTES.
Now I know that XFS is not the fastest for this operation, BUT: The computer has an 'emergency RAID set', in case we run out of space. It is a 6x6TB HDD RAID5, connected to the mainboards SATA ports. Also mdadm RAID with XFS. On this (in general performance much slower) 28TB RAID, the same dataset gets deleted in around 2 minutes.
AFAIR such benchmarks are only comparable if both file systems have the same size and content, or even better they are both empty, newly created. I remember that I could never reproduce my old measurements after the file system was in heavy use for some months.
I tried (on a different computer though) 'faking' a 16-disk RAID on 4 1TB SSDs with 4 partitions each (on MB SATA ports), that one also deleted 'fast' (1.5 min)
So the big question is what is wrong with the SSD RAID? Is it the number of disks, the LSI card, the size of the volume? Did anyone see similar problems before? Any input is highly welcome :)
Very interesting. Are your SSDs officially supported by your controller? Professional controllers usually have a list of certified HD models, and on the other hand they usually have more incompatibility issues than mainstream hardware. I would contact the vendor and ask about known issues.

cu,
Rudi
On Wednesday 03 May 2017, Peter Suetterlin wrote:
Hi,
I think there are some big-data guys around here, so I thought I look if someone has a clue on this:
I'm running a large (55TB) RAID5 set for our data acquisition system. It's 16 4TB SSDs (Samsung 850 EVO) connected to two LSI MegaRAID SAS-3 3008 cards sitting in an Asus Z170-deluxe mainboard.
BTW, one more thought. It looks like you've spent 20-25 thousand EUR for disks and controllers but you are using them on a cheap consumer mainboard. Maybe you should have invested 3-5 thousand EUR more to buy a whole server system designed and manufactured by a professional vendor.
On 2017-05-03 16:39, Peter Suetterlin wrote:
So the big question is what is wrong with the SSD RAID? Is it the number of disks, the LSI card, the size of the volume? Did anyone see similar problems before? Any input is highly welcome :)
I don't use big iron myself, but an idea: what mount options do you use? In particular, I'm thinking about the absence of "noatime". Its absence makes each "read" of a file slower, having to write the access time to disk.

Besides that, it seems that you should consider replacing it with "lazytime". The access time is then written in bunches and when feasible, not "now". Less wear and faster, even on SSD.

In any case, just post here the mount lines of all your mentioned arrays, so that people here can take those into consideration.

Then, I would point you as well to the XFS mailing list. They are very nice and helpful. The volume is sometimes higher because they also post PATCH mails. And the subject line lacks a list identifier.

--
Cheers / Saludos,
Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
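For illustration, a minimal sketch of what the two variants being discussed could look like as /etc/fstab entries; the device and mount point are taken from the mount output posted later in the thread, and whether the running kernel's XFS actually honours lazytime would need checking:

  # variant 1: drop access-time updates entirely
  /dev/md0  /data/disk1  xfs  rw,nodev,noatime,inode64   0 0
  # variant 2: keep timestamps but flush them lazily (if supported)
  /dev/md0  /data/disk1  xfs  rw,nodev,lazytime,inode64  0 0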
On 05/03/2017 11:20 AM, Carlos E. R. wrote:
I don't use big iron myself, but an idea: what mount options do you use? In particular, I'm thinking about the absence of "noatime". It's absence makes each "read" of a file slower, having to write the time to disk.
Besides that, it seems that you should consider replacing with "lazytime". The access time is then written on bunches and when feasible, not "now". Less wear and faster, even on SSD.
There's no practical use scenario for any atime recording any more, and there never really was. Nothing depends on it. It can't serve as any audit, because WHO accessed a file is not recorded. It can't serve as any caching hint, because backups change it and modern caching algorithms ignore it.

You should just turn it off completely with noatime and not replace it with ANYTHING. No point in slowing your disk drives, and certainly no point in wearing out your SSDs.

--
After all is said and done, more is said than done.
On 05/03/2017 07:39 AM, Peter Suetterlin wrote:
Hi,
I think there are some big-data guys around here, so I thought I look if someone has a clue on this:
I'm running a large (55TB) RAID5 set for our data acquisition system. It's 16 4TB SSDs (Samsung 850 EVO) connected to two LSI MegaRAID SAS-3 3008 cards sitting in an Asus Z170-deluxe mainboard. Disks are in JBOD mode, RAID is formed via mdadm. Filesystem is XFS.
In general it is a very nice system, but there is one ununderstandable thing: some of our cameras generate data in single files (~700k/file), and collects the files in a single directory, at 36files/s. So it is ending up with a lot of files.
The problem arrives when the data are to be deleted: Doing an rm -rf on a 700GB directory tree (several cameras, several runs, so the data is typically split in some 30-40 subdirectories) takes around 40 MINUTES.
Hi Peter,

FWIW we also have a requirement to write lots of data. We use systems with SuperMicro X10DRH-iT motherboards, AVAGO (LSI) MegaRAID SAS 9361-8i RAID controllers, and two RAID-6 arrays each consisting of eleven 6T Seagate ST6000NM0095 spinning drives configured with two dedicated hot-swap spares, in a 4U SuperMicro chassis. We also use a two-SSD RAID-1 mirror for the operating system, running from the same RAID controller.

We normally write thousands of 4-GB files and get about 1.6-GB/sec write rates, but I just set up a test writing 1-TB worth of 1-MB files and got a rate of about 1.5-GB/sec. I then sorted the files into nine directories and timed a "rm -r" on the lot and got 33.7 seconds.

From previous experience with mdraid we found it gives significantly less performance than using hardware RAID. The controllers we use support hardware RAID-6 directly. Also note that we don't use RAID-5 due to the single-drive-failure-during-rebuild issue. We can't afford to lose any data.

We haven't had a single problem with XFS. IIRC we tested with EXT4 once and found XFS to be just a bit faster. But that was years ago and memory is fading. I do remember a fatal problem with BTRFS though. It would crash, shred, and burn when writing more than 16-TB in a single partition. That also was years ago.

Is there a way for you to test hardware RAID?

Regards,
Lew
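As a rough sketch of how a comparable small-file write/delete test could be reproduced on the SSD array (file counts, sizes and paths here are illustrative, not Lew's exact procedure):

  # write ~1 TB as 1 MiB files spread over nine directories, then time the delete
  mkdir -p /data/test/dir{1..9}
  for d in /data/test/dir{1..9}; do
      for i in $(seq 1 116000); do
          dd if=/dev/zero of=$d/file$i bs=1M count=1 status=none
      done
  done
  time rm -r /data/test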
On 05/03/2017 07:39 AM, Peter Suetterlin wrote:
transport1:~ # xfs_info /dev/md0 meta-data=/dev/md0 isize=256 agcount=55, agsize=268435328 blks = sectsz=512 attr=2, projid32bit=1 = crc=0 finobt=0 spinodes=0 data = bsize=4096 blocks=14650823040, imaxpct=1 = sunit=128 swidth=1920 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal bsize=4096 blocks=521728, version=2 = sectsz=512 sunit=8 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0
It looks like you have atime turned off, but what about fstrim? How is this being handled? If it is done via the discard mount option, this can really slow things down. You might be better off handling that with a scheduled fstrim. But none of that is evident from what you have told us.

--
After all is said and done, more is said than done.
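A hedged sketch of the usual ways to schedule trimming instead of mounting with discard (the mount point is the one from this thread; a weekly fstrim timer ships with current util-linux, and Pit confirms further down that fstrim already runs weekly on his box):

  # systemd timer shipped with util-linux (weekly by default)
  systemctl enable fstrim.timer

  # or an explicit weekly cron job for just the big array
  # /etc/cron.weekly/fstrim-md0
  #!/bin/sh
  /usr/sbin/fstrim -v /data/disk1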
On 05/03/2017 11:42 AM, Lew Wolfgang wrote:
From previous experience with mdraid we found it gives significantly less performance than using hardware RAID.
My experience is the opposite, at least if your machine has adequate cable headers such that mdraid arrays each get a separate controller. The problem is that the underlying computer improves its speed with upgrades, but nobody upgrades the RAID controllers, and they never get any faster. I've actually had better performance by turning off the hardware RAID logic on a RAID controller and just using its drive ports as separate hardware ports for an mdraid setup.

--
After all is said and done, more is said than done.
On 2017-05-03 20:42, John Andersen wrote:
On 05/03/2017 11:20 AM, Carlos E. R. wrote:
I don't use big iron myself, but an idea: what mount options do you use? In particular, I'm thinking about the absence of "noatime". It's absence makes each "read" of a file slower, having to write the time to disk.
Besides that, it seems that you should consider replacing with "lazytime". The access time is then written on bunches and when feasible, not "now". Less wear and faster, even on SSD.
There's no practical use scenario for any Atime recording any more, and there never really was. Nothing depends on it.
Obviously the kernel devs think otherwise. It is going to be the new default. -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
On 05/03/2017 12:10 PM, Carlos E. R. wrote:
On 2017-05-03 20:42, John Andersen wrote:
On 05/03/2017 11:20 AM, Carlos E. R. wrote:
I don't use big iron myself, but an idea: what mount options do you use? In particular, I'm thinking about the absence of "noatime". It's absence makes each "read" of a file slower, having to write the time to disk.
Besides that, it seems that you should consider replacing with "lazytime". The access time is then written on bunches and when feasible, not "now". Less wear and faster, even on SSD.
There's no practical use scenario for any Atime recording any more, and there never really was. Nothing depends on it.
Obviously the kernel devs think otherwise. It is going to be the new default.
Being the old default doesn't make it useful; atime recording has been useless since forever. Either state a valid use case, or drop this petty "appeal to authority".

--
After all is said and done, more is said than done.
Peter,

Something is wrong. If this was 10 years ago, I would say XFS is really slow at metadata handling. But that hasn't been true for years. How old a kernel are you running?

RE: your on-disk log/journal

Delete speed is very much affected by your log/journal optimization. Looks like you have 2GB for the log, so that seems reasonable as long as you are only journalling metadata (and no data). But it is internal, which is bad for performance.

Do you have a different I/O path where you could put an external log? If so, that might free up bandwidth going to the LSI. The log gets hit really hard during heavy deletion activity, so if I were you I'd invest in an NVMe PCI Express adapter card ($25) and an NVMe SSD (under $100 for one way bigger than you need just for the external log).

RE: other than your on-disk log/journal

What mount options are you using ("mount | grep md0")?

Would you be willing to increase your RAM-based log/journal buffer space (mount -o logbufs=8,logbsize=256k ...)? XFS uses the RAM log buffers to stage and sort journal updates before they get sent to the on-disk log/journal. An entire buffer is written to the on-disk log as an atomic action, so the bigger the log buffers, the more efficient. The values listed are the maximum, and that is only 2 MB of RAM for the log/journal staging area. If that is too big, keep the large logbsize and use fewer logbufs.

Greg
--
Greg Freemyer
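For concreteness, a hedged sketch of what an external log would look like; the NVMe partition name is purely illustrative, and as comes up later in the thread this is a mkfs-time decision, so it would mean recreating the filesystem:

  # create the filesystem with its log on a separate NVMe partition
  mkfs.xfs -l logdev=/dev/nvme0n1p3,size=521728b -d su=512k,sw=15 /dev/md0
  # the log device must then also be named at mount time
  mount -o logdev=/dev/nvme0n1p3,noatime,inode64 /dev/md0 /data/disk1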
Hi Rudi,

Ruediger Meier wrote:
Raid5 is usually a bad choice for writing, especially with many small files, and especially with that many disks. You may google for "read5 write penalty".
Write speed by itself is fine, it can definitely write faster than our system can deliver data for it (bonnie++ puts it at 1.5GB/s, normal workload is 500-600MB/s).
The fact that you are using such expensive SSDs indicates that you want performance. Maybe Raid10 should be the better choice.
Mostly for storage space, plus some heat concerns with disks. But RAID10 would waste too much space...
Another thing hardware controllers may disable the write cache of your HDs by default.
Thanks for the hint - I'll investigate that.
Now I know that XFS is not the fastest for this operation, BUT: The computer has an 'emergency RAID set', in case we run out of space. It is a 6x6TB HDD RAID5, connected to the mainboards SATA ports. Also mdadm RAID with XFS. On this (in general performance much slower) 28TB RAID, the same dataset gets deleted in around 2 minutes.
AFAIR such benchmarks are only comparable if both file systems have the same size and content, or even better they are both empty, newly created. I remember that I could never reproduce my old measurements after the file system was in heavy use for some months.
Sure, but a factor of 20 difference is somewhat difficult to explain...
Very interesting. Are your SSDs officially supported by your controller? Professional controllers have usually a list of certified HD models and on the other hand they have usually more incompatibility issues than mainstream hardware. I would contact the vendor and ask about known issues.
I guess that mostly applies if you use the HW RAID of the cards - we only use them as 'SATA port multipliers'....

Cheers,
Pit
Carlos E. R. wrote:
On 2017-05-03 16:39, Peter Suetterlin wrote:
So the big question is what is wrong with the SSD RAID? Is it the number of disks, the LSI card, the size of the volume? Did anyone see similar problems before? Any input is highly welcome :)
I don't use big iron myself, but an idea: what mount options do you use?
Yuck, there's always some info you forget to add.
In particular, I'm thinking about the absence of "noatime". It's absence makes each "read" of a file slower, having to write the time to disk.
But noatime is explicitly specified, yes. And even if not, the default would be relatime which, following the manpage, should be almost the same...
In any case, just post here the mount lines of all your mentioned arrays, so that people here can take those into consideration.
transport1:/data/disk2/ISP # mount|egrep md[01]
/dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota)
/dev/md1 on /data/disk2 type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=5120,noquota)

md1 is the HD one - that was added on the fly and is mounted with 'default'
Then, I would point you as well to the XFS mail list. They are very nice and helpful. The volume is sometimes higher because they post also PATCH mails. And the subject line lacks a list identifier.
Yeah, I wasn't sure (and still am not) whether this is (only) an XFS issue. And I try to keep the number of subscribed lists low :o

Pit
On Wed, May 3, 2017 at 4:59 PM, pit wrote:
Then, I would point you as well to the XFS mail list. They are very nice and helpful. The volume is sometimes higher because they post also PATCH mails. And the subject line lacks a list identifier.
Yeah, I wasn't sure (and still not am) wether this is (only) an XFS issue. And try to keep the number of subscribed lists low :o
They allow non-subscribers to post and follow a reply-all rule, so you can see all the replies. Just send an email with your questions straight to: linux-xfs@vger.kernel.org

If you do that, I'm curious what the solution is, so please post it back here as a solution.

Thanks
Greg
--
Greg Freemyer
On 05/03/2017 01:59 PM, pit wrote:
/dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota)
/dev/md1 on /data/disk2 type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=5120,noquota)
md1 is the HD one - that was added on the fly and is mounted with 'default'
noatime and relatime are not the same. You are forcing the raid to run in such a way that one disk is updated differently than the other, and has different content, not in the data but in the inodes and metadata.

--
After all is said and done, more is said than done.
Hi Lew,

thanks a lot for this quick crosscheck!

Lew Wolfgang wrote:
FWIW we also have a requirement to write lots of data. We use systems with SuperMicro X10DRH-iT motherboards, AVAGO (LSI) MegaRAID SAS 9361-8i RAID controllers, and two RAID-6 arrays consisting of eleven-each 6T Seagate ST6000NM0095 spinning drives configured with two dedicated hot-swap spares, in 4U SuperMicro chassis. We also use a two-SSD RAID-1 mirror for the operating system, running from the same RAID controller.
We normally write thousands of 4-GB files and get about 1.6-GB/sec write rates, but I just set up a test writing 1-TB worth of 1-MB files and got a rate of about 1.5-GB/sec.
1.5GB/s is what bonnie++ reports for my set, too. Might be overoptimistic, but it definitely writes 500-600MB/s over hours.
I then sorted the files into nine directories and timed a "rm -r" on the lot and got 33.7-seconds.
Yes, that's about where I would like to end up. Even the 2min I got on the HDD RAID would be OK...
Is there a way for you to test hardware RAID?
Not easily. The machine is (almost) permanently loaded with data, so I have to wait for good moments if I want to change the configuration. Apart from that, we (well, it was before my time at the institute) got bitten by a HW RAID failure (broken card) where data was inaccessible using 'normal' methods. I'd have to convince my colleagues :)

Pit
John Andersen wrote:
On 05/03/2017 07:39 AM, Peter Suetterlin wrote:
transport1:~ # xfs_info /dev/md0 meta-data=/dev/md0 isize=256 agcount=55, agsize=268435328 blks = sectsz=512 attr=2, projid32bit=1 = crc=0 finobt=0 spinodes=0 data = bsize=4096 blocks=14650823040, imaxpct=1 = sunit=128 swidth=1920 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=1 log =internal bsize=4096 blocks=521728, version=2 = sectsz=512 sunit=8 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0
It looks like you have atime turned off, but what about fstrim? How is this being handled? If it is done via the discard option, this can really slow things down. You might be better off handling that with a scheduled fstrim.
Yes, it's mounted with noatime. No discard.

/dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota)

fstrim is run via a cronjob/systemd weekly, but the current FS/RAID had just been created (Apr 28).
But none of that is evident by what you have told us.
My bad :( I promise to improve...

Pit
Hi Greg,

Greg Freemyer wrote:
Peter,
Something is wrong.
I agree...
If this was 10 years ago, I would say XFS is really slow at metadata handling.
But that hasn't been true for years. How old of a kernel are you running?
transport1:~ # uname -a
Linux transport1 4.10.4-1-default #1 SMP PREEMPT Sat Mar 18 12:29:57 UTC 2017 (e2ef894) x86_64 x86_64 x86_64 GNU/Linux

It's a Leap 42.2 machine, but with a TW kernel.
RE: Your on-disk log/journal
Delete speed is very much affected by your log/journal optimization.
That is good to know!
Looks like you have 2GB for the log, so that seems reasonable as long as you are only journalling metadata (and no data).
I assume so, unless the default would be to journal data (?)
But it is internal, which is bad for performance.
Do you have a different i/o path where you could put an external log? If so, that might free up bandwidth going to the LSI.
Not really. I have 6+2x8 SATA ports, and all are used.
The log gets hit really hard during heavy deletion activity, so if I were you I'd invest in a NVME PCI express card ($25) and a NVME SSD (under $100 for one way bigger than you need just for the external log).
I guess I can save the money, the system disk is already such an NVMe. I'd have to shrink some partition, but maybe for a quick test I could use the swap partition (32GB)? But that is a mkfs-time option, isn't it? So I cannot switch to an external log without losing the data on the disks?
RE: other than your on disk log/journal
What mount options are you using "mount | grep md0"
As mentioned in other posts (sorry for omiting initially): /dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota)
Would you be willing to increase your RAM based log/journal buffer space (mount logbufs=8 logbsize=256k ...).
Sure, that should be easy. 8 logbufs seems to be the default though... I just did that with a remount, but cannot check the effect at the moment. I'll report back later.

Pit
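For reference, a sketch of the remount Pit describes (mount point taken from earlier in the thread; logbufs/logbsize can apparently be changed on a remount, as he reports here):

  mount -o remount,logbufs=8,logbsize=256k /data/disk1
  mount | grep md0   # verify the new options took effect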
John Andersen wrote:
On 05/03/2017 01:59 PM, pit wrote:
/dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota) /dev/md1 on /data/disk2 type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=5120,noquota)
md1 is the HD one - that was added on the fly and is mounted with 'default'
noatime and relatime are not the same.
Manpage says 'similar to noatime'. If anything it should be slightly worse than the noatime, but md1 is the 'performant HDD RAID'...
You are forcing the raid to run in such a way that one disk is updated differently than the other, and has different content, not in the data but in the inodes and metadata.
Not sure if I understand what you say. You refer to the 'relatime' option? Or do you mean when comparing the two sets? I know they have different parameters, so comparing results from them is difficult...

Pit
Greg Freemyer wrote:
They allow non-subscribers to post and follow a reply-all rule so you can see all the replies.
Ah great! I hoped so, seeing it's on kernel.org, but the page did not state that explicitly...
Just send an email with your questions straight to: linux-xfs@vger.kernel.org
If you do that, I'm curious what the solution is, so please post it back to here as a solution.
Will do, definitely!

Cheers,
Pit
On 2017-05-03 21:18, John Andersen wrote:
On 05/03/2017 12:10 PM, Carlos E. R. wrote:
On 2017-05-03 20:42, John Andersen wrote:
On 05/03/2017 11:20 AM, Carlos E. R. wrote:
I don't use big iron myself, but an idea: what mount options do you use? In particular, I'm thinking about the absence of "noatime". It's absence makes each "read" of a file slower, having to write the time to disk.
Besides that, it seems that you should consider replacing with "lazytime". The access time is then written on bunches and when feasible, not "now". Less wear and faster, even on SSD.
There's no practical use scenario for any Atime recording any more, and there never really was. Nothing depends on it.
Obviously the kernel devs think otherwise. It is going to be the new default.
The old default was also the default, and it too was useless, and has been since forever. Either state a valid use case, or drop this petty "appeal to authority".
Your opinion is noted. -- Cheers / Saludos, Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
On 2017-05-03 23:48, pit wrote:
John Andersen wrote:
On 05/03/2017 01:59 PM, pit wrote:
/dev/md0 on /data/disk1 type xfs (rw,nodev,noatime,attr2,inode64,sunit=1024,swidth=15360,noquota) /dev/md1 on /data/disk2 type xfs (rw,relatime,attr2,inode64,sunit=1024,swidth=5120,noquota)
md1 is the HD one - that was added on the fly and is mounted with 'default'
noatime and relatime are not the same.
Manpage says 'similar to noatime'. If anything it should be slightly worse than the noatime, but md1 is the 'performant HDD RAID'...
You are forcing the raid to run in such a way that one disk is updated differently than the other, and has different content, not in the data but in the inodes and metadata.
Not sure if I understand what you say. You refer to the 'relatime' option? Or do you mean when comparing the two sets? I know they have different parameters, so comparing results from them is difficult...
It doesn't matter. The md1 array gets relatime because that is the current default, and at worst it would perform slower than noatime - which is not the case; md1 performs faster than md0.

--
Cheers / Saludos,
Carlos E. R. (from 42.2 x86_64 "Malachite" at Telcontar)
Pit,

Changing from an internal to an external log is indeed a pain. And switching back is a bigger pain.

Can you try changing your I/O scheduler as a first test? I forgot about that before, and you seem to be using defaults for everything. You can do this on the fly.

To see what you're using:

  cat /sys/block/md0/queue/scheduler

(I think you said md0; if not, just look in /sys/block and get the right device.)

If you have [cfq], get rid of that via one of these commands:

  echo noop > /sys/block/md0/queue/scheduler
or
  echo deadline > /sys/block/md0/queue/scheduler

I think both noop and deadline will be fine, but CFQ is a horrible choice for XFS.

Greg
--
Greg Freemyer
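If the per-device scheduler change turns out to help, a commonly used way to make it persist across reboots is a udev rule along these lines (a sketch only, not something from this thread; the match patterns would need adapting to the actual device names):

  # /etc/udev/rules.d/60-ssd-scheduler.rules
  ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="deadline"

This sets deadline only for non-rotational (SSD) devices and leaves the HDDs on their default.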
Greg Freemyer wrote:
Pit,
Changing from an internal to external log is indeed a pain. And switching back is a bigger pain.
Like they say with bad-tasting medicine: If it helps.... :)
Can you try changing you i/o scheduler as a first test. I forgot about that before and you seem to be using defaults for everything.
Ah, now that you mention it: Been there, done^wtried that.

transport1:~ # cat /sys/block/md0/queue/scheduler
none
transport1:~ # echo noop > /sys/block/md0/queue/scheduler
transport1:~ # cat /sys/block/md0/queue/scheduler
none

No scheduler for mdraid. The disks themselves do have one though.

SSDs:
transport1:~ # cat /sys/block/sdg/queue/scheduler
noop [deadline] cfq

HDDs:
transport1:~ # cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

For them cfq is probably OK. I might try noop for the SSDs, but first want to check the effect the increase in logbsize had...

Pit
On Wed, May 3, 2017 at 7:02 PM, pit wrote:
HDDs: transport1:~ # cat /sys/block/sda/queue/scheduler noop deadline [cfq]
For them cfq is probably OK. I might try the noop for the SSDs, but first want to check the effect the increase in logbsize had...
Regardless of media type, cfq is a bad choice for xfs. It has to do with the parallel threading logic inside the xfs driver. It may not matter much with your big raid, but I don't know.

Greg
--
Greg Freemyer
On Wednesday 03 May 2017, pit wrote:
Hi Rudi,
Ruediger Meier wrote:
Raid5 is usually a bad choice for writing, especially with many small files, and especially with that many disks. You may google for "read5 write penalty".
Write speed by itself is fine, it can definitely write faster than our system can deliver data for it (bonnie++ puts it at 1.5GB/s, normal workload is 500-600MB/s).
Sequential is fast on raid5; there is no write penalty if you write a whole stripe. BTW, 1.5GB/s is nothing against what you should expect if you sum up 16x SSD speed. I guess SSD was a waste of money in your case.
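For a back-of-the-envelope sense of what Rudi is alluding to (assuming the roughly 500 MB/s sequential write rating typical of an 850 EVO; actual sustained figures vary):

  full stripe       = 15 data disks x 512 KiB chunk = 7.5 MiB
  raw write ceiling ~ 15 x 500 MB/s                 = ~7.5 GB/s

so the measured 1.5 GB/s is well below what the drives alone could stream, which points at the controllers, parity computation, or the filesystem rather than the SSDs themselves.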
The fact that you are using such expensive SSDs indicates that you want performance. Maybe Raid10 should be the better choice.
Mostly for storage space, plus some heat concerns with disks. But RAID10 would waste too much space...
Regarding the costs. You are using 16 fast, large and expensive (but still not enterprise!) 4T SSDs on one of the cheapest possible mainboards. I think even in theory your raid array can't be faster than a raid10 array of cheap rotating (but certified enterprise) disks.
Another thing hardware controllers may disable the write cache of your HDs by default.
Thanks for the hint - I'll investigate that.
Now I know that XFS is not the fastest for this operation, BUT: The computer has an 'emergency RAID set', in case we run out of space. It is a 6x6TB HDD RAID5, connected to the mainboards SATA ports. Also mdadm RAID with XFS. On this (in general performance much slower) 28TB RAID, the same dataset gets deleted in around 2 minutes.
AFAIR such benchmarks are only comparable if both file systems have the same size and content, or even better they are both empty, newly created. I remember that I could never reproduce my old measurements after the file system was in heavy use for some months.
Sure, but a factor of 20 difference is somewhat difficult to explain...
Very interesting. Are your SSDs officially supported by your controller? Professional controllers have usually a list of certified HD models and on the other hand they have usually more incompatibility issues than mainstream hardware. I would contact the vendor and ask about known issues.
I guess that mostly applies if you use the HW RAID of the cards - we only use them as 'SATA port multipliers'....
No, I've seen HDs which did not work at all on a particular controller and, even worse, HDs which worked unstably, regardless of RAID level. I've also had issues with enterprise controllers on consumer mainboards. In my cases I could solve the problems with firmware updates for the controller and the HDs. But this was luck. On the other hand, I've never had incompatibilities with any HD on cheap onboard controllers.

So I've learned my lesson: I don't mix enterprise and consumer hardware. If I were building such expensive storage as you have, I would only combine *certified* combinations of mainboard, controller, HDs and operating system.

cu,
Rudi
On 03/05/17 19:42, Lew Wolfgang wrote:
IIRC we tested with EXT4 once and found XFS to be just a bit faster. But that was years ago and memory is fading.
Not sure whether it was LWN or the linux-raid list ...

There's a new variant of ext4 in the works, called "lazy-ext". And it's specifically aimed at improving raid performance.

Basically, the problem is that even when streaming large data files, there's a fair bit of metadata updating going on in the background. And this triggers a lot of small, random writes. Guaranteed to make a raid controller have heartburn. So there's some tweak they're testing that consolidates all these small writes into one big stream to improve performance. I can't remember the figures, but they do look good.

Cheers,
Wol
On Thu, May 4, 2017 at 7:52 AM, Wols Lists wrote:
On 03/05/17 19:42, Lew Wolfgang wrote:
IIRC we tested with EXT4 once and found XFS to be just a bit faster. But that was years ago and memory is fading.
Not sure whether it was LWN or the linux-raid list ...
There's a new variant of ext4 in the works, called "lazy-ext". And it's specifically aimed at improving raid performance.
Basically, the problem is that even when streaming large data files, there's a fair bit of metadata updating going on in the background. And this triggers a lot of small, random writes. Guaranteed to make a raid controller have heartburn. So there's some tweak they're testing that consolidates all these small writes into one big stream to improve performance. I can't remember the figures, but they do look good.
I'm not familiar with the ext4 work, but that sounds similar to what xfs did about 5 years ago. xfs now accumulates more metadata changes in RAM before writing them to the journal, and leverages that additional buffer space to perform the equivalent of merges and elevator sorts. It had a significant impact on cutting down XFS's metadata overhead.

Greg
participants (9)
- Carlos E. R.
- Dave Howorth
- Greg Freemyer
- John Andersen
- Lew Wolfgang
- Peter Suetterlin
- pit
- Ruediger Meier
- Wols Lists