Hi Peter,

On Wednesday 03 May 2017, Peter Suetterlin wrote:
Hi,
I think there are some big-data guys around here, so I thought I'd see if someone has a clue about this:
I'm running a large (55TB) RAID5 set for our data acquisition system. It's 16 4TB SSDs (Samsung 850 EVO) connected to two LSI MegaRAID SAS-3 3008 cards sitting in an Asus Z170-deluxe mainboard. Disks are in JBOD mode, RAID is formed via mdadm. Filesystem is XFS.
In general it is a very nice system, but there is one inexplicable thing: some of our cameras generate data as single files (~700 kB/file) and collect them in a single directory, at 36 files/s. So we end up with a lot of files.
RAID5 is usually a bad choice for writing, especially with many small files, and especially with that many disks. You may google for "RAID5 write penalty". The fact that you are using such expensive SSDs indicates that you want performance; maybe RAID10 would be the better choice. Regarding XFS: I switched from XFS to EXT4 many years ago because EXT4 was dozens of times faster at creating and deleting many small files. I've heard that XFS has been improved since then, but I've never tested it again. Your particular number, 36 files/s on XFS, sounds very familiar to me. Another thing: hardware controllers may disable the write cache of your disks by default.
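The write-penalty arithmetic behind that advice can be sketched as follows (a rough model only: it assumes sub-stripe random writes and ignores caching, full-stripe writes, and metadata traffic):

```python
# Rough model of backend I/O cost per small (sub-stripe) random write.

def raid5_small_write_ios(writes):
    # RAID5 read-modify-write: read old data + read old parity,
    # then write new data + write new parity = 4 I/Os per logical write.
    return writes * 4

def raid10_small_write_ios(writes):
    # RAID10 simply mirrors each write to two disks = 2 I/Os per write.
    return writes * 2

# At the quoted rate of 36 files/s:
print(raid5_small_write_ios(36))   # 144 backend I/Os per second
print(raid10_small_write_ios(36))  # 72 backend I/Os per second
```

So even before filesystem overhead, RAID5 costs twice the backend I/O of RAID10 for this kind of workload.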
The problem comes when the data are to be deleted: doing an rm -rf on a 700 GB directory tree (several cameras, several runs, so the data is typically split across some 30-40 subdirectories) takes around 40 MINUTES.
Now I know that XFS is not the fastest for this operation, BUT: the computer has an 'emergency RAID set' in case we run out of space. It is a 6x6TB HDD RAID5, connected to the mainboard's SATA ports, also an mdadm RAID with XFS. On this (in general much slower) 28TB RAID, the same dataset gets deleted in around 2 minutes.
AFAIR such benchmarks are only comparable if both file systems have the same size and content, or better yet if both are empty and newly created. I remember that I could never reproduce my old measurements after the file system had been in heavy use for some months.
I tried (on a different computer, though) 'faking' a 16-disk RAID using four 1 TB SSDs with four partitions each (on mainboard SATA ports); that one also deleted 'fast' (1.5 min).
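For anyone who wants to reproduce the comparison, a minimal sketch of the workload is below. The file count and size are scaled far down from the real dataset (tens of thousands of ~700 kB files) so it runs quickly anywhere; scale `n_files` and `size` up to approximate the acquisition pattern:

```python
import os
import shutil
import tempfile
import time

def time_small_file_churn(n_files=1000, size=1024):
    """Create n_files small files in one directory, then time their removal
    (the Python equivalent of rm -rf on that directory)."""
    root = tempfile.mkdtemp(prefix="rmbench-")
    payload = b"x" * size
    for i in range(n_files):
        with open(os.path.join(root, f"f{i:06d}"), "wb") as fh:
            fh.write(payload)
    t0 = time.monotonic()
    shutil.rmtree(root)
    return time.monotonic() - t0

elapsed = time_small_file_churn()
print(f"deleted 1000 files in {elapsed:.3f}s")
```

Running this on both arrays (with the directory placed on the filesystem under test) would give directly comparable numbers.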
So the big question is what is wrong with the SSD RAID? Is it the number of disks, the LSI card, the size of the volume? Did anyone see similar problems before? Any input is highly welcome :)
Very interesting. Are your SSDs officially supported by your controller? Professional controllers usually have a list of certified disk models, and on the other hand they usually have more incompatibility issues than mainstream hardware. I would contact the vendor and ask about known issues. cu, Rudi
Here's some config info:
transport1:~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Apr 28 12:06:22 2017
     Raid Level : raid5
     Array Size : 58603292160 (55888.45 GiB 60009.77 GB)
  Used Dev Size : 3906886144 (3725.90 GiB 4000.65 GB)
   Raid Devices : 16
  Total Devices : 16
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

         Layout : left-symmetric
     Chunk Size : 512K
transport1:~ # xfs_info /dev/md0
meta-data=/dev/md0               isize=256    agcount=55, agsize=268435328 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=14650823040, imaxpct=1
         =                       sunit=128    swidth=1920 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
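For what it's worth, the XFS stripe geometry above does line up with the mdadm parameters; the arithmetic (using only the values quoted above) is:

```python
chunk_kib = 512          # mdadm chunk size (512K)
block_kib = 4            # XFS bsize=4096
data_disks = 16 - 1      # 16-device RAID5: one disk's worth of parity

sunit = chunk_kib // block_kib   # stripe unit in filesystem blocks
swidth = sunit * data_disks      # full stripe width in filesystem blocks
print(sunit, swidth)             # 128 1920, matching the xfs_info output
```

So the alignment looks correct, which makes misaligned stripe writes an unlikely explanation for the slow deletes.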
-- Dr. Peter "Pit" Suetterlin http://www.astro.su.se/~pit Institute for Solar Physics Tel.: +34 922 405 590 (Spain) P.Suetterlin@royac.iac.es +46 8 5537 8559 (Sweden) Peter.Suetterlin@astro.su.se
-- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org