Hi Peter,

On Wednesday 03 May 2017, Peter Suetterlin wrote:
Hi,
I think there are some big-data guys around here, so I thought I'd see if someone has a clue about this:
I'm running a large (55TB) RAID5 set for our data acquisition system. It's 16 4TB SSDs (Samsung 850 EVO) connected to two LSI MegaRAID SAS-3 3008 cards sitting in an Asus Z170-deluxe mainboard. Disks are in JBOD mode, RAID is formed via mdadm. Filesystem is XFS.
In general it is a very nice system, but there is one inexplicable thing: some of our cameras generate data as single files (~700 kB/file) and collect them in a single directory, at 36 files/s. So we end up with a lot of files.
RAID5 is usually a bad choice for writing, especially with many small files, and especially with that many disks. You may google for "RAID5 write penalty". The fact that you are using such expensive SSDs indicates that you want performance; maybe RAID10 would be the better choice. Regarding XFS: I switched from XFS to EXT4 many years ago because EXT4 was dozens of times faster at creating and deleting many small files. I've heard that XFS has been improved since then, but I've never tested it again. Your particular number, 36 files/s on XFS, sounds very familiar to me. Another thing: hardware controllers may disable the write cache of your disks by default.
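The write-penalty arithmetic behind that advice can be sketched as follows (a rough model only: it assumes sub-stripe random writes and ignores caching, full-stripe writes, and metadata traffic):

```python
# Rough model of backend I/O cost per small (sub-stripe) random write.

def raid5_small_write_ios(writes):
    # RAID5 read-modify-write: read old data + read old parity,
    # then write new data + write new parity = 4 I/Os per logical write.
    return writes * 4

def raid10_small_write_ios(writes):
    # RAID10 simply mirrors each write to two disks = 2 I/Os per write.
    return writes * 2

# At the quoted rate of 36 files/s:
print(raid5_small_write_ios(36))   # 144 backend I/Os per second
print(raid10_small_write_ios(36))  # 72 backend I/Os per second
```

So even before filesystem overhead, RAID5 costs twice the backend I/O of RAID10 for this kind of workload.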
The problem comes when the data are to be deleted: doing an rm -rf on a 700 GB directory tree (several cameras, several runs, so the data is typically split across some 30-40 subdirectories) takes around 40 MINUTES.
Now I know that XFS is not the fastest for this operation, BUT: the computer has an 'emergency RAID set' in case we run out of space. It is a 6x6TB HDD RAID5, connected to the mainboard's SATA ports, also an mdadm RAID with XFS. On this (in general much slower) 28TB RAID, the same dataset gets deleted in around 2 minutes.
AFAIR such benchmarks are only comparable if both file systems have the same size and content, or better yet if both are empty and newly created. I remember that I could never reproduce my old measurements after the file system had been in heavy use for some months.
I tried (on a different computer, though) 'faking' a 16-disk RAID using four 1 TB SSDs with four partitions each (on mainboard SATA ports); that one also deleted 'fast' (1.5 min).
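For anyone who wants to reproduce the comparison, a minimal sketch of the workload is below. The file count and size are scaled far down from the real dataset (tens of thousands of ~700 kB files) so it runs quickly anywhere; scale `n_files` and `size` up to approximate the acquisition pattern:

```python
import os
import shutil
import tempfile
import time

def time_small_file_churn(n_files=1000, size=1024):
    """Create n_files small files in one directory, then time their removal
    (the Python equivalent of rm -rf on that directory)."""
    root = tempfile.mkdtemp(prefix="rmbench-")
    payload = b"x" * size
    for i in range(n_files):
        with open(os.path.join(root, f"f{i:06d}"), "wb") as fh:
            fh.write(payload)
    t0 = time.monotonic()
    shutil.rmtree(root)
    return time.monotonic() - t0

elapsed = time_small_file_churn()
print(f"deleted 1000 files in {elapsed:.3f}s")
```

Running this on both arrays (with the directory placed on the filesystem under test) would give directly comparable numbers.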
So the big question is what is wrong with the SSD RAID? Is it the number of disks, the LSI card, the size of the volume? Did anyone see similar problems before? Any input is highly welcome :)
Very interesting. Are your SSDs officially supported by your controller? Professional controllers usually have a list of certified disk models, and on the other hand they usually have more incompatibility issues than mainstream hardware. I would contact the vendor and ask about known issues. cu, Rudi
Here's some config info:
transport1:~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Apr 28 12:06:22 2017
     Raid Level : raid5
     Array Size : 58603292160 (55888.45 GiB 60009.77 GB)
  Used Dev Size : 3906886144 (3725.90 GiB 4000.65 GB)
   Raid Devices : 16
  Total Devices : 16
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

         Layout : left-symmetric
     Chunk Size : 512K
transport1:~ # xfs_info /dev/md0
meta-data=/dev/md0               isize=256    agcount=55, agsize=268435328 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=14650823040, imaxpct=1
         =                       sunit=128    swidth=1920 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
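For what it's worth, the XFS stripe geometry above does line up with the mdadm parameters; the arithmetic (using only the values quoted above) is:

```python
chunk_kib = 512          # mdadm chunk size (512K)
block_kib = 4            # XFS bsize=4096
data_disks = 16 - 1      # 16-device RAID5: one disk's worth of parity

sunit = chunk_kib // block_kib   # stripe unit in filesystem blocks
swidth = sunit * data_disks      # full stripe width in filesystem blocks
print(sunit, swidth)             # 128 1920, matching the xfs_info output
```

So the alignment looks correct, which makes misaligned stripe writes an unlikely explanation for the slow deletes.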
-- Dr. Peter "Pit" Suetterlin http://www.astro.su.se/~pit Institute for Solar Physics Tel.: +34 922 405 590 (Spain) P.Suetterlin@royac.iac.es +46 8 5537 8559 (Sweden) Peter.Suetterlin@astro.su.se
-- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse+owner@opensuse.org