[Bug 880599] New: nfsv3 + ext3 + quotas = error -5 when attempting revoke

29 May 2014

      https://bugzilla.novell.com/show_bug.cgi?id=880599

https://bugzilla.novell.com/show_bug.cgi?id=880599#c0

           Summary: nfsv3 + ext3 + quotas = error -5 when attempting
                    revoke
    Classification: openSUSE
           Product: openSUSE 13.1
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 13.1
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Network
        AssignedTo: bnc-team-screening@forge.provo.novell.com
        ReportedBy: jimc@math.ucla.edu
         QAContact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Versions: 
kernel-desktop-3.11.10-11.1.x86_64, also 3.11.6, also kernel-default
nfs-client-1.2.8-4.13.1.x86_64
nfs-kernel-server-1.2.8-4.13.1.x86_64
quota-4.01-2.1.2.x86_64
quota-nfs-4.01-2.1.2.x86_64

The problem occurs when all of these elements are present:
1.  NFS server exports an ext3 (not ext4) filesystem.
2.  User block quotas are turned on for this filesystem.
3.  Client mounts it with Nfsvers=3 (not 4) in /etc/nfsmount.conf
4.  The user, executing on the client, rapidly writes a large amount of
    data to a file and goes over quota.  In production, some X-Windows
    app spews error messages and fills up ~/.xsession-errors.  In the
    test scenario I run dd if=/dev/zero of=oinkage.dat bs=1M .

The correct outcome is that the writer should get EDQUOT (quota exceeded)
and die.  Two actual outcomes are seen.  In the test scenario given
below, the writer gets EIO (I/O error).  In the more complex production
environment, the user may not receive any error at all: in one case,
~/.xsession-errors had grown to 291 GB (gigabytes) of garbage.  In
production the filesystem eventually remounts itself read-only and the
homedir server has to be rebooted.  In no case was any corruption ever
found in the filesystem.
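
The three outcomes above can be told apart from the message dd prints when
a write fails.  The following is a minimal sketch, assuming dd runs in the C
locale so that the standard glibc error strings appear; the function name
classify_write_error is hypothetical and not part of any tool.

```shell
#!/bin/sh
# Map the error text printed by a failed write to the errno name used
# in this report.  Assumes C-locale (glibc) error strings.
classify_write_error() {
    case "$1" in
        *"Disk quota exceeded"*)   echo "EDQUOT" ;;  # the correct outcome
        *"Input/output error"*)    echo "EIO" ;;     # seen in the test scenario
        *"Read-only file system"*) echo "EROFS" ;;   # seen after the remount
        *)                         echo "other" ;;
    esac
}

# Example use on the NFS client (left commented out because it fills the quota):
# msg=$(LC_ALL=C dd if=/dev/zero of=oinkage.dat bs=1M 2>&1)
# classify_write_error "$msg"
```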

In the test scenario, for a user with a 750 MB hard quota, in a non-failing
configuration, the user gets EDQUOT when the file has reached 770 or 780 MB,
i.e. some overage is already committed to be written by the time the quota
system realizes that the user is over quota.  In a failing
configuration the overage is much greater and more variable, for
example 1.4 GB.  In the production environment we don't have good
statistics on which error the users received (beyond EROFS, read-only
filesystem) or how far over quota they went, but in at least one
case the quota was completely ineffective.
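
For the non-failing configuration, the overage is simply the final file size
minus the hard quota.  A small sketch of that arithmetic, using the figures
quoted above; the helper name overage_mb is hypothetical.

```shell
#!/bin/sh
# overage_mb QUOTA_MB SIZE_MB: how far past the hard quota the file grew.
overage_mb() {
    quota_mb=$1
    size_mb=$2
    echo $(( size_mb - quota_mb ))
}

overage_mb 750 770   # prints 20
overage_mb 750 780   # prints 30
```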

It is not proven whether rapid writing is essential to setting off the
bug.  It is possible that a process writing slowly would be equally
poisonous, but that the user would have noticed and killed it before
going over quota.

To set off the failure:
1.  These notes were taken on kernel 3.11.6-4, but 3.11.10-11.1 was
    later tested and behaved the same.
2.  Create an ext3 filesystem on the NFS server.  For us it's
    mounted on /s1 .
3.  Create /s1/aquota.user and populate it with our standard quotas.
    For this test the user is mathtest and his soft block quota is
    500 MB, hard quota is 750 MB.  (Also mount with the "quota" option
    and run "quotaon /s1".)
4.  Export /s1 in /etc/exports.
5.  Executing on the client, create a directory for each of 500 users
    containing some random files totalling about 2 MB per user.  Nobody
    is over quota.  No errors reported.  This step is probably not
    essential but I'm just documenting what I did.
6.  Executing on the client, fill up mathtest's file rapidly.  The
    exact command line was:
    su -c "dd if=/dev/zero of=/net/ocelot/s1/fs-abuse.d/mathtest/oinkage.dat bs=1M" mathtest
7.  The writer invariably gets EIO (I/O error), not EDQUOT (quota
    exceeded).
8.  Not a peep in syslog.
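
The steps above can be sketched as a script.  This is only an outline under
several assumptions that are not in the report: the device name /dev/sdX1 is
a placeholder, the quota file is created with quotacheck and filled with
setquota (we actually copy in our standard quotas), and 500 MB / 750 MB are
taken to mean MiB when converted to the 1 KiB blocks setquota expects.  By
default it only prints the commands.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}
DEV=${DEV:-/dev/sdX1}    # placeholder device name, an assumption

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Convert MiB to the 1 KiB blocks used for setquota block limits.
mib_to_blocks() { echo $(( $1 * 1024 )); }

server_setup() {
    run mkfs.ext3 "$DEV"
    run mount -o quota "$DEV" /s1
    run quotacheck -cu /s1    # creates /s1/aquota.user
    run setquota -u mathtest "$(mib_to_blocks 500)" "$(mib_to_blocks 750)" 0 0 /s1
    run quotaon /s1
    run exportfs -ra          # after adding /s1 to /etc/exports
}

client_fill() {
    run su -c "dd if=/dev/zero of=/net/ocelot/s1/fs-abuse.d/mathtest/oinkage.dat bs=1M" mathtest
}

server_setup
client_fill
```

Run server_setup on the NFS server and client_fill on the client; set
DRY_RUN=0 only on a scratch machine, since mkfs.ext3 destroys the device
contents.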

I'm assuming without proof that if I had continued to abuse the 
filesystem, or done some different sequence of I/O operations, I 
would have gotten the filesystem to remount readonly.  The following 
information comes from the production failures.  

The home directory filesystems are on RAID-5 with MegaRAID cards,
formatted with ext3.  Only two particular filesystems on two servers
remount themselves read-only.  When this happens, ~/.xsession-errors for
two users (one per filesystem) is found to be bloated due to an error
message loop in an X-Windows app they use.  The problem began
immediately after the servers were upgraded from openSUSE 11.4 with
kernel 2.6.37 to openSUSE 13.1 with kernel 3.11.10-7.1.  (A later
downgrade to 3.11.6-4 didn't help.)  See the attachment for syslog
messages upon one of the remounts.
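
Whether a filesystem has gone read-only can be checked on the server from
/proc/mounts, without relying on syslog.  A minimal sketch; the function name
mount_state is hypothetical, and it matches the option "ro" exactly so that
an option such as errors=remount-ro is not mistaken for it.

```shell
#!/bin/sh
# Read /proc/mounts-format lines on stdin and print "ro", "rw", or
# "absent" for the given mount point.  The last matching line wins.
mount_state() {
    awk -v mp="$1" '
        $2 == mp {
            n = split($4, opts, ",")
            state = "rw"
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") state = "ro"
        }
        END { print (state == "" ? "absent" : state) }
    '
}

# Usage on the server:
# mount_state /s1 < /proc/mounts
```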

Workarounds:
1.  Fix the malfunctioning app so users don't go over quota. 
2.  Disable quotas on the affected filesystems.
3.  Upgrade from ext3 to ext4 (big job).
4.  Use the default NFS protocol, NFSv4.  Bugfixes in 3.11.10-11.1
    apparently have resolved the idmapd upcall problems that were
    messing us up formerly, and completing the upgrades to openSUSE 13.1
    has made the name translation problems a non-issue.  This is the
    workaround we are currently using.  Success so far, fingers crossed.
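
To confirm that a client really is using NFSv4 (workaround 4) and not
Nfsvers=3, the negotiated version can be read from the mount options in
/proc/mounts.  A minimal sketch; the function name nfs_vers is hypothetical,
and the mount point /net/ocelot/s1 is inferred from the dd path in the test
scenario.

```shell
#!/bin/sh
# Read /proc/mounts-format lines on stdin and print the NFS protocol
# version of the given mount point, or "unknown" if none is listed.
nfs_vers() {
    awk -v mp="$1" '
        $2 == mp && $3 ~ /^nfs/ {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^(nfs)?vers=/) {
                    sub(/^(nfs)?vers=/, "", opts[i])
                    v = opts[i]
                }
        }
        END { print (v == "" ? "unknown" : v) }
    '
}

# Usage on the client:
# nfs_vers /net/ocelot/s1 < /proc/mounts
```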
