https://bugzilla.novell.com/show_bug.cgi?id=880599
https://bugzilla.novell.com/show_bug.cgi?id=880599#c0

           Summary: nfsv3 + ext3 + quotas = error -5 when attempting revoke
    Classification: openSUSE
           Product: openSUSE 13.1
           Version: Final
          Platform: x86-64
        OS/Version: openSUSE 13.1
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Network
        AssignedTo: bnc-team-screening@forge.provo.novell.com
        ReportedBy: jimc@math.ucla.edu
         QAContact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Versions:
    kernel-desktop-3.11.10-11.1.x86_64, also 3.11.6, also kernel-default
    nfs-client-1.2.8-4.13.1.x86_64
    nfs-kernel-server-1.2.8-4.13.1.x86_64
    quota-4.01-2.1.2.x86_64
    quota-nfs-4.01-2.1.2.x86_64

The problem occurs when all of these elements are present:

1. The NFS server exports an ext3 (not ext4) filesystem.
2. User block quotas are turned on for this filesystem.
3. The client mounts it with Nfsvers=3 (not 4) in /etc/nfsmount.conf.
4. The user, executing on the client, writes a lot of data to a file rapidly and runs himself over quota. In production, some X-Windows app spews error messages and fills up ~/.xsession-errors. In the test scenario I do:

       dd if=/dev/zero of=oinkage.dat bs=1M

The correct outcome is that the writer should get EDQUOT (quota exceeded) and die. Two actual outcomes are seen. In the test scenario given below, the writer gets EIO (I/O error). In the more complex production environment, the user may not receive any error; in one case, ~/.xsession-errors had 291 GB (gigabytes) of garbage. But the filesystem eventually remounts itself readonly and the homedir server has to be rebooted. In no case was any corruption ever found in the filesystem.

In the test scenario, for a user with a 750 MB quota, in a non-failing configuration, the user will get EDQUOT when the file has reached 770 or 780 MB, i.e. some overage is already committed to be written when the quota system realizes that the user is over quota.
In a failing configuration the overage is much greater and more variable, for example 1.4 GB. In the production environment we don't have good statistics for what error the users received (beyond EROFS, readonly filesystem) or how far over quota they went, but in at least one case the quota was completely ineffective.

It is not proven whether rapidness is an essential element in setting off the bug. It is possible that a process writing slowly would be equally poisonous but would have been noticed and killed by the user before he went over quota.

To set off the failure:

1. These notes were made on kernel 3.11.6-4, but 3.11.10-11.1 was later tested and behaved the same.
2. Create an ext3 filesystem on the NFS server. For us it's mounted on /s1.
3. Create /s1/aquota.user and populate it with our standard quotas. For this test the user is mathtest; his soft block quota is 500 MB and his hard quota is 750 MB. (Also give the mount option "quota" and do "quotaon /s1".)
4. Export /s1 in /etc/exports.
5. Executing on the client, create a directory for each of 500 users containing some random files totalling about 2 MB per user. Nobody is over quota. No errors reported. This step is probably not essential, but I'm just documenting what I did.
6. Executing on the client, fill up mathtest's file rapidly. The exact command line was:

       su -c "dd if=/dev/zero of=/net/ocelot/s1/fs-abuse.d/mathtest/oinkage.dat bs=1M" mathtest

7. The writer invariably gets EIO (I/O error), not EDQUOT (quota exceeded).
8. Not a peep in syslog.

I'm assuming without proof that if I had continued to abuse the filesystem, or done some different sequence of I/O operations, I would have gotten the filesystem to remount readonly.

The following information comes from the production failures. The home directory filesystems are on RAID-5 with MegaRAID cards, formatted with ext3. Only two particular filesystems on two servers remount themselves readonly.
When this happens, ~/.xsession-errors for two users (one per filesystem) is found to be bloated due to an error message loop in an X-Windows app they use. The problem began immediately after the servers were upgraded from openSUSE 11.4 with kernel 2.6.37 to openSUSE 13.1 with kernel 3.11.10-7.1. (A later downgrade to 3.11.6-4 didn't help.) See the attachment for syslog messages upon one of the remounts.

Workarounds:

1. Fix the malfunctioning app so users don't go over quota.
2. Disable quotas on the affected filesystems.
3. Upgrade from ext3 to ext4 (big job).
4. Use the default NFS protocol, NFSv4. Bugfixes in 3.11.10-11.1 apparently have resolved the idmapd upcall problems that were messing us up formerly, and finishing the upgrades to openSUSE 13.1 makes the name translation problems a non-issue. This is the workaround we are currently using. Success so far, cross fingers.
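
For anyone trying to confirm which error the writer actually receives, and how far past the quota it got, here is a minimal Python sketch of the dd test from the reproduction steps. It is not what was run for this report (that was plain dd); the names fill_until_error and classify are hypothetical helpers made up for illustration. It assumes the errno seen by the process is the interesting signal, and it calls fsync at the end because NFS clients can report server-side write errors at fsync or close rather than at write.

```python
import errno
import os


def fill_until_error(path, chunk_size=1 << 20, max_bytes=None):
    """Write zero-filled chunks to `path` until a write fails.

    Returns (bytes_written, errno_name). errno_name is None when the
    optional max_bytes cap was reached without any error.
    """
    chunk = b"\0" * chunk_size
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while max_bytes is None or written < max_bytes:
            n = chunk_size if max_bytes is None else min(chunk_size, max_bytes - written)
            try:
                written += os.write(fd, chunk[:n])
            except OSError as exc:
                return written, errno.errorcode.get(exc.errno, str(exc.errno))
        # Deferred errors (common on NFS) may only show up when flushing.
        try:
            os.fsync(fd)
        except OSError as exc:
            return written, errno.errorcode.get(exc.errno, str(exc.errno))
        return written, None
    finally:
        try:
            os.close(fd)
        except OSError:
            pass  # a close-time error is ignored in this sketch


def classify(errno_name):
    """Map an errno name onto the outcomes described in this report."""
    if errno_name is None:
        return "no error"
    if errno_name == "EDQUOT":
        return "expected: quota enforced"
    if errno_name == "EIO":
        return "bug symptom: I/O error instead of EDQUOT"
    return "other error: " + errno_name
```

To mirror step 6, run it as the quota-limited user against the NFS mount, for example fill_until_error("/net/ocelot/s1/fs-abuse.d/mathtest/oinkage.dat"), and pass the second element of the result to classify. Leaving max_bytes unset writes until an error occurs, so only do that on a filesystem where running over quota is the intent.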