[opensuse-programming] ext3 - ENOMEM on file write
Greetings,

Wondering if anyone here has some experience with this. I have been trying to track this down and have no idea how this is happening. It's not making any sense at all. We have a case where a system is doing successive reads and writes to a file. Here is the specific code:

--snip--
First, we're given an inode 'ainode', which should be the correct inode for the file we're looking at. (If it were incorrect, we would have gotten an error much earlier.) If we have iget, we call iget. The 2.6.16.60-* kernels lack iget, I believe, so instead we do:

    fid.i32.ino = ainode;
    fid.i32.gen = 0;
    dp = afs_cacheSBp->s_export_op->fh_to_dentry(afs_cacheSBp, &fid, sizeof(fid), FILEID_INO32_GEN);
    filp = dentry_open(dp, mntget(afs_cacheMnt), O_RDWR);

(I'm not including error-checking. afs_cacheSBp is the superblock for the cache filesystem, so ext2 or ext3. afs_cacheMnt is the vfsmount for the cache FS.)

Then, to write to the file we do basically the following. (We're writing 'count' bytes from 'buf' to 'offset' in file 'filp'; I'm paraphrasing the code here, but I think I'm maintaining what it does.)

    mm_segment_t _fs_space_decl;
    int code = 0;

    savelim = current->TASK_STRUCT_RLIM[RLIMIT_FSIZE].rlim_cur;
    current->TASK_STRUCT_RLIM[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;

    _fs_space_decl = get_fs();
    set_fs(get_ds());

    if (filp->f_op->llseek) {
        if (filp->f_op->llseek(filp, offset, 0) != offset)
            return -1;
    } else {
        filp->f_pos = offset;
    }

    while (code == 0 && count > 0) {
        code = filp->f_op->write(filp, buf, count, &filp->f_pos);
        if (code < 0) {
            code = -code;
            break;
        } else if (code == 0) {
            code = EIO;
            break;
        }
        buf += code;
        count -= code;
        code = 0;
    }

    set_fs(_fs_space_decl);
    current->TASK_STRUCT_RLIM[RLIMIT_FSIZE].rlim_cur = savelim;
    return code;

It is here we are getting -ENOMEM from filp->f_op->write.
Finally, to close a file we normally do this:

    if (filp->f_dentry->d_inode) {
        filp_close(filp, NULL);
    }
--snip--

The file is not intended to be written to constantly, but in this case it probably is written to several times in succession (due to certain parameters being set a bit low, and the high load for these clients). I believe this function was called about 644203 times in the core I'm looking at, which means the file was written to at least around 644000 times. I'm assuming most of those writes came right after one another. However, there would almost always be several reads of the same file between successive writes (again, in an 'open(); read(); close();' fashion). They are probably all happening very quickly; I assume the cache for the data in this file is thrashing.

How can we avoid this or make this much more streamlined? Any help would be appreciated.

Thanks,
Cameron

-- 
To unsubscribe, e-mail: opensuse-programming+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse-programming+help@opensuse.org
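One detail in the write path as paraphrased above: the early `return -1` on a failed seek leaves the function before set_fs(_fs_space_decl) runs and before the saved RLIMIT_FSIZE value is put back. A minimal userspace sketch of a single-exit version follows. It is an illustration under stated assumptions, not the module's code: it uses the ordinary getrlimit/setrlimit/lseek/write calls instead of the kernel interfaces, write_at_with_limit is a hypothetical name, and raising the soft limit to the hard limit stands in for RLIM_INFINITY.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Userspace analogue (hypothetical helper): raise the soft RLIMIT_FSIZE to
 * the hard limit, seek, write, and restore the saved limit on every exit
 * path. Returns 0 on success or a positive errno value on failure.
 */
static int write_at_with_limit(int fd, const char *buf, size_t count, off_t offset)
{
    struct rlimit saved, raised;
    int code = 0;

    assert(buf != NULL);

    if (getrlimit(RLIMIT_FSIZE, &saved) != 0)
        return errno;               /* nothing changed yet, plain return is safe */
    raised = saved;
    raised.rlim_cur = saved.rlim_max;
    if (setrlimit(RLIMIT_FSIZE, &raised) != 0)
        return errno;               /* limit was not modified */

    if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {
        code = errno;
        goto out;                   /* do not return directly: limit must be restored */
    }

    while (count > 0) {
        ssize_t n = write(fd, buf, count);
        if (n < 0) {
            code = errno;
            goto out;
        }
        if (n == 0) {
            code = EIO;
            goto out;
        }
        buf += n;
        count -= (size_t)n;
    }

out:
    /* Single exit point: the saved soft limit is always put back. */
    setrlimit(RLIMIT_FSIZE, &saved);
    return code;
}
```

Calling write_at_with_limit(fd, data, len, 0) on a regular file writes len bytes at offset 0; on any failure the caller gets the errno value and the process limit is left as it was found.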
On Thu, 2010-04-29 at 18:41 -0600, Cameron Seader wrote:
    while (code == 0 && count > 0) {
        code = filp->f_op->write(filp, buf, count, &filp->f_pos);
        if (code < 0) {
            code = -code;
            break;
        } else if (code == 0) {
            code = EIO;
            break;
        }
        buf += code;
        count -= code;
        code = 0;
    }
If I understand your code, you are not using the return value from write() properly. write() returns how many characters were written, or -1 on error. It never ever contains any error information beyond that. When the return value does not match 'count', you must use 'errno' to see what the problem was. The action to take beyond that depends on whether the file descriptor is blocking or not. For a disk file, and looking at your sample code, your writes are blocking. So the only two possible return values from write() are -1 and 'count'. If you get -1, then you should set code to 'errno'.

-- 
Roger Oberholtzer
OPQ Systems / Ramböll RST
Ramböll Sverige AB
Krukmakargatan 21
P.O. Box 17009
SE-104 62 Stockholm, Sweden
Office: Int +46 10-615 60 20
Mobile: Int +46 70-815 1696
That sounds like a description of the write() system call as called from userspace. We are not calling the write() syscall; we are calling ext3's f_op->write function from within the kernel. Per the Linux kernel's conventions, which have been around forever as far as I know, error codes are returned as -errno. That certainly seems to be what we're seeing here, as we are rather clearly getting the value -12 returned from the f_op->write call.

Any other thoughts?

-Cameron
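The two return conventions discussed here can be put side by side. The sketch below is a userspace illustration only, assuming nothing beyond the POSIX write(2) call: write_neg_errno is a hypothetical wrapper that turns the userspace convention (-1 plus errno) into the in-kernel one (a byte count or -errno), and write_all has the same shape as the loop in the original post, returning 0 or a positive errno value.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: report write(2) results the way in-kernel file
 * operations do, i.e. a byte count on success or -errno on failure.
 */
static ssize_t write_neg_errno(int fd, const void *buf, size_t count)
{
    ssize_t n = write(fd, buf, count);
    return n < 0 ? -(ssize_t)errno : n;
}

/*
 * Same shape as the loop in the thread: keep writing until 'count' bytes
 * are out. Returns 0 on success or a positive errno value on failure.
 */
static int write_all(int fd, const char *buf, size_t count)
{
    assert(buf != NULL);

    while (count > 0) {
        ssize_t n = write_neg_errno(fd, buf, count);
        if (n < 0)
            return (int)-n;         /* a return of -12 becomes ENOMEM on Linux */
        if (n == 0)
            return EIO;             /* no progress: report an I/O error */
        buf += n;                   /* short write: advance and try again */
        count -= (size_t)n;
    }
    return 0;
}
```

With this convention a caller never consults errno: the sign of the return value says whether the call failed, and its magnitude says why.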
On 4/30/2010 at 06:27 AM, in message <1272630423.7641.121.camel@acme.pacific>, Roger Oberholtzer <roger@opq.se> wrote:
If I understand your code, you are not using the return value from write() properly.
write() returns how many characters were written, or -1 on error. It never ever contains any error information beyond that. When the return value does not match 'count', you must use 'errno' to see what the problem was. The action to take beyond that depends on whether the file descriptor is blocking or not. For a disk file, and looking at your sample code, your writes are blocking. So the only two possible return values from write() are -1 and 'count'. If you get -1, then you should set code to 'errno'.
participants (2)
- Cameron Seader
- Roger Oberholtzer