Hi,
The Open Group specification says that write() is atomic as long
as the number of bytes is not larger than PIPE_BUF. In the following
program, sometimes only one process successfully writes to the file.
I thought that fprintf also uses write() underneath, so the file
should contain strings from both processes. Any insight?
#include
On 3/28/06, Verdi March
Hi,
The open group specification says that write() is atomic as long as the number of bytes is not larger than PIPE_BUF.
Yes, it does, but it is only specified for writes to pipes or FIFOs. In case of files, you'll need to use some kind of file/record locking. \Steve
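As a small illustration of Steve's point that the PIPE_BUF guarantee belongs to pipes and FIFOs, the limit can be queried for an actual pipe with fpathconf(). This is only a sketch; the function name pipe_buf_limit is made up for the example.

```c
#include <assert.h>
#include <unistd.h>

/* Returns the PIPE_BUF limit reported for a freshly created pipe,
 * or -1 if the pipe cannot be created or the limit cannot be queried. */
long pipe_buf_limit(void)
{
    int fds[2];
    long limit;

    if (pipe(fds) != 0)
        return -1;
    assert(fds[0] >= 0 && fds[1] >= 0);      /* sanity check on the new descriptors */
    limit = fpathconf(fds[1], _PC_PIPE_BUF); /* limit for atomic writes to this pipe */
    close(fds[0]);
    close(fds[1]);
    return limit;
}
```

POSIX requires the value to be at least 512 bytes; writes to the pipe of at most that many bytes are not interleaved with writes from other processes. Nothing comparable is promised here for regular files.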
On Tue, 2006-03-28 at 07:39 +0200, Verdi March wrote:
Hi,
The open group specification says that write() is atomic as long as the number of bytes is not larger than PIPE_BUF. In the following program, sometimes only one process successfully writes to the file. I thought that fprintf also uses write() underneath, so the file should contains strings from both processes. Any insight?
Sorry for sending the reply off-list. I'll repeat, for the benefit of the group:

The meaning of "atomic" is that you won't get a task switch in the middle of the call; it is guaranteed to complete once it's started, before another process gets to call it.

The reason you only get output from one process at times is that one of the processes closes the file before the other one gets to write to it. You need to have a procedure in place for making sure that all processes have finished writing before you call fclose.

If you had checked the return value of fprintf, you would have seen errors because the process is trying to write to a file that's closed. Always check return values. It's just good programming.
Hi,
--- Original Message --- From: Anders Johansson
To: suse-programming-e@suse.com Subject: Re: [suse-programming-e] The meaning of atomic in write() Date: Tue, 28 Mar 2006 20:04:06 +0200
The reason you only get output from one process at times is that one of the processes closes the file before the other one gets to write to it. You need to have a procedure in place for making sure that all files have finished writing before you call fclose
From my understanding, the second (child) process inherits its parent's file descriptor f. Therefore, both f in parent and child point to the same open-file structure (according to UNIX book). I thought that kernel reclaims an open-file structure only when it's not pointed anymore by descriptors?
If you had checked the return value of fprintf, you would have seen errors because the process is trying to write to a file that's closed. Always check return values. It's just good programming
Each call to fprintf will always succeed because it sends the string to a buffer allocated by the C library (which defaults to 8KB on my system). Because the string is less than 8KB, fprintf() does not result in a write(). The write() is issued only during fclose(), and I've confirmed this with strace.

I've modified the program to assert(fclose(f) == 0) (and assert(fprintf >= 0) just to make sure). But still, I get the same result: even if only one process successfully writes, no assertion is violated.

In the beginning I thought that this would not happen due to write() atomicity, just as Anders said. This means the second write() continues writing from the offset after the first write(), resulting in test.txt containing strings from both processes. If test.txt contains only the strings of one process, this implies that write() is not atomic, because the two writes "overlap" (both somehow start writing from the beginning of the file).

Earlier, Steve mentioned that the atomicity is valid only for FIFOs or pipes. But I vaguely remember from a newsgroup thread that the atomicity also holds for files?

I realize that the "text-book" solution is to lock the file before writing. Initially this program was just meant to show the effect of user-level I/O buffering by the C library. But now I have become more curious about this concurrent file access.

-- Regards, Verdi
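The buffering effect Verdi describes (fprintf() only fills the stdio buffer, and the write() happens at flush or close time) can be observed with a sketch like the one below. The function name sizes_around_flush and the test string are made up for the example, and full buffering is requested explicitly so the result does not depend on the library's defaults.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Writes a short string with fprintf() to a fully buffered temporary file and
 * stores the file size seen by the kernel before and after fflush().
 * Returns 0 on success, -1 on error. */
int sizes_around_flush(long *before, long *after)
{
    struct stat st;
    FILE *f;

    assert(before != NULL && after != NULL);
    f = tmpfile();
    if (f == NULL)
        return -1;
    if (setvbuf(f, NULL, _IOFBF, BUFSIZ) != 0)   /* full buffering */
        goto fail;
    if (fprintf(f, "hello from stdio\n") < 0)    /* 17 bytes, far below BUFSIZ */
        goto fail;
    if (fstat(fileno(f), &st) != 0)
        goto fail;
    *before = (long)st.st_size;                  /* data is still in the stdio buffer */
    if (fflush(f) != 0)                          /* the buffer is handed to write(2) here */
        goto fail;
    if (fstat(fileno(f), &st) != 0)
        goto fail;
    *after = (long)st.st_size;
    fclose(f);
    return 0;
fail:
    fclose(f);
    return -1;
}
```

With a string this short, the size before the flush is expected to be 0 and the size afterwards 17, which matches what strace showed in the message above: no write() until the stream is flushed or closed.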
On Wed, 2006-03-29 at 09:08 +0200, Verdi March wrote:
From my understanding, the second (child) process inherits its parent's file descriptor f. Therefore, both f in parent and child point to the same open-file structure (according to UNIX book). I thought that kernel reclaims an open-file structure only when it's not pointed anymore by descriptors?
Well, you are closing it explicitly, but you are right, I am an idiot. The child gets a new copy of the fd which is almost independent of the original, so it stays open. This is also why you get the effect you see, because both processes will have a local file location pointer, so both will start at position 0, the second one overwriting the first.
In the beginning I thought that this would not happen due to write() atomicity, just as Anders said. This means, the second write() continues writing from the offset after the first write(), resulting in test.txt to contains strings from both processes.
No, that's not what I said. Atomic means "uninterrupted". It means each call to write() will be allowed to complete before another call gets access to it. However, what I missed in my hazy thinking was that the file descriptor is copied, not shared, so each process has its own idea of where to write in the file. File locking or record locking won't help in this case; you need some other way of letting the processes cooperate, so they know where in the file to write.
Hi,
--- Original Message --- From: Anders Johansson
To: suse-programming-e@suse.com Subject: Re: [suse-programming-e] The meaning of atomic in write() Date: Wed, 29 Mar 2006 18:28:23 +0200
No, that's not what I said. Atomic means "uninterrupted". It means each call to write() will be allowed to complete before another call gets access to it.
However, what I missed in my hazy thinking was that the file descriptor is copied, not shared, so each process has its own idea of where to write in the file. file locking or record locking won't help in this case, you need some other way of letting the processes cooperate, so they know where in the file to write
I think you're right, that "atomic" does not cover the location
to start writing. If I changed my program a little bit:
#include
Hi there, On Thu, 30 Mar 2006, 08:41:52 +0200, Verdi March wrote:
Hi,
--- Original Message --- From: Anders Johansson
To: suse-programming-e@suse.com Subject: Re: [suse-programming-e] The meaning of atomic in write() Date: Wed, 29 Mar 2006 18:28:23 +0200
No, that's not what I said. Atomic means "uninterrupted". It means each call to write() will be allowed to complete before another call gets access to it.
However, what I missed in my hazy thinking was that the file descriptor is copied, not shared, so each process has its own idea of where to write in the file. file locking or record locking won't help in this case, you need some other way of letting the processes cooperate, so they know where in the file to write
I think you're right, that "atomic" does not cover the location to start writing. If I changed my program a little bit: [...]
I'm afraid everyone in this thread expects something that *buffered I/O* cannot provide. Opening a file using "fopen" creates a FILE structure with some buffer attached to it - unless the file to be opened appears to be an interactive device. If you really want to use the high-level (3) routines (fopen, fprintf, fread, fwrite, ...) _and_ want to ensure that no buffering inside a library is getting in your way, you need to, at least, call "setbuf (fp, NULL)" to ensure *no* buffering will happen.

BUT, it will not change anything wrt/ HOW the underlying kernel will deal with (non synchronous) writes going to the file - that's what you've seen. IF you want to change this, you need to take care there's a proper protocol for dealing with synchronous I/O used by your application; did you try something like

  fd = open ("path", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
  fp = fdopen (fd, "w");

(proper checking of return codes assumed!)?

Still, according to the "open(2)" manual page:

  O_DIRECT
    Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The I/O is synchronous, i.e., at the completion of a read(2) or write(2), data is guaranteed to have been transferred. Under Linux 2.4 transfer sizes, and the alignment of user buffer and file offset must all be multiples of the logical block size of the file system. Under Linux 2.6 alignment to 512-byte boundaries suffices. A semantically similar interface for block devices is described in raw(8).

there is absolutely NO guarantee for any data smaller than 512 bytes to appear in the file in the same order you might have intended. If you really need this, there is no other way than to "do their own caching".
Getting back to the original poster's question, "fprintf" is using "write" at _some_ point in time, exactly when it'll do depends on something like the associated buffer's size... HTH, cheers. l8er manfred
On Friday 31 March 2006 7:37 am, Manfred Hollstein wrote:
I'm afraid everyone in this thread expects something that *buffered I/O* cannot provide.

Not everyone :-)

[...]
Getting back to the original poster's question, "fprintf" is using "write" at _some_ point in time, exactly when it'll do depends on something like the associated buffer's size...

I agree pretty much with your statement. The OP was opening a file for writing,
then performing a fork where both the
parent and child were writing to the file simultaneously. I took the
program that the OP submitted and ran it on SuSE 10 (32-bit), and ran it on
RHEL 4 (64-bit IA64). Both had the same successful result without the
overwriting that the OP experienced. (In both cases, 2.6 kernel).
The issue that comes into play is that the buffers for stream I/O (e.g. FILE)
are user space buffers that are flushed based on a number of rules. The
flush might occur in the fprintf, or when the user calls fflush(3), or when
the user closes the file. But there are also kernel buffers, and some
data can remain in kernel buffers for a short time before those are
physically written. This is a function of the driver and of when the sync(2)
system call commits the buffers.
--
Jerry Feldman
On Fri, 2006-03-31 at 08:00 -0500, Jerry Feldman wrote:
I agree pretty much with your statement. The OP was opening a file for writing, then performing a fork where both the parent and child were writing to the file simultaneously. I took the program that the OP submitted and ran it on SuSE 10 (32-bit), and ran it on RHEL 4 (64-bit IA64). Both had the same successful result without the overwriting that the OP experienced. (In both cases, 2.6 kernel).
Are you implying that the file location pointer is shared across a dup() call in these systems?
On Friday 31 March 2006 1:46 pm, Anders Johansson wrote:
On Fri, 2006-03-31 at 08:00 -0500, Jerry Feldman wrote:
I agree pretty much with your statement. The OP was opening a file for writing, then performing a fork where both the parent and child were writing to the file simultaneously. I took the program that the OP submitted and ran it on SuSE 10 (32-bit), and ran it on RHEL 4 (64-bit IA64). Both had the same successful result without the overwriting that the OP experienced. (In both cases, 2.6 kernel).
Are you implying that the file location pointer is shared across a dup() call in these systems?

No. A fork(2) clones the parent's environment. Actually, they share the same memory, but the pages are marked copy-on-write, which means that when there is a change, the page is copied before it is dirtied, so the behavior is similar to what you would get if the copy had been made up front.
What I am implying is that after the fork(2) occurs, and both processes are
writing to the same file, data integrity is maintained. I need to qualify
this a bit more in that the write(2) system call is atomic, but the two
processes' data may be interleaved depending on the amount of data.
This is the result I get on both SuSE 10 (32-bit single processor) and a
dual processor IA64 (HP Integrity 1620).
gaf@cedar C]$ cat test.txt
28483: 1
28483: 2
28483: 3
bufsiz=8192 pipe_buf=4096
Parent is process 28482
28482: 1
28482: 2
28482: 3
In the case of dup(2), you are simply making a copy of a file descriptor.
The file descriptors all point to the same open file description, so that:
fd2 = dup(fd1);
write(fd1, buf, count); // writes at the current location and bumps the location pointer by count
write(fd2, buf, count); // writes at the updated location pointer set by the previous write
in this case:
fd1 = open(...);
pid = fork();
if (pid == 0) { // child
    write(fd1, ...); // updates location
    ...
    close and exit
} else if (pid > 0) { // parent
    write(fd1, ...); // updates location
    ...
    close
    wait
    exit
} else { // error (pid == -1)
    report error and exit
}
In the above code, there is a race condition in that either the parent or
the child writes first, so the second data written will be written to a
location beyond where the first data was written. Whether the parent or
child writes first depends on a number of things.
--
Jerry Feldman
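The behaviour Jerry describes (parent and child referring to one open file description after fork(2), and so to one location pointer) can be checked with a sketch along these lines. The function name and the temporary file template are made up for the example, and the parent deliberately waits for the child so that the two writers do not race.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks, lets the child write msg through the inherited descriptor, waits for
 * the child, and returns the file offset the parent sees afterwards.
 * Returns -1 on error. */
off_t parent_offset_after_child_write(const char *msg)
{
    char path[] = "/tmp/offset-demo-XXXXXX";   /* hypothetical temp file template */
    int fd, status;
    pid_t pid;
    off_t pos;

    assert(msg != NULL);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                  /* the file stays usable until the last close */

    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {                /* child: write once and leave */
        ssize_t n = write(fd, msg, strlen(msg));
        _exit(n == (ssize_t)strlen(msg) ? 0 : 1);
    }
    /* parent: wait first, so there is no race between the two processes */
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0) {
        close(fd);
        return -1;
    }
    pos = lseek(fd, 0, SEEK_CUR);  /* parent never wrote, yet the offset has moved */
    close(fd);
    return pos;
}
```

If the descriptors had independent location pointers, the parent would still see offset 0 here. The sketch says nothing about what happens when both processes call write() at the same moment, which is the case the thread is arguing about.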
Hi, On Saturday 01 April 2006 03:23, Jerry Feldman wrote:
In the above code, there is a race condition in that either the parent or the child writes first, so the second data written will be written to a location beyond where the first data was written. Whether the parent or child writes first depends on a number of things.
yep, the lesson that I learned. Initially I expected that 'atomic' covers both the location pointer and the 'no interleaving of bytes up to a certain size'. Turns out that the first is not covered. And the occurrence of the race condition varies among platforms. I used a shell script that repeatedly executes the program. The shell script stops once the race condition occurs. On SUSE 9.3 and FC4 (64-bit Opteron), the race condition can occur pretty fast, while on SunOS it does not occur (but maybe that's because I didn't run the script long enough). -- Regards, Verdi
On Saturday 01 April 2006 17:33, Verdi March wrote:
I used a shell script that repeatedly executes the program. The shell script stops until the race condition occurs.
To be specific, stops when the file contains "overlapping" output. -- Regards, Verdi
On Sat, 1 Apr 2006 17:33:00 +0800
Verdi March
yep, the lesson that I learned. Initially I expect that 'atomic' covers both the location pointer and the 'no interleaving of bytes up to a certain size'. Turns out that the first is not.
And the occurrence of the race condition varies among platforms. I used a shell script that repeatedly executes the program. The shell script stops until the race condition occurs. On SUSE 9.3 and FC4 (64-bit Opteron), the race condition can occur pretty fast, while on SunOS it does not occur (but maybe it's because I didn't run the script long enough).

The location pointer should be atomic. I did not run a recent stress test on the 2.6 kernel. One of the problems is that you used stream I/O, where the buffering is both in user space and underneath in kernel space. The system calls open(2), dup(2), write(2), close(2) are atomic. And as I mentioned, they refer to the same single kernel open file structure. But the stream I/O functions are library functions and are not guaranteed to be atomic. There are a number of methods in Linux to guarantee atomicity. You can use file locking, fcntl(2), flock(2), lockf(3):
child:
flock(2) // set the lock
fprintf(3)
fflush(3)
flock(2) // release the lock
...
Parent does the same thing.
You still have a race condition as to whose data is going to be written
first, but that is an application decision.
Another way is to write your own function using the vsprintf(3)
function. I know that using a fixed size buffer here is unsafe, but I'm
using it for the example. In the function, below, by using write(2) you
are bypassing the stream's file structure (and its own location
pointer).
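#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

#define MYPRINTF_BUFSIZ 1024   /* arbitrary fixed size: unsafe, example only */

int myprintf(FILE *stream, const char *fmt, ...)
{
    va_list lp;
    int rc;
    ssize_t wrc;
    char buf[MYPRINTF_BUFSIZ];

    va_start(lp, fmt);
    rc = vsprintf(buf, fmt, lp);   /* move stuff into buf */
    va_end(lp);
    if (rc < 0)                    /* vsprintf failed */
        return -1;
    wrc = write(fileno(stream), buf, (size_t)rc);
    if (wrc != (ssize_t)rc)        /* error or short write */
        return -1;
    return 0;
}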
--
Jerry Feldman
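The lock / write / unlock sequence from this message could look roughly like the sketch below when combined with a direct write(2). Note one substitution: it uses fcntl(2) record locks instead of flock(2), because flock(2) locks belong to the open file description, which a parent and its forked child share, whereas fcntl(2) locks belong to the process. The names whole_file_lock and locked_append are made up for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Sets (type = F_WRLCK) or clears (type = F_UNLCK) a lock on the whole file. */
static int whole_file_lock(int fd, short type)
{
    struct flock fl = {0};

    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                      /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLKW, &fl);   /* blocks until the request is granted */
}

/* Writes len bytes at the end of the file while holding the lock.
 * Returns 0 on success, -1 on error. */
int locked_append(int fd, const void *buf, size_t len)
{
    int ok;

    assert(buf != NULL);
    if (whole_file_lock(fd, F_WRLCK) != 0)
        return -1;
    ok = lseek(fd, 0, SEEK_END) != (off_t)-1;   /* position after existing data */
    if (ok)
        ok = write(fd, buf, len) == (ssize_t)len;
    if (whole_file_lock(fd, F_UNLCK) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

As with any advisory lock, this only helps if every writer goes through the same routine. Which process gets the lock first is still undecided, so the order of the records remains an application matter.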
Hi, On Saturday 01 April 2006 21:56, Jerry Feldman wrote:
The location pointer should be atomic. I did not run a recent stress test on the 2.6 kernel. One of the problems is that you used streamIO, where the buffering is both in user space and underneath in kernel space. The system calls, open(2), dup(2), write(2), close(2) are atomic. And as I mentioned, they refer to the same single kernel open file structure. But, the streamIO functions are library functions and are not guaranteed to be atomic.
[deleted]
Another way is to write your own function using the vsprintf(3) function. I know that using a fixed size buffer here is unsafe, but I'm using it for the example. In the function, below, by using write(2) you are bypassing the stream's file structure (and its own location pointer).
I tried out your suggestion to bypass C stdio. I purposely opened a file using O_WRONLY|O_CREAT|O_TRUNC only (no O_APPEND or O_DIRECT), then did the write without locking. As a result, I can still get the race condition where both processes write from the beginning of the file. It looks like one needs either O_APPEND or application-level handling to guarantee that the race condition won't happen. -- Regards, Verdi
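The O_APPEND variant mentioned above can be tried with a sketch like this one. With O_APPEND, POSIX requires each write() to position itself at the end of the file as part of the same operation, so neither process can overwrite the other even without locking. The function name, the two messages and any path passed in are made up for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Opens path with O_APPEND, forks, and lets parent and child each write one
 * line without any locking. Returns the final file size, or -1 on error. */
off_t append_from_two_processes(const char *path)
{
    const char *child_msg = "child line\n";
    const char *parent_msg = "parent line\n";
    struct stat st;
    ssize_t n;
    int fd, status;
    pid_t pid;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {                /* child: one write through the inherited fd */
        ssize_t c = write(fd, child_msg, strlen(child_msg));
        _exit(c == (ssize_t)strlen(child_msg) ? 0 : 1);
    }
    /* parent writes at the same time; the order of the lines is not defined */
    n = write(fd, parent_msg, strlen(parent_msg));
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0 || n != (ssize_t)strlen(parent_msg) ||
        fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return st.st_size;
}
```

Whichever process wins the race, the file should end up holding both lines (23 bytes in total); only their order varies from run to run.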
On Tuesday 04 April 2006 11:49 am, Verdi March wrote:
Hi,
On Saturday 01 April 2006 21:56, Jerry Feldman wrote:
The location pointer should be atomic. I did not run a recent stress test on the 2.6 kernel. One of the problems is that you used streamIO, where the buffering is both in user space and underneath in kernel space. The system calls, open(2), dup(2), write(2), close(2) are atomic. And as I mentioned, they refer to the same single kernel open file structure. But, the streamIO functions are library functions and are not guaranteed to be atomic.
[deleted]
Another way is to write your own function using the vsprintf(3) function. I know that using a fixed size buffer here is unsafe, but I'm using it for the example. In the function, below, by using write(2) you are bypassing the stream's file structure (and its own location pointer).
I tried out your suggestion to bypass C stdio. I purposely open a file using O_WRONLY|O_CREAT|O_TRUNC only (no O_APPEND or O_DIRECT), then do the write without locking. The result, I can still get the race condition where both processes write from the beginning of the file.

You should not get that in any case.

char buf[SOME_SIZE];
fd = open(filename, O_WRONLY|O_CREAT|O_TRUNC, 0644); // create the empty file
if ((pid = fork()) == 0) { // child
    sprintf(buf, "child pid is %d\n", getpid());
    write(fd, buf, strlen(buf));
} else if (pid > 0) { // parent
    sprintf(buf, "Parent pid is %d\n", getpid());
    write(fd, buf, strlen(buf));
} else {
    assert...
}
// both
for (i = 0; i < 4; i++) {
    sprintf(buf, "%d: Iteration %d\n", getpid(), i);
    write(fd, buf, strlen(buf));
}
close(fd);
exit(0);

The file descriptor points to the same file structure, so the location pointers should be correct. The race condition is simply who writes first. If you want, you can either post your code or send me the code at gaf@hp.com
Looks like use either O_APPEND or application-level handling to guarantee that the race condition won't happen.
--
Jerry Feldman
Hi Jerry, On Wednesday 05 April 2006 01:01, Jerry Feldman wrote:
You should not get that under any case.
[deleted]
The file descriptor points to the same file structure, so that the location pointers should be correct. The race condition is simply who writes first, If you want you can either post your code or send me the code at gaf@hp.com
I attach the C program, some sample outputs I've collected, and two shell scripts for stress-testing (one for detecting child-overwrites-parent, the other for parent-overwrites-child). Both shell scripts expect a.out in the current directory. The parent-overwrites-child happens more frequently than child-overwrites-parent. In fact, on an Opteron machine (Fedora), only parent-overwrites-child occurs, but not child-overwrites-parent (though I've waited long enough). -- Regards, Verdi
Thanks. I will take a close look at your code and try it on SuSE 10 (single 32-bit) and one of our HP Integrity servers (either RHEL 4 or SLES9) in the lab. On Wednesday 05 April 2006 10:39 am, Verdi March wrote:
The parent-overwrites-child happens more frequently than child-overwrites-parent. In fact, on an Opteron machine (Fedora), only parent-overwrites-child occurs, but not child-overwrites-parent (though I've waited long enough).
--
Jerry Feldman
I found a document from a reliable source that states that write(2) is
NOT atomic.
http://marc.theaimsgroup.com/?l=linux-kernel&m=107375454908544
"There are file descriptors that have atomicity guarantees (pipes),
but regular files do not".
Linus Torvalds.
Also note that I used flock(2) as a quick check, but I had forgotten
that flock(2) does not lock on NFS. It works quite well on local disks.
--
Jerry Feldman
Hi, On Saturday 08 April 2006 07:58, Jerry Feldman wrote:
I found a document from a reliable source that states that write(2) is NOT atomic. http://marc.theaimsgroup.com/?l=linux-kernel&m=107375454908544 "There are file descriptors that have atomicity guarantees (pipes), but regular files do not". Linus Torvalds.
Thanks, this should clarify everything. -- Regards, Verdi
On Sun, 9 Apr 2006 00:51:37 +0800
Verdi March
Hi,
On Saturday 08 April 2006 07:58, Jerry Feldman wrote:
I found a document from a reliable source that states that write(2) is NOT atomic. http://marc.theaimsgroup.com/?l=linux-kernel&m=107375454908544 "There are file descriptors that have atomicity guarantees (pipes), but regular files do not". Linus Torvalds.
Thanks, this should clarify everything.

I suggest that you use the lockf(3) or fcntl(2) functions to lock and unlock regions of a file before you write. flock(2) will work fine on local drives but not on NFS file systems. -- Jerry Feldman
Boston Linux and Unix user group http://www.blu.org PGP key id:C5061EA9 PGP Key fingerprint:053C 73EC 3AC1 5C44 3E14 9245 FB00 3ED5 C506 1EA9
On Wednesday 29 March 2006 2:08 am, Verdi March wrote:
From my understanding, the second (child) process inherits its parent's file descriptor f. Therefore, both f in parent and child point to the same open-file structure (according to UNIX book). I thought that kernel reclaims an open-file structure only when it's not pointed anymore by descriptors?

The way it works: open(2) opens a file and returns a file descriptor. The use count on that file is incremented. fork(2) replicates the file descriptors such that the parent and child now have their own, and the use count is incremented. At some point, the unlink(2) system call is called. This removes the file name from the specified directory and decrements the use count. Assuming
the file only had a single hard link, the use count is now 2.
Parent closes file. Use count is decremented.
The child can still write.
The child closes file, and the use count becomes zero, and the OS physically
deletes the file.
Note that the system calls are atomic.
--
Jerry Feldman
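The unlink(2) part of the sequence above (the name disappears, but the file remains usable until the last descriptor is closed) can be sketched as follows. The function name and the temporary file template are made up for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Creates a temporary file, unlinks it, then writes msg and reads it back
 * through the descriptor that is still open.
 * Returns the number of bytes read back, or -1 on error. */
ssize_t roundtrip_after_unlink(const char *msg, char *out, size_t outlen)
{
    char path[] = "/tmp/unlink-demo-XXXXXX";   /* hypothetical temp file template */
    ssize_t n = -1;
    int fd;

    assert(msg != NULL && out != NULL);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (unlink(path) != 0)                      /* the directory entry is gone now */
        goto done;
    if (write(fd, msg, strlen(msg)) != (ssize_t)strlen(msg))
        goto done;                              /* writing still works after unlink */
    if (lseek(fd, 0, SEEK_SET) != 0)
        goto done;
    n = read(fd, out, outlen);                  /* and so does reading */
done:
    close(fd);                                  /* last close releases the file */
    return n;
}
```

The same holds when the remaining descriptor lives in another process, for example a forked child that keeps writing after the parent has closed its copy.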
On Wed, 2006-03-29 at 11:33 -0500, Jerry Feldman wrote:
The way it works: open(2) opens a file and returns a file descriptor. The use count on that file is incremented. fork(2) replicates the file descriptors such that the parent and child now have their own, and the use count is incremented. At some point, the unlink(2) system call is called.
No, the syscall is close(), not unlink(). What confused me initially was that I fooled myself into thinking parent and child shared a single fd, which got closed. Obviously this is not the case.
On Wednesday 29 March 2006 11:38 am, Anders Johansson wrote:
On Wed, 2006-03-29 at 11:33 -0500, Jerry Feldman wrote:
The way it works: open(2) opens a file and returns a file descriptor. The use count on that file is incremented. fork(2) replicates the file descriptors such that the parent and child now have their own, and the use count is incremented. At some point, the unlink(2) system call is called.
No, the syscall is close(), not unlink()
What confused me initially was that I fooled myself into thinking parent and child shared a single fd, which got closed. Obviously this is not the case

I added the unlink(2) to the mix to show that you can also write to (or read from) a deleted file. -- Jerry Feldman
participants (6)
- Anders Johansson
- Jerry Feldman
- Jerry Feldman
- Manfred Hollstein
- Steve Graegert
- Verdi March