I'm trying to discover the cause of some long delays in the running of a program.

I'm running a Perl script that is comparing the files in two directories. There are about 150,000 files of a few KB each in each directory, which mostly correspond, but they're not all identical. My script is running diff on each pair. It prints a timestamped line for each pair, and this scrolls up the screen but sometimes stops for several seconds (ten is the most I've noticed).

Is this normal?

As you would expect, the disk activity light is on all the time. The directories are in the same reiserfs filesystem on a 120 GB PATA disk. hdparm -t says 53 MB/s and -T says 2015 MB/s. It's a 2.8 GHz P4 on an Intel 875P mobo with 2 GB memory. I've just upgraded the system to Suse 9.1; kernel 2.6.4-52.

Cheers, Dave
Dave wrote regarding '[SLE] slow reiserfs?' on Thu, Sep 09 at 08:06:
> I'm trying to discover the cause of some long delays in the running of a program.
> I'm running a Perl script that is comparing the files in two directories. There are about 150,000 files of a few KB each in each directory, which mostly correspond, but they're not all identical. My script is running diff on each pair. It prints a timestamped line for each pair, and this scrolls up the screen but sometimes stops for several seconds (ten is the most I've noticed).
> Is this normal?
> As you would expect, the disk activity light is on all the time. The directories are in the same reiserfs filesystem on a 120 GB PATA disk. hdparm -t says 53 MB/s and -T says 2015 MB/s. It's a 2.8 GHz P4 on an Intel 875P mobo with 2 GB memory. I've just upgraded the system to Suse 9.1; kernel 2.6.4-52.
Have you considered reading all the files into memory first, then doing the diff from memory? That would eliminate repetitive disk reads, and probably speed things up. Memory use would spike, then continually decrease, if you read them all in and popped each file to compare off the end of an array.

Anyway, yes, it takes a long time to execute all of the system calls required to open and close a whole bunch of files. Add some kind of progress indicator (print "$filenumber\r") inside one of your loops and see if it stops at around the same position each time. There may just be a point where your program is getting ahead of the disk.

You might be able to put the files into a data structure that marks which have been compared and which haven't, then start comparing files before you've read them all in, while you use some shared memory / threading / a pipe w/ buffer to handle doing some comparisons simultaneously with the remainder of the disk reads...

--Danny
Dave, Danny,

On Thursday 09 September 2004 10:46, Danny Sauer wrote:
> Dave wrote regarding '[SLE] slow reiserfs?' on Thu, Sep 09 at 08:06:
>> I'm trying to discover the cause of some long delays in the running of a program.
>> I'm running a Perl script that is comparing the files in two directories. There are about 150,000 files of a few KB each in each directory, which mostly correspond, but they're not all identical. My script is running diff on each pair. It prints a timestamped line for each pair, and this scrolls up the screen but sometimes stops for several seconds (ten is the most I've noticed).
Dave, you say you're using "diff" to compare the files. Do you need to know exactly _how_ the files differ, or only _that_ they differ? If it's the latter, then use the "cmp" command and you'll cut down on the CPU time consumed. Of course, this will only make the process more disk-bound than it already is, but that's still going to produce some improvement in overall run-time.

Also, I take it the files are compared pair-wise between the two directories, and that you're not comparing all the possible combinations of a file from one directory with another from the other directory. If that's the case, I'd use one of the checksum-generating utilities (sum, cksum, md5sum, sha1sum) to get a signature for all the files and base the comparison process on those signatures.
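The cmp-based pairing could look something like this; dir1, dir2 and the sample files are invented stand-ins for the real directories, and cmp -s suppresses all output, reporting purely through its exit status:

```shell
# Hypothetical stand-ins for the two 150,000-file directories.
mkdir -p dir1 dir2
printf 'hello\n' > dir1/a.txt; printf 'hello\n' > dir2/a.txt   # identical pair
printf 'old\n'   > dir1/b.txt; printf 'new\n'   > dir2/b.txt   # differing pair

# cmp -s stops at the first differing byte: exit 0 = identical, 1 = differ.
for f in dir1/*; do
    g="dir2/${f#dir1/}"
    if cmp -s "$f" "$g"; then
        echo "same  ${f#dir1/}"
    else
        echo "DIFF  ${f#dir1/}"
    fi
done
```

Unlike diff, cmp never has to compute an edit script, so a differing pair costs at most one read up to the first mismatch.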
>> Is this normal?
It depends on what's happening during those pauses. It could be a sign that a recoverable disk error occurred and retries are happening. It could be interference from some other, higher-priority task. It's even conceivable that network activity is getting into the mix and you're experiencing a time-out and retry on something like a DNS, NIS or LDAP transaction. While that may sound far-fetched, the shell does a lot of stuff between commands that is completely unrelated to the essential operations of your script, and surprising as it may seem, network activity is a distinct possibility, especially if you're working in an NFS / NIS / YP / Samba type environment.

Keep in mind, too, that when you use Perl to invoke external commands such as "diff", it may be launching a new shell for each such command, and that shell goes through the full start-up processing (i.e., reading its "$HOME/...rc" file, depending on which shell it is).

If you're running X, you could run XOSview and watch what it does during these pauses. A little more detail can be gained with "top" or one of the "...stat" family of commands (iostat, vmstat or mpstat), which can be made to run continuously with periodic output of incremental statistics.
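One way to sidestep any per-command shell start-up is Perl's list form of system(), which executes the program directly instead of passing a command string to /bin/sh. A small sketch (a.txt and b.txt are made-up file names):

```shell
# Two hypothetical files that differ.
printf 'x\n' > a.txt
printf 'y\n' > b.txt

# With a list of arguments, Perl exec()s diff itself -- no shell in between.
# diff -q exits 0 when the files match and 1 when they differ.
perl -e 'system("diff", "-q", @ARGV); exit($? >> 8)' a.txt b.txt \
    && echo "files match" || echo "files differ"
```

Perl only involves a shell when a single command string contains shell metacharacters, so the list form avoids that cost entirely.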
>> As you would expect, the disk activity light is on all the time. The directories are in the same reiserfs filesystem on a 120 GB PATA disk. hdparm -t says 53 MB/s and -T says 2015 MB/s. It's a 2.8 GHz P4 on an Intel 875P mobo with 2 GB memory. I've just upgraded the system to Suse 9.1; kernel 2.6.4-52.
> Have you considered reading all the files into memory first, then doing the diff from memory? That would eliminate repetitive disk reads, and probably speed things up. Memory use would spike, then continually decrease, if you read them all in and popped each file to compare off the end of an array.
That will only make things worse. Let the kernel do its thing and handle management of main memory and caching of recently used file content. This has been greatly refined over the years, and you probably cannot do better unless your file access patterns are quite unusual.
> Anyway, yes, it takes a long time to execute all of the system calls required to open and close a whole bunch of files. Add some kind of progress indicator (print "$filenumber\r") inside one of your loops and see if it stops at around the same position each time. There may just be a point where your program is getting ahead of the disk. You might be able to put the files into a data structure that marks which have been compared and which haven't, then start comparing files before you've read them all in, while you use some shared memory / threading / a pipe w/ buffer to handle doing some comparisons simultaneously with the remainder of the disk reads...
That demands some kind of threaded or concurrent programming. Linux makes that possible, but it's not easy to do correctly, and it would definitely take one beyond the realm of basic scripting!
> --Danny
Randall Schulz
Randall wrote regarding 'Re: [SLE] slow reiserfs?' on Thu, Sep 09 at 13:15:
> Dave, Danny,
> On Thursday 09 September 2004 10:46, Danny Sauer wrote:
>> Dave wrote regarding '[SLE] slow reiserfs?' on Thu, Sep 09 at 08:06:
>>> I'm trying to discover the cause of some long delays in the running of a program.
>>> I'm running a Perl script that is comparing the files in two directories. There are about 150,000 files of a few KB each in each directory, which mostly correspond, but they're not all identical. My script is running diff on each pair. It prints a timestamped line for each pair, and this scrolls up the screen but sometimes stops for several seconds (ten is the most I've noticed).
> Dave, you say you're using "diff" to compare the files. Do you need to know exactly _how_ the files differ, or only _that_ they differ? If it's the latter, then use the "cmp" command and you'll cut down on the CPU time consumed. Of course, this will only make the process more disk-bound than it already is, but that's still going to produce some improvement in overall run-time.
Either way, it'd be faster to use one of the Perl modules that implement the diff algorithm rather than launching the diff program. If it's just "do they differ", then it'd be quicker to just cmp them line by line...

Calculating a checksum requires reading the whole file, and by the time you've read in the file, you could have been comparing it to the other file and be done. A checksum would only be useful if you were using the contents of the file more than once, in which case it'd cut down on memory consumption quite a bit (though you'd still have to compare the files directly if the checksums matched, given the possibility of multiple files generating the same checksum with some algorithms).

Ignore the whole "read them into memory" thing, though. I read the OP as "comparing the files in the directories" instead of "comparing the files in two directories".

--Danny, who should really change his screen font...
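The in-process comparison Danny describes is roughly what the File::Compare module (shipped with Perl) does: it reads both files in chunks and stops at the first difference, without spawning diff or cmp at all. A minimal sketch with invented file names:

```shell
# Two hypothetical files with identical contents.
printf 'alpha\n' > old.txt
printf 'alpha\n' > new.txt

# File::Compare::compare() returns 0 for identical files, 1 for differing
# ones, and -1 on error; no child process is created.
perl -MFile::Compare -e 'exit(compare(@ARGV))' old.txt new.txt \
    && echo "identical" || echo "different"
```

Run once per pair inside the Perl script itself, this removes the fork/exec cost of 150,000 external diff invocations.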
Dave,

On Thursday 09 September 2004 10:46, Danny Sauer wrote:
>> Dave wrote regarding '[SLE] slow reiserfs?' on Thu, Sep 09 at 08:06:
>>> I'm trying to discover the cause of some long delays in the running of a program.
>>> I'm running a Perl script that is comparing the files in two directories. There are about 150,000 files of a few KB each in each directory, which mostly correspond, but they're not all identical. My script is running diff on each pair. It prints a timestamped line for each pair, and this scrolls up the screen but sometimes stops for several seconds (ten is the most I've noticed).
> ....
> Also, I take it the files are compared pair-wise between the two directories, and that you're not comparing all the possible combinations of a file from one directory with another from the other directory. If that's the case, I'd use one of the checksum-generating utilities (sum, cksum, md5sum, sha1sum) to get a signature for all the files and base the comparison process on those signatures.
I didn't phrase that too well. The distinction I meant to draw was between comparisons in which each file from one directory is paired with a single file from the other directory vs. creating all the pairs possible between each file in one directory and all the files in the other directory.

In other words, if there are N files in each directory, is it N file comparisons or N-squared file comparisons? If it's N-squared, then you want to use the signature-based technique. It will cut the run-time vastly over naively performing all the comparisons. ...
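For the N-squared case, the signature technique boils down to hashing each file once (2N reads in total) and then matching hashes instead of re-reading contents. A rough sketch with md5sum; the directory names and sample files are invented:

```shell
# Hypothetical sample trees.
mkdir -p left right
printf 'shared\n' > left/f1;  printf 'shared\n' > right/g7   # same content, different names
printf 'only-a\n' > left/f2;  printf 'only-b\n' > right/g8   # unique contents

# Hash every file once; any checksum that occurs more than once marks
# content present in both trees (here, the hash shared by left/f1 and right/g7).
md5sum left/* right/* | awk '{print $1}' | sort | uniq -d
```

As Danny points out, a matching checksum is only probable equality, so a final cmp on candidate pairs is still prudent.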
> --Danny
Randall Schulz
participants (3)
- Danny Sauer
- Dave Howorth
- Randall R Schulz