Randall wrote regarding 'Re: [SLE] Md5sum file from dir' on Fri, Jan 21 at 10:44:
Danny, Laurent,
On Friday 21 January 2005 07:32, Danny Sauer wrote:
Lars wrote regarding 'Re: [SLE] Md5sum file from dir' on Fri, Jan 21 at 04:29:
On Friday 21 January 2005 11:02, Laurent Renard wrote:
| Hello everyone,
|
| how could I make an md5sum file calculated from files stored in a
| directory (like on ftp's ...)?
Non-recursive:
$ find <directory> -maxdepth 1 -type f | xargs md5sum > checksums.txt
(note: -maxdepth should come before -type f, or GNU find will warn)
If you just want all of the files in a dir, just do md5sum * > checksums.txt It'll skip directories (and warn on the command line for each skipped dir, unless you add a 2>/dev/null). No need to fire up find *and* xargs in that case. :)
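For reference, here is an editor's sketch (not from the thread) of that one-liner in a scratch directory; all paths and filenames are illustrative:

```shell
# Set up a throwaway directory with two files and one subdirectory.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/a.txt"
printf 'world\n' > "$dir/b.txt"
mkdir "$dir/sub"
cd "$dir"

# md5sum warns on stderr for each directory it skips; 2>/dev/null hides
# that. md5sum also exits nonzero because of the skipped directory,
# hence the || true.
md5sum * > checksums.txt 2>/dev/null || true
```

Note that checksums.txt itself is not included in the sums: the shell expands the glob before it performs the redirection that creates the file.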
But there is a limit on how many characters of arguments can be passed in a single command invocation. The whole point of xargs is to deal with that limit.
So if you use the short version of the command suggested by Danny in a sufficiently large directory (where "large" means both the number of entries, including unwanted ones such as directories, and the length of the names), it will fail, and the xargs approach becomes necessary.
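To make the limit concrete, here is an editor's sketch (not from the thread): ARG_MAX is the kernel's cap on the combined size of arguments and environment passed to a single exec(), and the find/xargs pipeline batches file names so no one md5sum invocation exceeds it. Paths below are illustrative; -print0/-0 are GNU extensions.

```shell
# Report the per-exec argument/environment size limit on this system.
getconf ARG_MAX

# Scratch directory with two files and a subdirectory to be skipped.
dir=$(mktemp -d)
touch "$dir/one" "$dir/two"
mkdir "$dir/subdir"

# find streams names to xargs, which packs as many as fit per md5sum run;
# -print0/-0 keep names containing spaces or newlines intact.
sums=$(mktemp)
find "$dir" -maxdepth 1 -type f -print0 | xargs -0 md5sum > "$sums"
```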
True, but in most common uses, the much shorter "md5sum *" is quite a bit easier to type - both in terms of speed and accuracy. In the border cases where there are a whole lot of entries in a directory, xargs will definitely save you.
--Danny, who generally prefers -exec to piping through xargs...
And when you use "-exec", a fork/exec pair happens for _every find hit_! The overhead is considerable and becomes noticeable for even moderately populated directories or directory hierarchies.
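As an editor's aside (not from the thread): find's "-exec ... +" form batches arguments much like xargs does, avoiding one fork/exec per file, so the two styles can be compared directly. Paths below are illustrative.

```shell
# Scratch directory with three small files.
dir=$(mktemp -d)
printf 'x\n' > "$dir/a"
printf 'y\n' > "$dir/b"
printf 'z\n' > "$dir/c"

# "\;" runs one md5sum process per file: a fork/exec pair for every hit.
per_file=$(mktemp)
find "$dir" -type f -exec md5sum {} \; > "$per_file"

# "+" gathers as many names as fit into a single md5sum invocation.
batched=$(mktemp)
find "$dir" -type f -exec md5sum {} + > "$batched"
```

Both produce the same checksums; only the number of processes differs.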
I'm looking at the file xargs.c right now. Line 766 from findutils 4.1.20 - the beginning of "do_exec":
---
while ((child = fork ()) < 0 && errno == EAGAIN && procs_executing)
  wait_for_proc (false);
switch (child)
  {
  case -1:
    error (1, errno, _("cannot fork"));
  case 0:   /* Child. */
    execvp (cmd_argv[0], cmd_argv);
    error (0, errno, "%s", cmd_argv[0]);
    _exit (errno == ENOENT ? 127 : 126);
  }
add_proc (child);
}
---
It's essentially the same code in find (function launch in pred.c). Then why is xargs faster?

Oh, wait, looking closer... It appears that xargs builds a command line up to the system's maximum length, and then runs that. I always thought that xargs ran a new instance of the exec()'d program for each individual argument. Hmph.

I guess that'd explain why it takes 8-10 seconds to do all 2642 files in the samba3 source dir (which was convenient at the time) using -exec, while it only takes about 2 seconds to do the same thing with xargs.

The -exec method is more flexible, though, and will work with programs that only accept one file as an arg, etc. I guess it's just a matter of choosing the right tool for the job, and since -exec always works, I tend to go that way. It's easier to remember one thing than two. :)

--Danny, who learned something today
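Danny's realization - that xargs packs a whole command line rather than running one exec per argument - is easy to see directly. An editor's illustration (not from the thread), using echo as a stand-in for md5sum:

```shell
# Default behavior: xargs packs all three words into a single echo
# invocation, so the output is one line.
one_run=$(printf '%s\n' a b c | xargs echo | wc -l)

# -n 1 forces one invocation per argument, mimicking "-exec ... \;",
# so the output is three lines.
per_arg=$(printf '%s\n' a b c | xargs -n 1 echo | wc -l)
```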