
Hello everyone,

How could I create an md5sum file calculated from the files stored in a directory (like the ones found on FTP sites)?

Thanks for your help.

-- Laurent Renard

On Friday 21 January 2005 11:02, Laurent Renard wrote:
| Hello everyone,
|
| How could I create an md5sum file calculated from the files stored in a
| directory (like the ones found on FTP sites)?

Non-recursive:

$ find <directory> -maxdepth 1 -type f | xargs md5sum > checksums.txt

Recursive:

$ find <directory> -type f | xargs md5sum > checksums.txt

Regards,

-- Lars Haugseth
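
A possible follow-up, assuming GNU md5sum (this step is not shown in the thread): the resulting checksums.txt can later be verified with md5sum's check mode, run from the same directory the paths were recorded relative to:

$ md5sum -c checksums.txt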

Lars wrote regarding 'Re: [SLE] Md5sum file from dir' on Fri, Jan 21 at 04:29:
If you just want all of the files in a dir, just do

$ md5sum * > checksums.txt

It'll skip directories (and warn on the command line for each skipped dir, unless you add a 2>/dev/null). No need to fire up find *and* xargs in that case. :)

--Danny, who generally prefers -exec to piping through xargs...
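
For illustration, the variant with the skipped-directory warnings suppressed, as mentioned above:

$ md5sum * 2>/dev/null > checksums.txt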

Danny, Laurent, On Friday 21 January 2005 07:32, Danny Sauer wrote:
But there is a limit on how many characters of arguments can be passed in a single command invocation. The whole point of xargs is to deal with that limit. So if you use the short version of the command suggested by Danny in a sufficiently large directory (where "large" is measured both by the number of entries, including entries md5sum will reject such as directories, and by the length of their names), it will fail and the xargs approach becomes necessary.
| --Danny, who generally prefers -exec to piping through xargs...
And when you use "-exec", a fork/exec pair happens for _every find hit_! The overhead is considerable and becomes noticeable for even moderately populated directories or directory hierarchies.

Randall Schulz
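
To make the limit Randall describes concrete (the directory name here is hypothetical and the exact limit varies by system):

$ getconf ARG_MAX        # the kernel's limit on total argument + environment size
$ md5sum /srv/ftp/pub/*  # with enough (or long enough) names, this fails with "Argument list too long"
$ find /srv/ftp/pub -maxdepth 1 -type f | xargs md5sum > checksums.txt
                         # xargs splits the name list into chunks that fit under the limit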

Randall wrote regarding 'Re: [SLE] Md5sum file from dir' on Fri, Jan 21 at 10:44:
True, but in most common uses, the much shorter "md5sum *" is quite a bit easier to type - both in terms of speed and accuracy. In the border cases where there are a whole lot of entries in a directory, xargs will definitely save you.
I'm looking at the file xargs.c right now. Line 766 from findutils 4.1.20 - the beginning of "do_exec":

---
  while ((child = fork ()) < 0 && errno == EAGAIN && procs_executing)
    wait_for_proc (false);
  switch (child)
    {
    case -1:
      error (1, errno, _("cannot fork"));

    case 0:  /* Child. */
      execvp (cmd_argv[0], cmd_argv);
      error (0, errno, "%s", cmd_argv[0]);
      _exit (errno == ENOENT ? 127 : 126);
    }
  add_proc (child);
}
---

It's essentially the same code in find (function launch in pred.c). Then, why is xargs faster?

Oh, wait, looking closer... It appears that xargs builds a command line up to the maximum length allowed on a system, and then runs that. I always thought that xargs ran a new instance of the exec()'d program for each individual argument. Hmph. I guess that'd explain why it takes 8-10 seconds to do all 2642 files in the samba3 source dir (which was convenient at the time) using -exec, while it only takes about 2 seconds to do the same thing with xargs.

The -exec method is more flexible, though, and will work with programs that only accept one file as an arg, etc. I guess it's just a matter of choosing the right tool for the job, and since -exec always works, I tend to go that way. It's easier to remember one thing than two. :)

--Danny, who learned something today
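
A rough way to reproduce Danny's comparison, assuming a GNU userland (the directory name and timings are illustrative only):

$ time find samba-3 -type f -exec md5sum {} \; > /dev/null   # one fork/exec per file
$ time find samba-3 -type f | xargs md5sum > /dev/null       # a few execs, batched by xargs
$ find samba-3 -type f | xargs -n 1 md5sum > /dev/null       # forces one exec per file, mimicking -exec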

Danny, On Friday 21 January 2005 09:49, Danny Sauer wrote:
True enough. But another thing to keep in mind, at least in general, is that not all systems that support the GNU tools (find and xargs, e.g.) implement copy-on-write fork semantics. On such systems the performance disparity between the -exec and xargs approaches is far greater than what you'd experience on a Linux system. Cygwin, for example, is such a GNU tools platform.
| --Danny, who learned something today
So... It's a _good_ day, right?

Randall Schulz

Lars wrote regarding 'Re: [SLE] Md5sum file from dir' on Fri, Jan 21 at 13:05:
I normally just reimplement that kind of behavior in perl - which is my standard solution when the shell won't do what I want. :) I've never actually used xargs, except when I was benchmarking it against find. Though, having read the man page now, I think I may have a use for the process control ability - the -P option - where I've wanted to make a few jobs run in parallel, but I haven't gotten around to actually implementing the logic necessary to keep track of fork()'d children...

Darn you people. I clearly said that I didn't want to have to remember more than one thing, but this discussion has forced me to not only read and understand the source for xargs, but also to consider rewriting several maintenance scripts. I already had enough to do!

Randall, does that answer your question as to whether this is a good day or not? :)

--Danny, happy to have some new "interesting" work that'll be good to break up the more mundane.
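
A sketch of the parallel use Danny is considering (the -P and -n values are arbitrary; with parallel jobs the checksum lines may come out in a different order):

$ find <directory> -type f -print0 | xargs -0 -P 4 -n 32 md5sum > checksums.txt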

U,

On Friday 21 January 2005 13:38, user86 wrote:
| This is what I do in such circumstances:
|
| % find ... | tr '\n' '\0' | xargs -0 cmd fixedArgs...

Lars' solution is probably better: simpler, cleaner, fewer processes, etc.

Randall Schulz
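
For reference, GNU find can produce the NUL-delimited list directly, which avoids the extra tr process and also copes with names that contain newlines:

$ find <directory> -type f -print0 | xargs -0 md5sum > checksums.txt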

Participants (5):
- Danny Sauer
- Lars Haugseth
- Laurent Renard
- Randall R Schulz
- user86