Jan, Istvan,

On Friday 18 August 2006 01:53, Jan Engelhardt wrote:
I'd like to back up my files. I have a zillion directories with a zillion subdirs, and it's possible that some files are duplicated. I would like to find them before backing up. I found software that can do this, but it's not free (zsDuplicateHunter). Do you know of one that is open source and available for SUSE?
find /dir -type f -print0 | xargs -0 md5sum | sort
should do the trick.
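The sort step brings identical checksums onto adjacent lines, which is what makes the duplicates easy to spot. A sketch of what the output might look like, with invented hashes and paths:

  % find /dir -type f -print0 | xargs -0 md5sum | sort   # illustrative run; hashes and paths are made up
  900150983cd24fb0d6963f7d28e17f72  /dir/notes/todo.txt
  d41d8cd98f00b204e9800998ecf8427e  /dir/old/intro-to-x.pdf
  d41d8cd98f00b204e9800998ecf8427e  /dir/papers/intro-to-x.pdf

The two lines sharing a checksum are the same contents stored twice.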
To which I'd add one more pipeline stage:

  |uniq -D -w 32

Hint:

  % uniq --help |egrep -e '-D|-w'
  ...
  -D, --all-repeated[=delimit-method]   print all duplicate lines
  ...
  -w, --check-chars=N                   compare no more than N characters in lines
  ...

By adding this you'll see only the duplicated entries.

I ran this on my publications directory (where I keep downloaded academic papers, mostly), and out of 5,577 files occupying 1.4 GB it found 432 distinct duplicated files, with 954 duplicate copies among them.

So in honor of its usefulness, I've created this script (to be elaborated later):

~/bin/duploc:
-==--==--==--==--==--==--==--==--==--==--==--==--==--==--==-
#!/bin/bash --norc

find "$@" -type f -print0 \
  |xargs -0 md5sum \
  |sort \
  |uniq -D -w 32
-==--==--==--==--==--==--==--==--==--==--==--==--==--==--==-

Note that as written you can give find options (-follow, e.g.) after the directory names.
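For completeness, a hypothetical run of the script (the directory name and the -follow option are just examples):

  % chmod +x ~/bin/duploc
  % ~/bin/duploc ~/publications -follow   # anything after the directory is passed straight through to find

It prints every file whose MD5 checksum matches at least one other file's, grouped checksum by checksum, so you can weed out the extra copies before starting the backup.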
Jan Engelhardt
Randall Schulz