[opensuse] How to remove binary garbage from the front of a binary file??
All,

I don't know the origin of these files, but I have 100GB of corrupted PST files.

From what I can tell, some sort of processing / extraction tool went haywire and prepended binary junk to the front of the real data. The actual start of the data is a header with !BDN as the first 4 chars.

The prepended junk, from what I've seen, is roughly 10-500 binary octets (chars). It is sort of like RAM slack, but at the start of the files. (No idea how that happened.)

If I knew for certain that the binary junk didn't contain any newlines, this sed script would get rid of the junk:

    find . -name \*.pst -exec sed -e '1s/^.*!BDN/!BDN/' -i "{}" \;

I know I can write a program to do the same but working in binary and not worrying about intervening newlines.

Is there a relatively straightforward way to accomplish the above?

FYI: I'm going to try to get replacement uncorrupted data files as well, but that might be easier said than done.

Thanks
Greg
--
Greg Freemyer
Advances are made by answering questions. Discoveries are made by questioning answers. - Bernard Haisch
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
Hi,

On 11.08.2018 at 01:00, Greg Freemyer wrote:
> From what I can tell some sort of a processing / extraction tool went haywire and prepended binary junk in the front of the real data. The actual start of the data is a header with !BDN as the first 4 chars.

Sounds like the perfect occasion for using a scripting language like Perl:

- read the file
- find the first occurrence of "!BDN"
- write everything from that point on to a new file

I'm not an expert with Perl, but this should be an easy thing.

Karl
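The three steps above can be sketched roughly as follows. This is only an illustration in Python rather than Perl; the function name `strip_junk` and the output path argument are made up for the example, and it assumes each file fits in memory.

```python
from pathlib import Path

MARKER = b"!BDN"  # header signature named in the original question


def strip_junk(src: str, dst: str, marker: bytes = MARKER) -> int:
    """Copy src to dst, starting at the first occurrence of marker.

    Returns the number of junk bytes dropped. Raises ValueError if the
    marker is not present, so a file is never silently emptied.
    """
    data = Path(src).read_bytes()         # step 1: read the file as raw bytes
    offset = data.find(marker)            # step 2: first occurrence of the marker
    if offset < 0:
        raise ValueError(f"{marker!r} not found in {src}")
    Path(dst).write_bytes(data[offset:])  # step 3: marker plus everything after it
    return offset
```

Because the file is handled as bytes, newlines or NULs inside the junk make no difference.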
Hello,

On Sat, 11 Aug 2018, Karl Sinn wrote:
> sounds like the perfect occasion for using some scripting language like perl:
> - read the file
> - find the first occurrence of "!BDN"
> - write all the rest in a new file
>
> I'm not an expert with perl, but this should be an easy thing.

It's not that easy. First, I recommend reading

    $ perldoc -f binmode

very carefully, and second:

    $ perldoc -f read
    $ perldoc -f write
    $ perldoc -f sysread
    $ perldoc -f syswrite

Or call dd with the proper offset for the 'skip=' parameter. I'd recommend using a hex editor suited to large files, such as vche.

HTH,
-dnh
--
This is sick. Count me in. -- Arvid G.
Dunno if this was handled yet, but:

Karl Sinn wrote:
> sounds like the perfect occasion for using some scripting language like perl:
> - read the file
> - find the first occurrence of "!BDN"
> - write all the rest in a new file
>
> I'm not an expert with perl, but this should be an easy thing.

Pretty much:

    #!/usr/bin/perl
    use warnings; use strict;

    # Process every file named on the command line.
    while (@ARGV) {
        my $fname   = shift @ARGV;
        my $outname = $fname . ".out";
        my $sig     = "!BDN\n";
        my $file;
        open(my $h, "<:raw", $fname) or die "cannot open $fname:$!";
        $/ = undef;                    # slurp mode: read the whole file at once
        $file = <$h>;
        # Greedy match: removes everything up to and including the LAST
        # "!BDN\n", so the marker itself is not kept in the output.
        $file =~ s/^.*\Q$sig\E//s;
        close $h;
        open(my $ho, ">:raw", $outname) or die "cannot open $outname for out:$!";
        print $ho $file;
        close $ho;
    }

Name it something (rmjnk, say), then in your dir of junk:

    rmjnk *.PST

It will put the uncorrupted stuff in <file>.PST.out, 1 file for each input file.

For a 186M file:

    > time /tmp/c junkryo.mkv
    1.21sec 1.01usr 0.20sys (99.96% cpu)

cmp is happy:

    > cmp ryo.mkv junkryo.mkv.out

The test file was created with:

    printf "!BDN\n" >bdn
    cat ../palemoon-loginfail-trace.txt bdn ryo.mkv >junkryo.mkv

It assumes you have enough memory to hold a file for processing.
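If holding a whole file in memory is a concern, a chunked variant is one option. The sketch below is in Python, not Perl, and differs from the script above in two ways that are assumptions about what Greg needs: it keeps the "!BDN" marker (as his sed command does) and it stops at the first occurrence. The name `strip_junk_streaming` and the 64 KiB chunk size are arbitrary.

```python
import shutil


def strip_junk_streaming(src, dst, marker=b"!BDN", chunk_size=64 * 1024):
    """Stream src to dst from the first marker onward.

    Returns True if the marker was found; dst is only created in that case.
    """
    if not marker:
        raise ValueError("marker must not be empty")
    keep = len(marker) - 1  # bytes carried over so a split marker is still seen
    tail = b""
    with open(src, "rb") as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                return False                  # reached EOF without a match
            window = tail + chunk
            idx = window.find(marker)
            if idx >= 0:
                with open(dst, "wb") as fout:
                    fout.write(window[idx:])  # marker and the rest of this window
                    shutil.copyfileobj(fin, fout, chunk_size)  # remainder of file
                return True
            tail = window[-keep:] if keep else b""
```

The carried-over tail is what lets the search find a marker that straddles two reads.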
On 08/10/2018 06:00 PM, Greg Freemyer wrote:
> I know I can write a program to do the same but working in binary and not worrying about intervening newlines.
>
> Is there a relatively straight forward way to accomplish the above?

I'm not sure scripting is your friend here. In C it is straightforward to look for your "!BDN" mark and copy it and the rest of the file to a new output file (say original_name+SUFFIX). Your mark "!BDN" must be ASCII or UTF-8 and not some strange multi-byte character set that just looks like "!BDN" on the terminal.

If your PST files are hundreds of megabytes each, then rather than a read/write to the end of the PST, it would be better to mmap the file or use sendfile.

A short implementation is attached. Just compile it and its usage is:

    $ progname <input file> [mark (default !BDN)]

where progname is whatever you compile it to, <input file> is the file to search for the mark, and the optional second argument is the mark to find. If no arguments are given, it will read from stdin, searching for the default mark.

The default new filename will be "input file_from_data_mark", or "stdin_from_data_mark" if reading stdin. Compile instructions are at the top of the file.

**Example Input File**

    $ cat dat/psttest
    unwanted!crap!BDNwanted!PSTdata

**Example Use/Output**

    $ ./bin/extract_from_mark dat/psttest
    $ cat dat/psttest_from_data_mark
    !BDNwanted!PSTdata

--
David C. Rankin, J.D.,P.E.
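The attached C program is not reproduced in this thread. As a rough illustration of the mmap idea only, here is a sketch in Python; it is not David's implementation, the function name `strip_junk_mmap` is hypothetical, and sendfile is not shown.

```python
import mmap
import os
import shutil


def strip_junk_mmap(src, dst, marker=b"!BDN"):
    """Map src read-only, locate the marker, and copy from there to dst.

    Returns True if the marker was found and dst was written.
    """
    with open(src, "rb") as fin:
        if os.fstat(fin.fileno()).st_size == 0:
            return False                 # an empty file cannot be mapped
        with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = mm.find(marker)     # search the mapping, no explicit reads
            if offset < 0:
                return False
            mm.seek(offset)              # position the mapping at the marker
            with open(dst, "wb") as fout:
                shutil.copyfileobj(mm, fout)  # copy marker and everything after
    return True
```

Run against a file holding the example input above, this should leave `!BDNwanted!PSTdata` in the output file.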
David,

That was fantastically nice of you to put that together for me.

Thank you,
Greg
--
Greg Freemyer
On 08/11/2018 08:42 PM, Greg Freemyer wrote:
> David,
>
> That was fantastically nice of you to put that together for me.

Sure Greg,

Glad to help. I have helped with C questions on StackOverflow over the years and have a number of small bits of code that can be sewn together to help. That's the fun part of the day.

Let me know if it needs tweaking or if speed is an issue on large files, and I can pop a sendfile or mmap implementation in to help speed up the copy of the good data to a new file. Either can provide a solid reduction in copy time if the good-data parts of the files are larger than tens or hundreds of megabytes.

As is, C uses a BUFSIZ read buffer for basic file I/O to read data in good-sized chunks to begin with (8192 bytes on Linux/512 bytes on windoze). So if this is a one-pass deal, the performance gains aren't needed. If this is a continual issue and you may need to scan multiple files every 3 seconds, etc., then it's worth the additional code.

--
David C. Rankin, J.D.,P.E.
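For the one-pass case over a whole directory tree (the job Greg's original find command was aiming at), a plain buffered copy is probably enough. The following Python sketch is an assumption-laden illustration, not anyone's posted code: the names `fix_in_place` and `fix_tree`, the 4096-byte search window (chosen because the junk was reported as roughly 10-500 bytes) and the 1 MiB copy buffer are all made up here. The rewritten file gets the temp file's permissions, and the pattern is case-sensitive on Linux.

```python
import os
import shutil
import tempfile
from pathlib import Path


def fix_in_place(path, marker=b"!BDN", window=4096):
    """Rewrite one file without its junk prefix; return the bytes removed."""
    path = Path(path)
    with open(path, "rb") as fin:
        offset = fin.read(window).find(marker)
        if offset <= 0:        # marker not in the window, or file already clean
            return 0
        fin.seek(offset)
        # Temp file lives next to the original so os.replace stays on one filesystem.
        fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        with os.fdopen(fd, "wb") as fout:
            shutil.copyfileobj(fin, fout, 1024 * 1024)
    os.replace(tmp, path)      # swap the cleaned copy into place
    return offset


def fix_tree(root, pattern="*.pst"):
    """Apply fix_in_place to every matching file under root."""
    return {str(p): fix_in_place(p) for p in sorted(Path(root).rglob(pattern))}
```

Files where the marker is missing from the first `window` bytes are left untouched, so a second look at those by hand would still be needed.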
Greg Freemyer wrote:
> David,
>
> That was fantastically nice of you to put that together for me.

Totally... The first thing I started doing in C++ when I got back to it was developing a util library to implement common stuff I wanted from perl. Otherwise it was just way too much typing and reinventing the wheel. C was worse.

BTW, FWIW, that was with perl 5.16... the last version before the great incompat problems introduced in the Ricardo generation. His last gift: adding yet more sigils to vars when not needed -- pin the tail on the variable and get rid of implicit use of the typed references already in perl. During his tenure perl use decreased by about 75% (not that they hadn't been warned... fat lotta good that did).
On 08/11/2018 01:00 AM, Greg Freemyer wrote:
> I know I can write a program to do the same but working in binary and not worrying about intervening newlines.
>
> Is there a relatively straight forward way to accomplish the above?

    # Create a binary test file with a '!BDN' marker.
    $ { cat /usr/bin/cat; printf '!BDN'; cat /usr/bin/cat; } > testfile

    # Use awk(1) to print everything after and including the marker.
    # (RS is not part of $0, so it is printed back explicitly.)
    $ awk 'BEGIN{RS="!BDN";ORS=""} {if(n++)print RS $0}' < testfile > testfile2

Have a nice day,
Berny
On Sun, 12 Aug 2018 13:00:32 +0200 Bernhard Voelker <mail@bernhard-voelker.de> wrote:
> # Create a binary test file with a '!BDN' marker.
> $ { cat /usr/bin/cat; printf '!BDN'; cat /usr/bin/cat; } > testfile
>
> # Use awk(1) to print everything after and including the marker.
> $ awk 'BEGIN{RS="!BDN";ORS=""} {if(n++)print RS $0}' < testfile > testfile2

Excellent :)

> Have a nice day,
> Berny

You just made mine!
On 08/12/2018 01:00 PM, Bernhard Voelker wrote:
> # Use awk(1) to print everything after and including the marker.
> $ awk 'BEGIN{RS="!BDN";ORS=""} {if(n++)print RS $0}' < testfile > testfile2

Or use 'grep --byte-offset --binary-files=text --only-matching --fixed-strings' to find the marker (stripping off the matched pattern after and including the ':', and keeping only the first match):

    $ n="$( grep -baoF '!BDN' testfile | head -n 1 | cut -d: -f1 )"

and then let dd copy everything from that position (with iflag=skip_bytes, skip= is counted in bytes, so the block size can stay large):

    $ dd \
        if=testfile \
        of=testfile2-dd \
        status=none \
        bs=1M \
        iflag=skip_bytes \
        skip="$n"

Have a nice day,
Berny
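The same two-step idea (report byte offsets, then copy from an offset) can be sketched without grep and dd. This Python version is illustrative only: `marker_offsets` and `copy_from` are made-up names, and unlike grep it only looks at the first `window` bytes, on the assumption that the junk prefix is short.

```python
import shutil


def marker_offsets(path, marker=b"!BDN", window=4096):
    """Return the byte offset of every marker found in the first `window` bytes."""
    with open(path, "rb") as f:
        head = f.read(window)
    offsets = []
    start = head.find(marker)
    while start >= 0:
        offsets.append(start)
        start = head.find(marker, start + 1)  # keep scanning past this match
    return offsets


def copy_from(path, out_path, skip, bufsize=1024 * 1024):
    """Copy path to out_path, skipping the first `skip` bytes."""
    with open(path, "rb") as fin, open(out_path, "wb") as fout:
        fin.seek(skip)
        shutil.copyfileobj(fin, fout, bufsize)
```

Listing all offsets first makes it easy to spot files where the marker shows up more than once near the start before anything is rewritten; `copy_from(path, out, marker_offsets(path)[0])` then does the actual cut.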
participants (7)
- Bernhard Voelker
- Dave Howorth
- David C. Rankin
- David Haller
- Greg Freemyer
- Karl Sinn
- L A Walsh