All,

I have a huge text file (1.7 million lines) full of Unicode and ASCII text (about half and half).

Note: disk space for copies is not a problem if I need to manipulate this file.

Also, I have a 30-line file full of ASCII text. I need to search the large file for any occurrences of the keywords in the 30-line file.

Ignoring the Unicode issue, I could use grep (or fgrep, egrep) with appropriate args. I have no idea how to handle the Unicode issue.

Any suggestions?

Thanks
Greg

--
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf
The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
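[Editor's note: ignoring the encoding question, the keyword search itself can be sketched like this; the filenames and contents are stand-ins for the real 30-line keyword list and the 1.7-million-line file.]

```shell
# Stand-in keyword list (one keyword per line) and target file.
printf 'invoice\ncontract\n' > keywords.txt
printf 'line with invoice\nnothing here\nsigned CONTRACT\n' > big.txt

# -F: treat patterns as fixed strings (no regex surprises)
# -f: read the patterns from a file
# -i: case-insensitive match
grep -F -i -f keywords.txt big.txt
```

With the sample data above this prints the first and third lines; fgrep is the historical spelling of grep -F.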
Greg Freemyer wrote:
All,
I have a huge text file (1.7 million lines) full of unicode and ascii text (half and half).
note: disk space for copies is not a problem, if I need to manipulate this file
Also, I have a 30 line file full of ascii text.
I need to search the large file for any occurrences of the keywords in the 30 line file.
Ignoring the unicode issue, I could use grep (or fgrep, egrep) with appropriate args.
I have no idea how to handle the unicode issue.
The same way, but be prepared to use \123-style escapes in your patterns.
Any suggestions?
Handle what, specifically? (I'm not sure what you mean -- grepping Unicode, or converting it to ASCII, or what.)
Thanks Greg
Greg Freemyer wrote:
All,
I have a huge text file (1.7 million lines) full of unicode and ascii text (half and half).
note: disk space for copies is not a problem, if I need to manipulate this file
Also, I have a 30 line file full of ascii text.
I need to search the large file for any occurrences of the keywords in the 30 line file.
Ignoring the unicode issue, I could use grep (or fgrep, egrep) with appropriate args.
I have no idea how to handle the unicode issue.
ASCII is a subset of UTF-8, so if your Unicode encoding is UTF-8 there should be no problem. Otherwise I wonder how you managed to create a file with interspersed single-byte and (fixed-size) multibyte encodings... I guess you would have to split the file somehow.

Best regards
Petr
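[Editor's note: a quick sketch of the "ASCII is a subset of UTF-8" point; the sample file is made up for illustration.]

```shell
# A UTF-8 file containing a multibyte character (e with acute accent,
# bytes \303\251) alongside plain ASCII lines.
printf 'caf\303\251 menu\nkeyword here\n' > utf8.txt

# Plain-ASCII keywords can be grepped directly, no conversion needed:
# every ASCII byte means the same thing in UTF-8.
grep -c keyword utf8.txt
```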
On Feb 19, 2008 8:01 AM, Petr Cerny
Greg Freemyer wrote:
All,
I have a huge text file (1.7 million lines) full of unicode and ascii text (half and half).
note: disk space for copies is not a problem, if I need to manipulate this file
Also, I have a 30 line file full of ascii text.
I need to search the large file for any occurrences of the keywords in the 30 line file.
Ignoring the unicode issue, I could use grep (or fgrep, egrep) with appropriate args.
I have no idea how to handle the unicode issue.
ASCII is a subset of UTF-8, so if your Unicode encoding is UTF-8 there should be no problem. Otherwise I wonder how you managed to create a file with interspersed single-byte and (fixed-size) multibyte encodings... I guess you would have to split the file somehow.
Best regards Petr
You can see I'm a bit confused. I have not worked much with Unicode, and not at all on Linux. To me the Unicode file looks like normal ASCII with a null between every byte; I think that's normal, from my limited exposure to Unicode.

Status: I'm not at the office, but they recreated the 1.7 million line file last night as pure Unicode. I can't test it for a couple of hours, but is the claim that I can use grep to search a Unicode file for an ASCII text pattern?

Yesterday I looked for a grep switch that would tell it the file being searched was Unicode, but I didn't see one. I guess it could be automatic? I did not actually attempt it.

If that is not feasible, a tool to convert that Unicode file into an ASCII file would be great. All the text I am interested in should convert to ASCII in the obvious way; no need for any mappings. Basically, I just need to get rid of the nulls.

Greg
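[Editor's note: a minimal sketch of the "just get rid of the nulls" approach, assuming the file really is ASCII text stored as UTF-16LE; the filenames and sample text are placeholders.]

```shell
# Build a tiny UTF-16LE sample (ASCII bytes each followed by a NUL),
# standing in for the real exported file.
printf 'hello keyword world\n' | iconv -f ASCII -t UTF-16LE > unicode.txt

# Quick-and-dirty conversion: delete the NUL bytes, leaving plain ASCII.
tr -d '\000' < unicode.txt > ascii.txt

# The ASCII pattern now matches.
grep -c keyword ascii.txt
```

This only works cleanly if the file contains nothing outside ASCII; anything genuinely multibyte should go through iconv instead, as described below in the thread.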
Greg Freemyer wrote:
You can see I'm a bit confused. I have not worked much with Unicode, and not at all on Linux. To me the Unicode file looks like normal ASCII with a null between every byte. I think that's normal from my limited exposure to Unicode.
Status: I'm not at the office, but they recreated the 1.7 million line file last night as pure Unicode.
This is good.
I can't test it for a couple hours, but is the claim:
I can use grep to search a Unicode file for an ASCII text pattern?
Very likely not. Convert it and then grep (see below).
Yesterday I looked for a grep switch that would tell it the file being searched was Unicode but I didn't see one. I guess it could be automatic? I did not actually attempt it.
If that is not feasible, a tool to convert that Unicode file into an ASCII file would be great. All the text I am interested in should convert to ASCII in the obvious way. No need for any mappings. Basically, just need to get rid of the nulls.
iconv is your friend. Basically you would need to do something like:

iconv -f some_unicode_encoding -t ascii < unicode_file > ascii_file

This will choke on any non-ASCII characters (unless you make it transliterate unknown characters by appending "//TRANSLIT" to the output encoding). You are probably looking at UCS-2LE if the NULL is the second byte of each pair (Little Endian), or UCS-2BE if the NULL comes first (Big Endian).

As for a basic overview: Unicode is a standard for encoding national characters in multiple bytes. Several encodings (all of them "Unicode") exist:
1) UTF-8, which encodes characters in 1-4 bytes (e.g. 'a' takes 1 byte, while some Chinese characters may need, say, 3 bytes);
2) UTF-16, which uses 2 or 4 bytes;
3) UCS-2LE and UCS-2BE, which use 2 bytes per character; for ASCII characters one of these bytes is always null (which one depends on whether it is the Big Endian or Little Endian variant);
4) UCS-4/UTF-32 (which differ slightly), which use 4 bytes per character.

For more see e.g.: http://en.wikipedia.org/wiki/Unicode

Best regards
Petr
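[Editor's note: the iconv suggestion above can be sketched end to end, assuming the file is UTF-16LE with a leading byte-order mark as Windows tools often write; the sample data is made up.]

```shell
# Byte-order mark for UTF-16LE (bytes FF FE), then sample UTF-16LE content.
printf '\377\376' > big16.txt
printf 'alpha keyword beta\nplain line\n' | iconv -f ASCII -t UTF-16LE >> big16.txt

# Convert to UTF-8 (ASCII-compatible) and search in one pipeline,
# without materializing the converted copy on disk.
iconv -f UTF-16LE -t UTF-8 big16.txt | grep -c keyword
```

For the one-shot search this pipeline avoids the intermediate file entirely; to keep a converted copy, redirect the iconv output to a file first, as in Petr's command.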
On Tuesday 19 February 2008 20:27, Petr Cerny wrote:
iconv is your friend. Basically you would need to do something like: 'iconv -f some_unicode_encoding -t ascii < unicode_file > ascii_file' This will also choke on any non-ascii characters (unless you make it translate unknown characters with "//TRANSLIT" appended to output encoding).
For lots of files, I'd rather use "recode", which can operate on multiple files
in-place without the need for a temp file for each one:
recode unicode..utf8 *.txt
CU
--
Stefan Hundhammer