Re: [opensuse] Need new source for unix utils -- gnu has broken another. (fwd)
On Fri, 2 Feb 2018, Linda Walsh wrote:
I've used grep to search for strings across all my mailboxes for decades. Found out today, it randomly doesn't work based on whether or not the file contains any text that doesn't comply with POSIX. ... I suppose no one else really does a quick search through all their email this way any more. Though is this what you'd expect?
This worried me since I use grep to search through mail archives. 42.3 includes grep 2.16 dated 2014-01-01. man grep for 2.16 says under the heading ENVIRONMENT VARIABLES:
POSIXLY_CORRECT If set, grep behaves as POSIX requires; otherwise, grep behaves more like other GNU programs.
The latest grep, 3.1, dated 2017-07-02, contains the same statement.
Do you have POSIXLY_CORRECT set? If not, I would not expect to see grep enforcing POSIX.
Roger
--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse+owner@opensuse.org
Roger Price wrote:
On Fri, 2 Feb 2018, Linda Walsh wrote:
I've used grep to search for strings across all my mailboxes for decades. Found out today, it randomly doesn't work based on whether or not the file contains any text that doesn't comply with POSIX. ... I suppose no one else really does a quick search through all their email this way any more. Though is this what you'd expect?
This worried me since I use grep to search through mail archives. 42.3 includes grep 2.16 dated 2014-01-01. man grep for 2.16 says under the heading ENVIRONMENT VARIABLES:
POSIXLY_CORRECT If set, grep behaves as POSIX requires; otherwise, grep behaves more like other GNU programs.
The latest grep, 3.1, dated 2017-07-02, contains the same statement.
Do you have POSIXLY_CORRECT set? If not, I would not expect to see grep enforcing POSIX.
Bingo. This was what I said -- (I DO NOT have POSIXLY_CORRECT set in my ENV). I pointed this out. I submitted this as a bug against grep only to have it closed because the email sent to me doesn't conform to POSIX! Specifically, see the "***"d statements by Eric Blake, below, and my response to his closing out the bug...
-------- Original Message --------
Subject: Re: bug#30326: grep not searching through a text file (thinking it binary)
Date: Fri, 02 Feb 2018 12:09:23 -0800
From: L A Walsh <gnu@tlinx.org>
CC: 30326-done@debbugs.gnu.org, GNU bug control <control@debbugs.gnu.org>
References: <5A74BC3F.1030401@tlinx.org> <2c00563c-9347-c596-4ade-a87bd9262ca1@redhat.com>
Eric Blake wrote:
tag 30326 notabug
On 02/02/2018 01:30 PM, L. A. Walsh wrote:
I've used grep to search through my mbox-format emails for decades, but I've run into a case where it seems to be ignoring a text mailbox because, I guess, it thinks it is "binary".
Yes, that's correct.
If I used "-Par" it finds it.
Yes, that's also correct.
It seems that grep believes the file to be binary and ignores it, though "file" calls it "text".
The file is conditionally text. The POSIX definition of a text file is one whose lines consist of valid characters in the current locale - but note this definition is locale-dependent! So a file that is text under one locale may be binary under another. When you are grepping a file encoded correctly for the current locale, you get the output you want; when you are grepping a file that contains encoding errors for the current locale, POSIX says behavior is undefined, so GNU grep warns you that the file is binary (in the current locale); and your use of -a tells grep to process it anyways. As 'file' reported that your file was using non-ISO extended-ASCII, it probably means the file was encoded for an 8-bit single-byte locale; and my guess is that you were running grep under a UTF-8 locale, and generally, UTF-8 treats 8-bit single-byte inputs as encoding errors. Hence the warning that your file is binary, under the current locale.
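To illustrate the locale dependence described above, here is a minimal Python sketch. It is not how grep itself decides; it only assumes that "valid characters in the current locale" can be approximated by whether the bytes decode under a given encoding. The helper name is_text_in and the sample mbox line are made up for the illustration.

```python
def is_text_in(data: bytes, encoding: str) -> bool:
    """Return True if the whole byte string decodes cleanly under `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical mbox line written in an 8-bit single-byte encoding
# (ISO-8859-1), where the accented e is the single byte 0xE9.
line = "Subject: café\n".encode("latin-1")

print(is_text_in(line, "latin-1"))  # True: every byte is a valid character
print(is_text_in(line, "utf-8"))    # False: 0xE9 starts an incomplete sequence
```

The same bytes count as text under one encoding and as containing an encoding error under the other, which is the situation the bug report ran into.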
You can also use 'LC_ALL=C grep' to force a locale where EVERY byte is a valid character, and thus where you will never encounter encoding errors (you may encounter OTHER things that make your file binary, such as embedded NULs, but that's a different matter).
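As a rough analogue of searching in a locale where every byte is a valid character, the following Python sketch matches a pattern against raw bytes without decoding them at all. It does not invoke grep, and the function name search_bytes and the sample data are assumptions for illustration only.

```python
import re
from typing import List


def search_bytes(pattern: bytes, data: bytes) -> List[bytes]:
    """Return the lines of `data` that match `pattern`, comparing raw bytes."""
    regex = re.compile(pattern)
    # Split only on newline bytes; no decoding step, so no encoding errors.
    return [line for line in data.split(b"\n") if regex.search(line)]


# Hypothetical mailbox fragment containing one non-UTF-8 byte (0xE9).
mbox = "From: a@example.org\nSubject: café meeting\n".encode("latin-1")

print(search_bytes(b"meeting", mbox))  # [b'Subject: caf\xe9 meeting']
```

Because nothing is ever decoded, a stray 8-bit byte cannot cause the line to be skipped; the trade-off is that multi-byte characters are not treated as single characters in patterns.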
*** This behavior is documented and intentional, so I'm closing this as not
*** a bug in the tracker. However, feel free to add further comments or
*** questions to the thread.
And perhaps we could tweak the grep diagnostics to clarify whether a file is binary because NUL bytes were encountered, vs. a file is binary because encoding errors were encountered.
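A sketch of what such a distinction could look like, written in Python rather than taken from grep's source. The function name binary_reason, its return strings, and the use of UTF-8 as a stand-in for "the current locale" are all assumptions.

```python
from typing import Optional


def binary_reason(data: bytes, encoding: str = "utf-8") -> Optional[str]:
    """Say why `data` would be considered binary, or return None if it is text."""
    if b"\x00" in data:
        return "NUL byte"
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return f"encoding error at byte offset {err.start}"
    return None


print(binary_reason(b"abc\x00def"))     # NUL byte
print(binary_reason(b"caf\xe9\n"))      # encoding error at byte offset 3
print(binary_reason(b"plain ascii\n"))  # None
```

Reporting the reason separately would tell the user whether switching locale or encoding could help (encoding error) or not (embedded NULs).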
Grep was around long before POSIX, as were most of the unix utils. Grep was able to find text strings in mboxes without a POSIX definition telling it that it was "broken".
I don't want it displaying random binary that throws my terminal into weird modes, which is why I skip binary files. To have grep searching through some mailboxes while skipping others, randomly based on what email happens to be in the box at the time, is hardly a useful utility.
I did not ask for POSIXLY_CORRECT -- if you need to have it be POSIXLY Correct, then use the existing var, but grep is now broken -- since POSIX doesn't define "text" files "out in the real world", but only for files that adhere to the POSIX standard. People don't write emails that adhere to the POSIX standard.
Also, FWIW, grep's manpage doesn't say it is limited to posix-only files. Its summary says:
grep, egrep, fgrep - print lines matching a pattern
which it does not do. It doesn't say "print lines matching a pattern only from POSIX text files".
------END "Original Message" --------
On Sun, 4 Feb 2018, L A Walsh wrote:
Eric Blake wrote:
The file is conditionally text. The POSIX definition of a text file is one whose lines consist of valid characters in the current locale
I have spent enough time working on various standards to understand that many of them are more or less broken, and that reference implementations [1] appear which effectively define what the Standard is understood to mean. POSIX (IEEE 1003, ISO/IEC 9945) is no exception.
The IEEE committee, the "Austin Common Standards Revision Group", proposes that interested parties become members of the working group. https://www.opengroup.org/austin/
« Since 1998 the standard has been developed by the Austin Group, an open working group found at http://www.opengroup.org/austin/. Participation is free and open to all interested parties (you just need to join the mailing list). »
They also offer a bug report mechanism:
« To submit a bug report against the specifications please use the Mantis online defect tracker at http://austingroupbugs.net »
Perhaps you should report the broken POSIX definition of a text file to the Austin Group.
Roger
[1] An excellent example is James Clark's SGML parser.
participants (2): L A Walsh, Roger Price