[Bug 1064519] echo xxx |grep '[A-Z]' returns xxx with cs_CZ locale

23 Oct 2017

      http://bugzilla.novell.com/show_bug.cgi?id=1064519
http://bugzilla.novell.com/show_bug.cgi?id=1064519#c1

Michael Matz  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |matz@suse.com
         Resolution|---                         |FIXED

--- Comment #1 from Michael Matz  ---
POSIX regexp character ranges are defined to be linguistic and not byte-valued.
Hence something like [A-Z] contains all collation elements between the two
endpoints A and Z.

The locale "cs_CZ" (and for instance pl_PL and lt_LT, but not for example en_US
or de_DE) defines the collating order such that minuscule letters are
interleaved with their capital variants, like so:

aAbB ... xXyYzZ

Due to further complications in Czech having to do with multi-character
sequences, the range A-X doesn't include the small x (even though it would
normally, given the above sorting order).  But of course A-Y, and hence A-Z, do
include the small x.
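
To see which ranges pick up the small x on a given machine, a small probe
along the following lines may help.  It is only a sketch: range_matches is a
hypothetical helper, the locale names are assumptions about what is installed,
and the answers for the non-C locales depend on the C library and its locale
data.  If a locale is missing, grep typically behaves as in the C locale.

```shell
# Hypothetical helper: print "yes" if STRING ($3) matches PATTERN ($2)
# when grep runs under LOCALE ($1), otherwise print "no".
range_matches() {
    if printf '%s\n' "$3" | LC_ALL="$1" grep -q -e "$2"; then
        echo yes
    else
        echo no
    fi
}

# Compare the two ranges discussed above against the input "xxx".
# Only the C row is predictable everywhere; the others vary by system.
for loc in C en_US.UTF-8 cs_CZ.UTF-8; do
    printf '%-12s [A-X]: %-3s  [A-Z]: %s\n' "$loc" \
        "$(range_matches "$loc" '[A-X]' xxx)" \
        "$(range_matches "$loc" '[A-Z]' xxx)"
done
```

In the C locale both columns read "no", because there the range is a plain
byte range.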

If you want to see some more funny things do this:

% echo chemie | LANG=cs_CZ.UTF-8 grep -E '^[^x]emie'
chemie

(doesn't match in most other locales).  Reason: POSIX regexps match
on a collating sequence (a mapping of the input to collating elements), and
'ch' is a single collating element (so that it sorts between h and i), and
hence is matched by the single-character regexp [^x].

(This works only with POSIX-compliant regexp engines; others will usually
regard 'ch' as two characters.)
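
For comparison, the same pattern can be run under the C locale, where 'ch' is
two separate characters and the pattern cannot match.  This is a sketch only:
probe is a hypothetical helper, and the cs_CZ.UTF-8 line assumes that locale
is installed and that grep uses a POSIX-compliant matcher.

```shell
# Hypothetical helper: report whether "chemie" matches '^[^x]emie'
# when grep runs under the locale given as $1.
probe() {
    if echo chemie | LC_ALL="$1" grep -qE '^[^x]emie'; then
        echo "$1: match"
    else
        echo "$1: no match"
    fi
}

# C locale: [^x] consumes only the 'c', then 'e' fails against 'h'.
c_out=$(probe C)
echo "$c_out"

# Czech locale: the outcome depends on the installed locale data.
probe cs_CZ.UTF-8
```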

In short: (POSIX) character ranges and character classes are i18n'ed.  They
don't do simple ASCII-like byte ranges or mapping.  If you need that, you have
to use LC_COLLATE=C (and for mapping also LC_CTYPE=C).
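
A minimal sketch of that workaround follows.  One caveat, which is an addition
here and not part of the comment above: LC_ALL, when set, takes precedence
over LC_COLLATE and LC_CTYPE, so scripts often set LC_ALL=C instead.  The
variable names are illustrative.

```shell
# Per-category override.  This only takes effect while LC_ALL is unset,
# so LC_ALL is cleared inside the command substitution.
# grep -c prints the number of matching lines; "|| true" hides the
# non-zero exit status grep returns when nothing matches.
plain=$(unset LC_ALL; echo xxx | LC_COLLATE=C LC_CTYPE=C grep -c '[A-Z]' || true)

# Blanket override: LC_ALL wins over LANG and every LC_* variable.
forced=$(echo xxx | LC_ALL=C grep -c '[A-Z]' || true)

# Character classes follow LC_CTYPE, which LC_ALL=C covers as well.
upper=$(echo xxx | LC_ALL=C grep -c '[[:upper:]]' || true)

echo "LC_COLLATE=C: $plain  LC_ALL=C: $forced  [[:upper:]]: $upper"
```

All three counts are 0: with byte-valued ranges, "xxx" contains nothing
between A and Z.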

See for instance https://www.regular-expressions.info/posixbrackets.html

Therefore: works as expected, INVALID.

bugzilla_noreply＠novell.com