http://bugzilla.novell.com/show_bug.cgi?id=1064519
http://bugzilla.novell.com/show_bug.cgi?id=1064519#c1
Michael Matz changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |matz@suse.com
         Resolution|---                         |FIXED
--- Comment #1 from Michael Matz ---
POSIX regexp character ranges are defined to be linguistic and not byte-valued.
Hence something like [A-Z] contains all collation elements between the two
endpoints A and Z.
The locale "cs_CZ" (and, for instance, pl_PL and lt_LT, but not, for example,
en_US or de_DE) defines the collating order such that minuscule letters are
interleaved with their capital variants, like so:
aAbB ... xXyYzZ
Due to further complications in Czech having to do with multi-character
sequences, the range A-X doesn't include the small x (even though it normally
would, given the above sorting order). But of course A-Y, and hence A-Z, do
include the small x.
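To check which lowercase letters a given locale places inside [A-Z], a small
probe along these lines can help. This is only a sketch: the function name
lower_in_upper_range is made up for illustration, the locale passed in has to
be installed (otherwise the C library typically falls back to "C"), and the
result depends on the libc and grep versions in use, so no output is claimed
for cs_CZ here.

```sh
# Hypothetical helper: print every lowercase ASCII letter that the bracket
# expression [A-Z] matches under the locale given as $1.
lower_in_upper_range() {
    for c in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        # LC_ALL overrides LANG and all other LC_* variables for this grep call.
        if printf '%s\n' "$c" | LC_ALL="$1" grep -q '^[A-Z]$'; then
            printf '%s' "$c"
        fi
    done
    echo
}

lower_in_upper_range C             # byte-valued range: prints an empty line
lower_in_upper_range cs_CZ.UTF-8   # result depends on the locale definition
```

Comparing the two lines of output shows directly whether the locale's
collating order pulls lowercase letters into the range.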
If you want to see some more funny things do this:
% echo chemie | LANG=cs_CZ.UTF-8 grep -E '^[^x]emie'
chemie
(This doesn't match in most other locales.) Reason: POSIX regexps match on a
collating sequence (a mapping of the input to collating elements). In Czech,
'ch' is a single collating element (so that it sorts between h and i), and
hence it is matched by the single-character regexp [^x].
(This works only with POSIX-compliant regexp engines; others will usually
regard 'ch' as two characters.)
In short: (POSIX) character ranges and character classes are i18n'ed. They
don't do simple ASCII-like byte ranges or case mapping. If you need that, you
have to use LC_COLLATE=C (and for mapping also LC_CTYPE=C).
See for instance https://www.regular-expressions.info/posixbrackets.html
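As a minimal sketch of that advice, a wrapper can pin both categories for a
single command. The name bytewise is hypothetical. Clearing LC_ALL is an
addition not mentioned above: a non-empty LC_ALL in the environment would
otherwise override LC_COLLATE and LC_CTYPE.

```sh
# Hypothetical wrapper: run a command with byte-valued ranges and ASCII-only
# character classes, leaving the other locale categories untouched.
bytewise() {
    # An empty LC_ALL is treated as unset, so it cannot override the two
    # categories set here.
    LC_ALL= LC_COLLATE=C LC_CTYPE=C "$@"
}

printf 'x\nX\n' | bytewise grep '[A-Z]'   # prints only: X
```

Setting LC_ALL=C for the command has the same effect on ranges and classes,
but it also switches every other category (messages, number formats) to C.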
Therefore: works as expected, INVALID.