Michael Matz changed bug 1064519
What        Removed  Added
Status      NEW      RESOLVED
CC                   matz@suse.com
Resolution  ---      FIXED

Comment # 1 on bug 1064519 from
POSIX regexp character ranges are defined to be linguistic and not byte-valued.
Hence something like [A-Z] contains all collation elements between the two
endpoints A and Z.

The locale "cs_CZ" (and for instance pl_PL and lt_LT, but not for example en_US
or de_DE) defines the collating order such that minuscule letters sort in
between their capital variants, like so:

aAbB ... xXyYzZ

Due to further complications in Czech having to do with multi-character
sequences, the range A-X doesn't include the small x (even though it normally
would, given the above sorting order).  But of course A-Y and hence A-Z do
include the small x.

If you want to see some more funny things do this:

% echo chemie | LANG=cs_CZ.UTF-8 grep -E '^[^x]emie'
chemie

(doesn't match in most other locales).  Reason: POSIX regexps match on a
collating sequence (a mapping of the input to collating elements), and 'ch'
is a single collating element (so that it sorts between h and i), and hence
is matched by the single-character regexp [^x].

(This works only with POSIX-compliant regexp engines; others will usually
regard 'ch' as two characters.)
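
To compare the 'ch' behaviour across locales, a minimal sketch might look like
the following.  The helper name match_count is hypothetical, and the script
assumes a POSIX shell with grep and (optionally) an installed cs_CZ.UTF-8
locale.  Whether the Czech case matches depends on the regexp engine and libc
version, so no expected output is given for it.

```shell
# Hypothetical helper: reads lines on stdin and prints how many of them
# match the regexp from the comment, under the locale given as $1.
match_count() {
    # grep -c exits non-zero when the count is 0; '|| true' hides that.
    LC_ALL="$1" grep -c -E '^[^x]emie' || true
}

# C locale: 'c' and 'h' are two separate characters, so [^x] consumes only
# the 'c' and the rest ("hemie") does not match "emie".
echo chemie | match_count C

# cs_CZ: only meaningful if the locale is actually installed.
if locale -a 2>/dev/null | grep -qi '^cs_CZ\.utf-\{0,1\}8$'; then
    echo chemie | match_count cs_CZ.UTF-8
else
    echo "cs_CZ.UTF-8 not installed, skipping" >&2
fi
```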

In short: (POSIX) character ranges and character classes are i18n'ed.  They
don't do simple ASCII-like byte ranges or mapping.  If you need that, you have
to use LC_COLLATE=C (and for mapping also LC_CTYPE=C).
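
As a sketch of that workaround, the snippet below forces the C locale for a
single grep call.  It uses LC_ALL=C rather than LC_COLLATE=C as a deliberate
assumption: LC_ALL overrides both LC_COLLATE and LC_CTYPE, so it also works
when LC_ALL is already set in the environment.  The function name
count_upper_match is hypothetical.

```shell
# Hypothetical helper: prints 1 if the argument contains a character in the
# byte range A-Z, otherwise 0.  LC_ALL=C makes [A-Z] a plain byte range.
count_upper_match() {
    # grep -c exits non-zero when the count is 0; '|| true' hides that.
    printf '%s\n' "$1" | LC_ALL=C grep -c '[A-Z]' || true
}

count_upper_match x   # lowercase x is outside the byte range A-Z
count_upper_match X   # uppercase X is inside it
```

In the C locale the first call prints 0 and the second prints 1, regardless
of the locale set in the calling shell.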

See for instance https://www.regular-expressions.info/posixbrackets.html

Therefore: works as expected, INVALID.

