What | Removed | Added |
---|---|---|
Status | NEW | RESOLVED |
CC | matz@suse.com | |
Resolution | --- | FIXED |
POSIX regexp character ranges are defined to be linguistic, not byte-valued. Hence something like [A-Z] contains all collation elements between the two endpoints A and Z. The locale "cs_CZ" (and, for instance, pl_PL and lt_LT, but not en_US or de_DE) defines the collating order such that minuscule letters come in between their capital variants, like so:

    aAbB ... xXyYzZ

Due to further complications in Czech having to do with multi-character sequences, the range A-X doesn't include the small x (even though it normally would, given the above sorting order). But of course A-Y, and hence A-Z, do include the small x.

If you want to see some more funny things, do this:

    % echo chemie | LANG=cs_CZ.UTF-8 grep -E '^[^x]emie'
    chemie

(This doesn't match in most other locales.) Reason: POSIX regexps match on a collating sequence, that is, a mapping of the input to collating elements. 'ch' is a single collating element (so that it sorts between h and i), and hence it is matched by the single-character regexp [^x]. This works only with POSIX-compliant regexp engines; others will usually regard 'ch' as two characters.

In short: (POSIX) character ranges and character classes are i18n'ed. They don't do simple ASCII-like byte ranges or mapping. If you need that, you have to use LC_COLLATE=C (and for mapping also LC_CTYPE=C). See for instance https://www.regular-expressions.info/posixbrackets.html

Therefore: works as expected, INVALID.
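For reference, a minimal sketch of the byte-valued workaround, assuming a POSIX sh and a grep that supports -E and -q. The helper name match_c is made up for illustration. It uses LC_ALL=C rather than LC_COLLATE=C and LC_CTYPE=C separately, because LC_ALL takes precedence over both and so also covers the case where the caller's environment already has LC_ALL set.

```shell
#!/bin/sh
# Hypothetical helper: test one input line against an extended regexp
# with byte-valued (C locale) semantics, whatever the caller's locale is.
match_c() {
    # $1 = extended regexp, $2 = input line; prints "match" or "no match".
    if printf '%s\n' "$2" | LC_ALL=C grep -E -q -- "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

match_c '^[A-Z]$' X          # match: X is inside the byte range A-Z
match_c '^[A-Z]$' x          # no match: lowercase x is outside bytes 0x41-0x5A
match_c '^[^x]emie$' chemie  # no match: 'c' and 'h' are two characters here
```

In the C locale the range is a plain byte range and 'ch' is not a collating element, so neither of the cs_CZ effects described above shows up.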