[opensuse-factory] collation bug in locales using UTF-8 (cur= a<A<b<B<z<Z; should be A<B<C<Z<a<b<c<z)

Linda Walsh

27 May 2012

20:07

It seems that openSUSE suffers from this bug as well: https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687

Somewhere along the line, due to people paying attention to POSIX, they thought they could change collation orders for any locale other than the 'C' locale to anything they wanted.

What they didn't realize is that Unicode also specifies a collation order that is roughly (maybe exactly, in the C range) equivalent to the C order.

While localized character sets (iso-8859-xx and others) might have different collating orders, for those who have UTF-8 as the default encoding, or more so, set encodings to lang_CO.UTF-8, the character set's collation order should take precedence, __AT LEAST__ in character ranges as used in regexes (those are pure characters with no words involved, so only the ordering of the character set should be used).

Maybe this needs to be fixed in the GNU sorting functions? Unicode sources are quoted with links in the above bug...

-- To unsubscribe, e-mail: opensuse-factory+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-factory+owner@opensuse.org
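For reference, the range behaviour asked for above is what the C locale provides. The following is a minimal sketch, assuming a POSIX shell such as bash or dash; `is_lower` is a hypothetical helper name used only for this illustration, and nothing here is taken from the linked bug report.

```shell
# is_lower: hypothetical helper, prints "yes" if $1 is a single character
# matched by the range [a-z] under the C locale, otherwise "no".
is_lower() {
  # Run in a subshell so the locale override does not leak to the caller.
  (
    LC_ALL=C
    case "$1" in
      [a-z]) echo yes ;;
      *)     echo no ;;
    esac
  )
}

is_lower b   # yes
is_lower B   # no
```

Under the C locale the range is evaluated by byte value, so uppercase letters fall outside [a-z]. What a non-C locale does with the same range depends on the shell and its version, which is the behaviour under discussion in this thread.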

Anders Johansson

27 May

21:30

On Sunday 27 May 2012 13:07:13 Linda Walsh wrote:

...

It seems that Open SuSE suffers from this bug as well:

https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687

Somewhere along the line, due to people paying attention to POSIX, they thought they could change collation orders for any locale other than the 'C' locale to anything they wanted.

What they didn't realize is that Unicode also specifies a collation order that is roughly (maybe exactly, in the C range) equivalent to the C order.

No, it doesn't. In fact, the link in the bug explicitly says that the implementation must provide a way to parametrize the sort order, to account for national standards.

...

While localized character sets (iso-8859-xx and others) might have different collating orders, for those who have UTF-8 as the default encoding, or more so, set encodings to lang_CO.UTF-8, the character set's collation order should take precedence, __AT LEAST__ in character ranges as used in regexes (those are pure characters with no words involved, so only the ordering of the character set should be used).

What are you saying here? At first you sound as though you are arguing for a single, standard sorting order across all countries, languages, and alphabets in the world, and here you seem to say it should be character set specific.

In any case, there is no way you are ever going to get a standard sorting order for the whole world.

Anders

Linda Walsh

28 May

00:11

New subject: [opensuse-factory] Re: collation bug in locales using UTF-8 (cur= a<A<b<B<z<Z; should be A<B<C<Z<a<b<c<z)

Anders Johansson wrote:

...

What are you saying here? At first you sound as though you are arguing for a single, standard sorting order across all countries, languages, and alphabets in the world, and here you seem to say it should be character set specific.

That's the idea of Unicode: that there would be one large encoding to hold all of the world's symbols. That started over a decade ago. That they would also specify a collation order for each covered alphabet should be no big surprise. They also cover capitalization rules, among other features.

...

In any case, there is no way you are ever going to get a standard sorting order for the whole world.

There already is one. The Unicode specification already specifies a collation order for all the languages it covers. I'm pointing out that for the Latin character set, A-Z sorts before a-z. This isn't the case under SuSE.

Jan Engelhardt

08:06

On Monday 2012-05-28 02:11, Linda Walsh wrote:

...

Anders Johansson wrote:

...
In any case, there is no way you are ever going to get a standard sorting order for the whole world.

There already is one.

The Unicode specification already specifies collation order for all the languages it covers.

I'm pointing out that for the Latin character set the A-Z sort before a-z. This isn't the case under SuSE.

The collation order for many locales with a Latin-like character set is indeed Aa..Zz or aA..zZ. Phone books, word dictionaries - they all work like that. Observe:

...

LC_COLLATE=de_DE.UTF-8 ls
a A å Å ä Ä æ Æ b B ö Ö ø Ø ß w W z Z

LC_COLLATE=nb_NO.UTF-8 ls
A a B b ß W w Z z Æ æ Ä ä Ø ø Ö ö Å å

LC_COLLATE=sv_SE.UTF-8 ls
a A b B ß w W z Z å Å ä Ä æ Æ ö Ö ø Ø

Collating in A..Za..z fashion is reserved for the "POSIX" locale. If you want that, set it: export LC_COLLATE=POSIX, or set RC_LC_COLLATE in /etc/sysconfig/language.
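The A..Za..z order of the "POSIX" locale can be checked without changing the session's settings. This is a minimal sketch, assuming a POSIX shell with the standard `sort` and `paste` utilities; the variable name `c_order` is only illustrative. It uses LC_ALL rather than LC_COLLATE because LC_ALL, when set in the environment, overrides the individual LC_* categories.

```shell
# Sort six letters by byte value, as the C/POSIX locale does.
# LC_ALL applies to this one sort command only and overrides
# LC_COLLATE and LANG for it.
c_order=$(printf '%s\n' a A b B z Z | LC_ALL=C sort | paste -sd' ' -)
echo "$c_order"   # A B Z a b z
```

Replacing C with another locale name, for example de_DE.UTF-8, shows that locale's order instead, provided the locale is installed (`locale -a` lists the available ones).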

Guido Berhoerster

08:07

On 28.05.2012 02:11, Linda Walsh wrote:

...

The Unicode specification already specifies collation order for all the languages it covers.

I'm pointing out that for the Latin character set the A-Z sort before a-z.

No, they don't, and AFAICS glibc implements the Unicode collation algorithm.

-- Guido Berhoerster

Anders Johansson

08:18

On Sunday 27 May 2012 17:11:13 Linda Walsh wrote:

...

There already is one.

The Unicode specification already specifies collation order for all the languages it covers.

I'm pointing out that for the Latin character set the A-Z sort before a-z.

This is not true, as I already mentioned. The Unicode document linked from the bug you mentioned explicitly says this can never happen. It talks about how different regions have different ideas about sort order, and how any new character set standard implementation will need to respect this and not try to impose a worldwide standard.

I already mentioned this in my previous email. Why did you ignore that?

Anders
