[opensuse-factory] collation bug in locales using UTF-8 (cur= a<A<b<B<z<Z; should be A<B<C<Z<a<b<c<z)
It seems that Open SuSE suffers from this bug as well:
https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687
Somewhere along the line, due to people paying attention to POSIX, they thought they could change collation orders for any locale outside of the 'C' locale to anything they wanted.
What they didn't realize is that Unicode also specifies a collation order that is roughly (maybe exactly, in the C range) equivalent to the C range.
While localized character sets (iso-8859-xx and others) might have different collating orders, for those who have UTF-8 as the default encoding, or more so, set encodings to lang_CO.UTF-8, the character set collation order should take precedence -- AT LEAST in character ranges as used in regexes (as those are 'pure' characters and no words are involved, only the ordering of the character set should be used).
Maybe this needs to be fixed in the GNU sorting functions? Unicode sources are quoted with links in the above bug...
--
To unsubscribe, e-mail: opensuse-factory+unsubscribe@opensuse.org
To contact the owner, e-mail: opensuse-factory+owner@opensuse.org
On Sunday 27 May 2012 13:07:13 Linda Walsh wrote:
It seems that Open SuSE suffers from this bug as well:
https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687
Somewhere along the line, due to people paying attention to POSIX, they thought they could change collation orders for any locale outside of the 'C' locale to anything they wanted.
What they didn't realize is that Unicode also specifies a collation order that is roughly (maybe exactly, in the C range) equivalent to the C range.
No it doesn't. In fact, the link in the bug explicitly says that there must be a way in the implementation of parametrizing the sort order, to account for national standards.
While localized character sets (iso-8859-xx and others) might have different collating orders, for those who have UTF-8 as the default encoding, or more so, set encodings to lang_CO.UTF-8, the character set collation order should take precedence -- AT LEAST in character ranges as used in regexes (as those are 'pure' characters and no words are involved, only the ordering of the character set should be used).
What are you saying here? At first you sound as though you are arguing for a single, standard sorting order across all countries, languages, and alphabets in the world, and here you seem to say it should be character set specific.
In any case, there is no way you are ever going to get a standard sorting order for the whole world.
Anders
Anders Johansson wrote:
What are you saying here? At first you sound as though you are arguing for a single, standard sorting order across all countries, languages, and alphabets in the world, and here you seem to say it should be character set specific.
--- That's the idea of Unicode -- that there would be one large encoding to hold all of the world's symbols. That started over a decade ago. That they would also specify a collation order for each covered alphabet should be no big surprise. They also cover capitalization rules, among other features.
In any case, there is no way you are ever going to get a standard sorting order for the whole world.
------------- There already is one. The Unicode specification already specifies collation order for all the languages it covers. I'm pointing out that for the Latin character set the A-Z sort before a-z. This isn't the case under SuSE.
On Monday 2012-05-28 02:11, Linda Walsh wrote:
Anders Johansson wrote:
In any case, there is no way you are ever going to get a standard sorting order for the whole world.
There already is one.
The Unicode specification already specifies collation order for all the languages it covers.
I'm pointing out that for the Latin character set the A-Z sort before a-z. This isn't the case under SuSE.
The collation order for many locales with a Latin-like character set is indeed Aa..Zz or aA..zZ. Phone books, word dictionaries - they all work like that. Observe:
LC_COLLATE=de_DE.UTF-8 ls
a  A  å  Å  ä  Ä  æ  Æ  b  B  ö  Ö  ø  Ø  ß  w  W  z  Z
LC_COLLATE=nb_NO.UTF-8 ls
A  a  B  b  ß  W  w  Z  z  Æ  æ  Ä  ä  Ø  ø  Ö  ö  Å  å
LC_COLLATE=sv_SE.UTF-8 ls
a  A  b  B  ß  w  W  z  Z  å  Å  ä  Ä  æ  Æ  ö  Ö  ø  Ø
Collating in A..Za..z fashion is reserved for the "POSIX" locale. If you want that, set it:
export LC_COLLATE=POSIX
or set RC_LC_COLLATE in /etc/sysconfig/language.
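A minimal sketch of the POSIX-locale behaviour described above, assuming a POSIX shell with `sort` and `paste` available. The four single-letter names and the variable names `sorted` and `in_range` are made up for illustration. It only shows the "C"/"POSIX" case, where ordering follows byte values; what a language locale such as de_DE.UTF-8 produces depends on the locale data installed on the system.

```shell
#!/bin/sh
# Force the POSIX ("C") locale. LC_ALL overrides LC_COLLATE and LANG,
# so the result does not depend on the caller's environment.
LC_ALL=C
export LC_ALL

# Sort four illustrative names and join them on one line.
sorted=$(printf '%s\n' b A a B | sort | paste -s -d ' ' -)
echo "$sorted"        # prints: A B a b  (all uppercase before lowercase)

# In the POSIX locale a range such as [a-z] covers only the lowercase
# ASCII letters, so an uppercase "B" falls outside it.
case B in
    [a-z]) in_range=yes ;;
    *)     in_range=no ;;
esac
echo "$in_range"      # prints: no
```

Setting only LC_COLLATE, as in the export line above, is the narrower choice: it changes sorting and ranges while leaving messages and character classification in the user's own language. LC_ALL is used in the sketch purely to make the result reproducible.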
On 28.05.2012 02:11, Linda Walsh wrote:
The Unicode specification already specifies collation order for all the languages it covers.
I'm pointing out that for the Latin character set the A-Z sort before a-z.
No, they don't, and AFAICS glibc implements the Unicode collation algorithm.
--
Guido Berhoerster
On Sunday 27 May 2012 17:11:13 Linda Walsh wrote:
There already is one.
The Unicode specification already specifies collation order for all the languages it covers.
I'm pointing out that for the Latin character set the A-Z sort before a-z.
This is not true, as I already mentioned. The Unicode document linked to from the bug you mentioned explicitly says this can never happen. It talks about how different regions have different ideas about sort order, and how any new character set standard implementation will need to respect this and not try to impose a worldwide standard.
I already mentioned this in my previous email. Why did you ignore that?
Anders
participants (4)
- Anders Johansson
- Guido Berhoerster
- Jan Engelhardt
- Linda Walsh