[opensuse-packaging] RFC: changing RPM's default scriptlet locale
Fellow mortals,

more and more packages need their locale to be set to something more sensible than C. This hit me while switching packages over to Python 3. Python gets its encoding from locale, so by default, it won't decode UTF-8 unless the appropriate encoding is set. Right now, when gtk-doc is switched to use Python 3, it won't build UTF-8 documentation.

This could be changed in gtk-doc itself (although that is impractical), but perhaps a better way is to change the default locale for spec scriptlets. Right now the macros set it to C. We could switch that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable that could be overridden in your spec file. Still, the default should be something with UTF-8 in it.

I'm now trying to build a Ring 1 staging project with this change. So far I have seen one failure related to it: with en_US.UTF-8, bash ranges (like [a-z]) are case-insensitive and match more than intended. That could be solved by changing the expression, or by setting locale to C.UTF-8.

Thoughts, comments?

regards
m.
On Fri, 2017-10-27 at 16:13 +0200, jan matejek wrote:
This could be changed in gtk-doc itself (although that is impractical), but perhaps a better way is to change the default locale for spec scriptlets. Right now the macros set it to C. We could switch that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable that could be overridden in your spec file. Still, the default should be something with UTF-8 in it.
I support this overall approach of getting it fixed 'high up'. One thing that strikes generally as 'odd' (looking at gtk-doc) is that the python2 variant seemed not to have trouble parsing the same files - but switching to py3 causes trouble. Is python3 itself expecting a .UTF8 locale to be reliably usable? Cheers Dominique
On Fri, Oct 27, 2017 at 11:03 AM, Dominique Leuenberger / DimStar <dimstar@opensuse.org> wrote:
On Fri, 2017-10-27 at 16:13 +0200, jan matejek wrote:
This could be changed in gtk-doc itself (although that is impractical), but perhaps a better way is to change the default locale for spec scriptlets. Right now the macros set it to C. We could switch that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable that could be overridden in your spec file. Still, the default should be something with UTF-8 in it.
I support this overall approach of getting it fixed 'high up'.
One thing that strikes generally as 'odd' (looking at gtk-doc) is that the python2 variant seemed not to have trouble parsing the same files - but switching to py3 causes trouble.
Is python3 itself expecting a .UTF8 locale to be reliably usable?
Yes, it is. Actually, the main reason for RPM itself not automatically exporting C.UTF-8 by default is that the actual support for this hasn't been merged into glibc. It's a patch that's carried only by Fedora, openSUSE, and Debian. Upstream, we've been having this discussion in one of the pull requests for RPM: https://github.com/rpm-software-management/rpm/pull/227 Personally speaking, I really wish someone would champion this to get into glibc properly so that we can reliably depend on it. -- 真実はいつも一つ!/ Always, there's only one truth! -- To unsubscribe, e-mail: opensuse-packaging+unsubscribe@opensuse.org To contact the owner, e-mail: opensuse-packaging+owner@opensuse.org
On Friday 2017-10-27 17:03, Dominique Leuenberger / DimStar wrote:
On Fri, 2017-10-27 at 16:13 +0200, jan matejek wrote:
This could be changed in gtk-doc itself (although that is impractical), but perhaps a better way is to change the default locale for spec scriptlets. Right now the macros set it to C. We could switch that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable that could be overridden in your spec file. Still, the default should be something with UTF-8 in it.
One thing that strikes generally as 'odd' (looking at gtk-doc) is that the python2 variant seemed not to have trouble parsing the same files -
I see no deviation in python2 behavior. Without PEP-0263 markers to give a 1st class decision, the parser only has 2nd-class choices to make -- and Python decides not to take 2nd choices, unlike Perl.

# leap 42.2
17:42 zap:/dev/shm > python2 x.py
  File "x.py", line 2
SyntaxError: Non-ASCII character '\xc3' in file x.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
17:42 zap:/dev/shm > cat x.py
#!/usr/bin/python
print "föhn\n"
On Fri, 2017-10-27 at 17:43 +0200, Jan Engelhardt wrote:
# leap 42.2
17:42 zap:/dev/shm > python2 x.py
  File "x.py", line 2
SyntaxError: Non-ASCII character '\xc3' in file x.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
17:42 zap:/dev/shm > cat x.py
#!/usr/bin/python
print "föhn\n"
Not exactly the issue we are seeing... It's not about the python file itself, but the files a python script might be opening and inspecting. Cheers Dominique
On Friday, 27 October 2017 17:03:58 CEST Dominique Leuenberger / DimStar wrote:
On Fri, 2017-10-27 at 16:13 +0200, jan matejek wrote:
This could be changed in gtk-doc itself (although that is impractical), but perhaps a better way is to change the default locale for spec scriptlets. Right now the macros set it to C. We could switch that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable that could be overridden in your spec file. Still, the default should be something with UTF-8 in it.
I support this overall approach of getting it fixed 'high up'.
One thing that strikes generally as 'odd' (looking at gtk-doc) is that the python2 variant seemed not to have trouble parsing the same files - but switching to py3 causes trouble.
Is python3 itself expecting a .UTF8 locale to be reliably usable?
Reading external string data without an explicit encoding is just plain wrong, both in python2 and in python3, and in any other language. If the program knows some input is UTF-8 (by specification), it should set the encoding to 'utf-8', otherwise it should determine the encoding (e.g. XML allows the specification of the encoding). When communicating with external programs (e.g. using pipes), the calling program should tell the called program which encoding to use (noop if the other program uses a fixed encoding), either via commandline switch or by explicit setting of the locale. Kind regards, Stefan
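Stefan's first point can be illustrated with a minimal Python 3 sketch. It assumes the input is UTF-8 by specification; the helper name `read_text_utf8` and the scratch file are made up for illustration and are not taken from gtk-doc.

```python
import os
import tempfile


def read_text_utf8(path):
    """Read a file that is known (by specification) to be UTF-8."""
    # encoding= is passed explicitly, so the result does not depend on
    # LANG / LC_* in the calling environment.
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Write some UTF-8 bytes to a scratch file, then read them back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("föhn\n".encode("utf-8"))
    sample_path = tmp.name

text = read_text_utf8(sample_path)
os.unlink(sample_path)
```

Without the `encoding=` argument, `open()` falls back to a locale-derived default, which is the behaviour this thread is about.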
On Fri, 2017-10-27 at 16:13 +0200, jan matejek wrote:
That could be solved by changing the expression, or by setting locale to C.UTF-8.
For the record, just to have it mentioned: %meson exports LANG=C.UTF-8 As meson is also a python3 app, it 'suffered' from the same issue that it needs a UTF-8 locale configured, but it was 'simple enough' to make this in the globally used macros. If we get RPM's own macros to use C.UTF-8, we would of course strip this from the meson-macros again. Cheers Dominique
On Friday 2017-10-27 16:13, jan matejek wrote:
more and more packages need their locale to be set to something more sensible than C. This hit me while switching packages over to Python 3. Python gets its encoding from locale, so by default, it won't decode UTF-8 unless the appropriate encoding is set.
I am not convinced that changing the rpm default helps. FWIW, a large portion of source files could be ISO-8859-1 (Windows still is a thing, and a popular one at that) so that LC=UTF-8 on a global scale would not help. The only real fix is to use an in-file marker such that the file becomes self-describing, and there are sufficient examples in history how to pull that off:

- Byte Order Mark to determine UTF-8, UTF-{16,32}{BE,LE}
- <?xml encoding="..." ?> in XML
- <meta> in HTML likewise
- "use utf8" in Perl (or something similar to it)
- "# -*- coding: utf-8 -*-" in Python (PEP-0263)

So there you have it. If Python falls over on UTF-8 files (I know Perl would), then those source files should say they are UTF-8. And those that are ISO-8859-1 should say they are iso-8859-1. Keeping the locale at C would at least identify the important spots as python would stop execution.
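For the PEP-0263 case in the list above, the standard library exposes the marker lookup through `tokenize.detect_encoding`. The following is a small sketch for Python 3; the helper name `declared_encoding` and the sample byte strings are illustrative only.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 would assume for this source file."""
    # detect_encoding looks for a BOM or a PEP 263 coding comment in the
    # first two lines and otherwise falls back to UTF-8.
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


latin1_source = b'# -*- coding: iso-8859-1 -*-\nprint("f\xf6hn")\n'
utf8_source = b'# -*- coding: utf-8 -*-\nprint("f\xc3\xb6hn")\n'
unmarked_source = b'print("hello")\n'

latin1_enc = declared_encoding(latin1_source)
utf8_enc = declared_encoding(utf8_source)
unmarked_enc = declared_encoding(unmarked_source)
```

Note that the unmarked file is treated as UTF-8 by Python 3, whereas Python 2 assumed ASCII for unmarked sources. This only covers Python source files, not data files a script opens.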
On 27.10.2017 17:41, Jan Engelhardt wrote:
I am not convinced that changing the rpm default helps. FWIW, a large portion of source files could be ISO-8859-1 (Windows still is a thing, and a popular one at that) so that LC=UTF-8 on a global scale would not help.
Source files are not the issue here though. (and FWIW Python 3 compatible source files tend to be modern enough to either a) be UTF8 or b) have the PEP263 encoding header) The issue is that Python 3 needs to know the encoding of *all* external inputs (e.g., documentation files to be parsed by a doc generator) because it stores strings as Unicode internally. So you'd need a BOM on *every* file potentially touched by Python, ever, and also somehow mark the encoding of stdin, stdout and such. (so maybe an environment variable? ;) )
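For the standard streams specifically, CPython does honour an environment variable, `PYTHONIOENCODING`. The sketch below starts a child interpreter with it set and asks the child what encoding its stdout uses; the helper name `child_stdout_encoding` is hypothetical. It only affects stdin/stdout/stderr, not files opened with `open()`.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child Python with PYTHONIOENCODING set and report its stdout encoding."""
    # Copy the current environment and override only the one variable.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii").strip()


reported = child_stdout_encoding("utf-8")
```

This is the "calling program tells the called program" pattern applied to one specific child; it is not a substitute for the locale question discussed here.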
So there you have it. If Python falls over on UTF-8 files (I know Perl would), then those source files should say they are UTF-8. And those that are ISO-8859-1 should say they are iso-8859-1.
Alternately we could say that UTF-8 is the distro default, and only non-default files must be marked. Which is what I'm proposing.
On Friday, 27 October 2017 17:41:23 CEST Jan Engelhardt wrote:
On Friday 2017-10-27 16:13, jan matejek wrote:
more and more packages need their locale to be set to something more sensible than C. This hit me while switching packages over to Python 3. Python gets its encoding from locale, so by default, it won't decode UTF-8 unless the appropriate encoding is set.
I am not convinced that changing the rpm default helps. FWIW, a large portion of source files could be ISO-8859-1 (Windows still is a thing, and a popular one at that) so that LC=UTF-8 on a global scale would not help.
The only real fix is to use an in-file marker such that the file becomes self-describing, and there are sufficient examples in history how to pull that off:
- Byte Order Mark to determine UTF-8, UTF-{16,32}{BE,LE}
Byte order mark is ambiguous - it is three bytes, which are valid codepoints in e.g. ISO-8859-1. Granted, it is unlikely, but ...
- <?xml encoding="..." ?> in XML
- <meta> in HTML likewise
- "use utf8" in Perl (or something similar to it)
- "# -*- coding: utf-8 -*-" in Python (PEP-0263)
All these guarantee the content up to and including the encoding specification is plain ASCII, so this is completely unambiguous and sane.
So there you have it. If Python falls over on UTF-8 files (I know Perl would), then those source files should say they are UTF-8. And those that are ISO-8859-1 should say they are iso-8859-1.
Keeping the locale at C would at least identify the important spots as python would stop execution.
Seconded, most utf-8 documents are also valid when interpreted as iso-8859-x, so guessing is a bad idea. Kind regards, Stefan
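Why guessing is a bad idea can be shown in a few lines of Python 3. This sketch uses ISO-8859-1 only (Python's codec for it accepts every byte value); the sample word is arbitrary.

```python
original = "föhn"
utf8_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 never raises; it silently yields mojibake.
misread = utf8_bytes.decode("iso-8859-1")

# The reverse direction is usually caught: ISO-8859-1 bytes with non-ASCII
# characters are rarely valid UTF-8.
try:
    b"f\xf6hn".decode("utf-8")
    latin1_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_as_utf8_ok = False
```

So a "try UTF-8 first, fall back to ISO-8859-1" heuristic can detect some errors, but the opposite order detects nothing.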
On Friday 2017-10-27 18:23, Brüns, Stefan wrote:
The only real fix is to use an in-file marker such that the file becomes self-describing, and there are sufficient examples in history how to pull that off:
- Byte Order Mark to determine UTF-8, UTF-{16,32}{BE,LE}
Byte order mark is ambiguous - it is three bytes, which are valid codepoints in e.g. ISO-8859-1. Granted, it is unlikely, but ...
The byte order mark is not ambiguous for what it was meant to do, since there is a bijective mapping between the domain of (defined) bit patterns and the codomain of (defined) encodings. ISO-8859-* is just not within the set. Understandably so, since ISO-8859-* does not __have__ a byte __order__ to begin with -- it is a single-byte encoding.
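The mapping from byte patterns to the five UTF variants can be sketched with the constants in Python's `codecs` module. The helper `sniff_bom` and the table `_BOMS` are illustrative names, and the function assumes the data really is one of the UTF encodings or carries no BOM at all.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM (FF FE 00 00) begins with the
# UTF-16LE BOM (FF FE), so checking order matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (encoding name or None, data with any BOM removed)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data


utf8_result = sniff_bom(codecs.BOM_UTF8 + b"abc")
utf32_result = sniff_bom(codecs.BOM_UTF32_LE + "a".encode("utf-32-le"))
plain_result = sniff_bom(b"plain")
```

A return value of `None` says nothing about the actual encoding; it only means no BOM was present.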
On Tuesday, 31 October 2017 8:11 Jan Engelhardt wrote:
On Friday 2017-10-27 18:23, Brüns, Stefan wrote:
Byte order mark is ambiguous - it is three bytes, which are valid codepoints in e.g. ISO-8859-1. Granted, it is unlikely, but ...
The byte order mark is not ambiguous for what it was meant to do, since there is a bijective mapping between the domain of (defined) bit patterns and the codomain of (defined) encodings. ISO-8859-* is just not within the set. Understandably so, since ISO-8859-* does not __have__ a byte __order__ to begin with -- it is a single-byte encoding.
Neither does UTF-8, that's why I always considered putting a "BOM" into a UTF-8 text a sign that someone doesn't know what they are doing and why. But I believe Stefan wanted to point out that UTF-8 "BOM" consists of three bytes which are valid ISO-8859-1 characters so that a document in ISO-8859-1 could, in theory, start with them (however unlikely that is). Michal Kubeček
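The theoretical collision is easy to make concrete in Python 3. The sketch decodes the three UTF-8 BOM bytes as ISO-8859-1 and then builds an (admittedly contrived) ISO-8859-1 document that starts with those characters.

```python
import codecs

# EF BB BF read as ISO-8859-1 gives three printable characters.
bom_as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")

# A contrived ISO-8859-1 document that happens to begin with them.
latin1_document = (bom_as_latin1 + " hello").encode("iso-8859-1")
looks_like_utf8_bom = latin1_document.startswith(codecs.BOM_UTF8)
```

A BOM check alone would therefore classify this ISO-8859-1 file as UTF-8, however unlikely such a file is in practice.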
On Tuesday 2017-10-31 08:42, Michal Kubecek wrote:
ISO-8859-* does not __have__ a byte __order__ to begin with -- it is a single-byte encoding.
Neither does UTF-8
As a multibyte encoding, it *does* have an order, even if just a single defined one. Swapping two octets does not necessarily produce the same character value (U+xxxx). In ISO-8859, you can do this switch and you will get the same character values.
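A small Python 3 sketch of the octet-swapping argument, using one two-byte UTF-8 character and a two-character ISO-8859-1 string chosen purely for illustration:

```python
# "ö" is two octets in UTF-8 (C3 B6); reversing them is no longer valid UTF-8.
utf8_pair = "ö".encode("utf-8")
swapped_utf8 = utf8_pair[::-1]
try:
    swapped_utf8.decode("utf-8")
    swapped_utf8_valid = True
except UnicodeDecodeError:
    swapped_utf8_valid = False

# In ISO-8859-1 every octet is a complete character, so swapping two octets
# only reorders the same characters.
latin1_pair = "ön".encode("iso-8859-1")
swapped_latin1 = latin1_pair[::-1].decode("iso-8859-1")
```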
But I believe Stefan wanted to point out that UTF-8 "BOM" consists of three bytes which are valid ISO-8859-1 characters so that a document in ISO-8859-1 could, in theory, start with them (however unlikely that is).
The BOM certainly is no gold standard (it only knows 5 character "sets" anyway), but it at least indicates to an EBCDIC system that some UTF is coming, as it would otherwise have no chance to see the <meta> tag due to the different codepage assumption. Though that's mostly speculation.
On Tuesday, 31 October 2017 9:48 Jan Engelhardt wrote:
On Tuesday 2017-10-31 08:42, Michal Kubecek wrote:
ISO-8859-* does not __have__ a byte __order__ to begin with -- it is a single-byte encoding.
Neither does UTF-8
As a multibyte encoding, it *does* have an order, even if just a single defined one. Swapping two octets does not necessarily produce the same character value (U+xxxx). In ISO-8859, you can do this switch and you will get the same character values.
If you wish to call it byte order, you certainly can. But unlike for e.g. UTF-16, there is no actual need for "BOM" in UTF-8 text. It's just a way some editors say "this is UTF-8" - but it doesn't actually work in general and confuses various parsers. The very idea of putting the information about encoding into the text itself is IMHO completely wrong as it can only work for a very limited set of encodings and only for a limited number of applications processing the text documents. Such information must be specified outside the document itself, explicitly or implicitly. Michal Kubecek
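One concrete case of a parser being confused by a UTF-8 "BOM" is Python's own `json` module. The sketch below uses a made-up payload; `utf-8-sig` is the standard codec that strips the signature on decode.

```python
import codecs
import json

payload = codecs.BOM_UTF8 + b'{"name": "gtk-doc"}'

# Plain UTF-8 decoding keeps the BOM as a leading U+FEFF character,
# which json.loads rejects.
naive_text = payload.decode("utf-8")
try:
    json.loads(naive_text)
    naive_ok = True
except ValueError:  # json.JSONDecodeError is a ValueError subclass
    naive_ok = False

# Decoding with utf-8-sig drops the signature first.
parsed = json.loads(payload.decode("utf-8-sig"))
```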
Hi, On Tue, 31 Oct 2017, Michal Kubecek wrote:
If you wish to call it byte order, you certainly can. But unlike for e.g. UTF-16, there is no actual need for "BOM" in UTF-8 text. It's just a way some editors do to say "this is UTF-8" - but it doesn't actually work in general and confuses various parsers.
Indeed. And Unicode actually discourages using a BOM with UTF-8 encoded text streams (and forbids it for UTF-16LE and UTF-16BE tagged ones). Any solution involving a BOM is a non-solution (if perhaps mildly intellectually entertaining) as it's useless for UTF-8 text streams which predominantly exist on Linux and doesn't help with all the other encodings that are still in use, except the two encodings for which it was designed but which aren't usually used on our systems: UTF-16 and UTF-32. Ciao, Michael.
On Friday, 27 October 2017 16:13:25 CEST jan matejek wrote:
Fellow mortals,
more and more packages need their locale to be set to something more sensible than C. This hit me while switching packages over to Python 3. Python gets its encoding from locale, so by default, it won't decode UTF-8 unless the appropriate encoding is set. Right now, when gtk-doc is switched to use Python 3, it won't build UTF-8 documentation.
This could be changed in gtk-doc itself (although that is impractical), but perhaps a better way is to change the default locale for spec scriptlets.
I think the gtk-doc problem is solved properly: https://github.com/GNOME/gtk-doc/commit/1eeec38a9a06a9956cdab9789cbd2ea1 Kind regards, Stefan
Hi, On Fri, 27 Oct 2017, jan matejek wrote:
more and more packages need their locale to be set to something more sensible than C. This hit me while switching packages over to Python 3. Python gets its encoding from locale, so by default, it won't decode UTF-8 unless the appropriate encoding is set. Right now, when gtk-doc is switched to use Python 3, it won't build UTF-8 documentation.
This could be changed in gtk-doc itself (although that is impractical), but perhaps a better way is to change the default locale for spec scriptlets. Right now the macros set it to C. We could switch that to "C.UTF-8", "en_US.UTF-8", or export a special RPM variable that could be overridden in your spec file. Still, the default should be something with UTF-8 in it.
I'm now trying to build a Ring 1 staging project with this change. So far I have seen one failure related to it: with en_US.UTF-8, bash ranges (like [a-z]) are case-insensitive and match more than intended. That could be solved by changing the expression, or by setting locale to C.UTF-8.
Thoughts, comments?
As others have said already: this merely indicates a deeper problem in gtk-doc, for which a global change for all scriptlets is a crude work-around at best. As you found, it has real ramifications. To see why it is a deeper problem in gtk-doc it's enough to think about the situation that some input files are in UTF-8 and some are in, let's say, SHIFT_JISX0213. No setting of $LANG will make it work, so setting $LANG is not the solution. Whatever the solution is, it must be in gtk-doc itself; it must somehow determine (even if by external knowledge) which files are in which encoding. Setting LANG to en_US.UTF-8 is a horrible idea for scripts (as you found out, collating order and ctypes get in the way as it's not ASCII compatible). Setting it to C.UTF-8 is not supported by upstream glibc (and yes it'd be nice if it would be). You need to find some other solution unfortunately. Ciao, Michael.
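The mixed-encoding situation can be sketched in Python 3 as follows. This is not how gtk-doc works; it only illustrates the idea of per-file encoding knowledge supplied from outside. The helper `read_with_declared_encodings`, the file names, and the sample strings are all hypothetical.

```python
import os
import tempfile


def read_with_declared_encodings(encodings_by_path):
    """Read each file using the encoding declared for it, not the locale."""
    texts = {}
    for path, encoding in encodings_by_path.items():
        with open(path, encoding=encoding) as handle:
            texts[path] = handle.read()
    return texts


with tempfile.TemporaryDirectory() as workdir:
    utf8_path = os.path.join(workdir, "intro.txt")
    sjis_path = os.path.join(workdir, "intro.ja.txt")
    with open(utf8_path, "wb") as handle:
        handle.write("föhn".encode("utf-8"))
    with open(sjis_path, "wb") as handle:
        handle.write("テスト".encode("shift_jisx0213"))

    # The mapping stands in for "external knowledge" about each input file.
    texts = read_with_declared_encodings(
        {utf8_path: "utf-8", sjis_path: "shift_jisx0213"}
    )
    utf8_text = texts[utf8_path]
    sjis_text = texts[sjis_path]
```

No single value of $LANG could make both reads succeed through the locale default, which is the point made above.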
participants (7)
- Brüns, Stefan
- Dominique Leuenberger / DimStar
- Jan Engelhardt
- jan matejek
- Michael Matz
- Michal Kubecek
- Neal Gompa