On Friday, 27 October 2017 17:41:23 CEST Jan Engelhardt wrote:
On Friday 2017-10-27 16:13, jan matejek wrote:
More and more packages need their locale to be set to something more sensible than C. This hit me while switching packages over to Python 3. Python gets its default encoding from the locale, so by default it won't decode UTF-8 unless an appropriate locale is set.
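To make the locale dependency concrete, here is a minimal sketch. It assumes the problem shows up in `open()` calls that omit the `encoding` argument and therefore fall back to the locale's preferred encoding. The helper names `read_utf8` and `write_sample` are made up for this illustration.

```python
import locale
import os
import tempfile


def read_utf8(path):
    """Read a text file as UTF-8 regardless of the process locale."""
    with open(path, encoding="utf-8") as fh:
        return fh.read()


def write_sample(text):
    """Write text as UTF-8 bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as fh:
        fh.write(text.encode("utf-8"))
    return path


if __name__ == "__main__":
    # Encoding that open() falls back to when none is given. Under a C
    # locale on older Python 3 releases this is typically an ASCII codec.
    print("locale default:", locale.getpreferredencoding(False))
    path = write_sample("Mat\u011bjek")
    try:
        # ascii() keeps the output printable even on an ASCII-only stdout.
        print(ascii(read_utf8(path)))
    finally:
        os.remove(path)
```

Passing the encoding explicitly sidesteps the locale for file reads, but it does not help with source files or with third-party code that relies on the default.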
I am not convinced that changing the rpm default helps. FWIW, a large portion of source files could be ISO-8859-1 (Windows still is a thing, and a popular one at that) so that LC=UTF-8 on a global scale would not help.
The only real fix is to use an in-file marker so that the file becomes self-describing, and there are sufficient examples in history of how to pull that off:
- Byte Order Mark to determine UTF-8, UTF-{16,32}{BE,LE}
The byte order mark is ambiguous: the UTF-8 BOM is three bytes (EF BB BF), each of which is also a valid character in e.g. ISO-8859-1. Granted, it is unlikely, but ...
- <?xml encoding="..." ?> in XML
- <meta> in HTML likewise
- "use utf8" in Perl (or something similar to it)
- "# -*- coding: utf-8 -*-" in Python (PEP-0263)
All these guarantee the content up to and including the encoding specification is plain ASCII, so this is completely unambiguous and sane.
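For contrast with the ASCII-only markers, the byte order mark objection raised above can be checked directly. This is only a sketch; `starts_with_utf8_bom` is a hypothetical helper, not something from the thread.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the data begins with the three UTF-8 BOM bytes."""
    return data.startswith(codecs.BOM_UTF8)


# The BOM bytes EF BB BF are ordinary printable characters in ISO-8859-1.
bom_as_latin1 = codecs.BOM_UTF8.decode("iso-8859-1")

# A genuine ISO-8859-1 file that happens to start with those characters
# is indistinguishable from a UTF-8 file with a BOM.
latin1_file = bom_as_latin1.encode("iso-8859-1") + b"rest of file"

# The "utf-8-sig" codec strips a leading BOM; plain "utf-8" keeps it as U+FEFF.
with_bom = codecs.BOM_UTF8 + b"abc"

if __name__ == "__main__":
    print(starts_with_utf8_bom(latin1_file))
    print(ascii(with_bom.decode("utf-8")), ascii(with_bom.decode("utf-8-sig")))
```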
So there you have it. If Python falls over on UTF-8 files (I know Perl would), then those source files should say they are UTF-8. And those that are ISO-8859-1 should say they are iso-8859-1.
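As a rough illustration of a self-describing source file, the standard library's `tokenize.detect_encoding` reads the PEP 263 declaration from the first two lines of a Python file. The wrapper `declared_encoding` and the sample byte strings are assumptions made up for this sketch.

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding a Python source file declares for itself.

    Falls back to 'utf-8' when no coding declaration is present, which is
    the Python 3 default for source files.
    """
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# A Latin-1 source file that says so: byte E9 is an accented "e" in ISO-8859-1.
latin1_source = b"# -*- coding: iso-8859-1 -*-\nname = 'Ren\xe9'\n"
utf8_source = b"# -*- coding: utf-8 -*-\nname = 'Ren\xc3\xa9'\n"

if __name__ == "__main__":
    for src in (latin1_source, utf8_source):
        enc = declared_encoding(src)
        # Decoding with the declared encoding recovers the same text from
        # both files, independent of the process locale.
        print(enc, ascii(src.decode(enc).splitlines()[1]))
```

Because the declaration line itself is plain ASCII, it can be read before the encoding is known.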
Keeping the locale at C would at least identify the important spots, as Python would stop execution there.
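The difference between failing loudly and guessing can be sketched as follows. It assumes the C locale behaves like a strict ASCII decoder, which is a simplification, and the function names are hypothetical.

```python
def decode_strict(data: bytes) -> str:
    """Decode as ASCII; raises UnicodeDecodeError on any non-ASCII byte."""
    return data.decode("ascii")


def decode_guess(data: bytes) -> str:
    """Decode as ISO-8859-1; never fails, since every byte maps to a character."""
    return data.decode("iso-8859-1")


utf8_bytes = "Mat\u011bjek".encode("utf-8")

if __name__ == "__main__":
    try:
        decode_strict(utf8_bytes)
    except UnicodeDecodeError as exc:
        # The error points at the exact spot that needs an explicit encoding.
        print("strict decoding stopped:", exc)

    # The guess succeeds silently but yields the wrong text.
    print(ascii(decode_guess(utf8_bytes)))
```

The strict path stops at the first problem byte, while the guess hands corrupted text to whatever runs next.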
Seconded, most utf-8 documents are also valid when interpreted as iso-8859-x, so guessing is a bad idea.

Kind regards,
Stefan