[yast-devel] YCP String operator [] and UTF-8
Hi all,

This is just a note in case you come across a strange UTF-8 string problem in YCP (maybe you already know this, but for me it was quite a surprise):

The YCP string operator [] takes a _byte_ index into the string, while size(string) returns the _number_of_characters_. The problem arises when you combine the two: the result will probably be buggy, see https://bugzilla.novell.com/show_bug.cgi?id=728588

Example from the bug:

size("áa") => 2,

but

"áa"[1] is not "a" as expected but the second _byte_ of the string, which is one half of the "á" UTF-8 character. If you remove it, you'll get garbage in the string...

Keep this in mind when iterating over YCP strings...

(I'm not sure whether fixing YCPString::[] would be a good idea, it might break something else. Martin?)

--
Ladislav Slezák
Appliance department / YaST Developer
Lihovarská 1060/12
190 00 Prague 9 / Czech Republic
tel: +420 284 028 960
lslezak@suse.com
SUSE

--
To unsubscribe, e-mail: yast-devel+unsubscribe@opensuse.org
To contact the owner, e-mail: yast-devel+owner@opensuse.org
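The mismatch between a character count and a byte index can be reproduced outside YCP. The following is only a sketch in Python, used as a stand-in because it can be run anywhere: `str` plays the role of a character-counting view and `bytes` the role of a byte-indexed view. None of the variable names come from YCP or yast2-core.

```python
# Python sketch (not YCP): `str` counts characters, `bytes` is indexed by byte.
text = "\u00e1a"                 # the string "áa" from the bug report
raw = text.encode("utf-8")       # 'á' occupies two bytes, 'a' one

char_count = len(text)           # analogous to size("áa")
byte_count = len(raw)            # one more than the character count

second_char = text[1]            # character index 1 is 'a'
second_byte = raw[1:2]           # byte index 1 is the trailing half of 'á'

# Removing a single byte of a multi-byte character leaves invalid UTF-8.
damaged = raw[:1] + raw[2:]
try:
    damaged.decode("utf-8")
    is_valid = True
except UnicodeDecodeError:
    is_valid = False
```

Mixing `char_count` with indexes into `raw` is the combination the note above warns about: the two values only agree for pure ASCII strings.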
On Fri, 30 Mar 2012 17:01:04 +0200 Ladislav Slezak <lslezak@suse.cz> wrote:
Hi all,
This is just a note for you in case you came across a strange UTF-8 string problem in YCP (maybe you already know that but for me it was quite a surprise):
YCP string operator [] takes _byte_ index in the string, while size(string) returns _number_of_characters_. The problem is when you combine both functions, the result will be probably buggy, see https://bugzilla.novell.com/show_bug.cgi?id=728588
Example from the bug:
size("áa") => 2,
but
"áa"[1] is not "a" as expected but the second _byte_ of the string which is one half of the "á" UTF-8 character, if you remove it you'll get garbage in the string...
Keep this in your mind when iterating over YCP strings...
(I'm not sure whether fixing YCPString::[] would be a good idea, it might break something else. Martin?)
When I compare this behaviour to other languages, like wstring in C++ or Ruby strings, I find it really strange. operator[] is expected to return the element at the given position. For a lot of programmers who don't have YCP as their first language, a string is an array of characters (not bytes). So I strongly vote for changing this behaviour.

Josef
Hello, On Mar 30 17:01 Ladislav Slezak wrote (excerpt):
YCP string operator [] takes _byte_ index in the string, while size(string) returns _number_of_characters_. The problem is when you combine both functions, the result will be probably buggy, see https://bugzilla.novell.com/show_bug.cgi?id=728588
Could a YCP expert show the correct way (example code) to remove a UTF-8 sub-string from a UTF-8 string? I.e. how to remove "Bar" from "FooBarBaz" in a UTF-8-safe way?

More generally: Is there documentation on how to work with UTF-8 strings in YCP in general?

Kind Regards
Johannes Meixner
--
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer
Dne 3.4.2012 09:37, Johannes Meixner napsal(a):
Could a YCP expert show the correct way (example code) how to remove an UTF8 sub-string from an UTF8 string? I.e. how to remove "Bar" from "FooBarBaz" in an UTF8-safe way?
More generally: Is there documentation how to work on UTF8 strings with YCP in general?
Um, it seems that the documentation does not mention the [] behavior for strings (neither http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/bracket.html nor http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/id_ycp_data_string.ht...).

I fixed that particular bug by using regexpsub() instead of iterating over the string and using the [] operator. So I guess the regexp*() functions are UTF-8 safe.

I'm not sure what the correct solution is. Maybe the correct way is to fix the [] operator after all... That's why I opened this discussion: maybe someone is relying on the current "buggy" behavior... (I don't expect that, but I'd like to avoid regressions if possible.)

--
Best Regards
Ladislav Slezák
Yast Developer
SUSE LINUX, s.r.o., Lihovarská 1060/12, 190 00 Prague 9, Czech Republic
e-mail: lslezak@suse.cz  tel: +420 284 028 960  fax: +420 284 028 951
http://www.suse.cz/
* Ladislav Slezak <lslezak@suse.cz> [Apr 03. 2012 10:08]:
I'm not sure what the correct solution is. Maybe the correct way is to fix the [] operator after all...
In YCP, all strings are supposed to be UTF-8 encoded. This is how YaST was designed from the beginning.
From that perspective, fixing the [] operator for strings is the right solution.
Regards,
Klaus
---
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 16746 (AG Nürnberg)
Maxfeldstraße 5, 90409 Nürnberg, Germany
Dne 3.4.2012 10:15, Klaus Kaempf napsal(a):
* Ladislav Slezak<lslezak@suse.cz> [Apr 03. 2012 10:08]:
I'm not sure what the correct solution is. Maybe the correct way is to fix the [] operator after all...
In YCP, all strings are supposed to be UTF-8 encoded. This is how YaST was designed from the beginning. From that perspective, fixing the [] operator for strings is the right solution.
Ooops, I just realized that the problem is actually in the substring() function; the [] operator works only with lists, maps or terms in YCP. (I'm using so many languages...) I used substring() to get one character. So the problematic call is actually:

substring("áa", 1, 1);

which returns "\0xF1" instead of "a" as I expected.

The documentation does not say whether the substring() arguments are in bytes or characters. http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/substring-rest.html

So any opinions on changing this call? Is the UTF-8 assumption also valid here?

Thanks!

--
Best Regards
Ladislav Slezák
* Ladislav Slezak <lslezak@suse.cz> [Apr 03. 2012 11:10]:
Ooops, I just realized that the problem is actually in substring() function, [] operator works only with lists, maps or terms in YCP. (I'm using so many languages...).
:-)
I used substring() to get one character. So the problematic call is actually:
substring("áa", 1, 1);
which returns "\0xF1" instead of "a" as I expected.
The documentation does not tell whether the substring() argument units are in bytes or characters. http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/substring-rest.html
So any opinions on changing this call? Is the UTF-8 assumption also valid here?
Yes. sub_string_ is operating on strings and strings are defined to be UTF-8 encoded.

Regards,
Klaus
Dne 3.4.2012 11:33, Klaus Kaempf napsal(a):
So any opinions on changing this call? Is the UTF-8 assumption also valid here?
Yes. sub_string_ is operating on strings and strings are defined to be UTF-8 encoded.
Thanks for the feedback, reported as bnc#755414 (https://bugzilla.novell.com/show_bug.cgi?id=755414)

--
Best Regards
Ladislav Slezák
On Tue, Apr 03, 2012 at 11:33:09AM +0200, Klaus Kaempf wrote:
* Ladislav Slezak <lslezak@suse.cz> [Apr 03. 2012 11:10]:
I used substring() to get one character. So the problematic call is actually:
substring("áa", 1, 1);
which returns "\0xF1" instead of "a" as I expected.
The documentation does not tell whether the substring() argument units are in bytes or characters. http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/substring-rest.html
So any opinions on changing this call? Is the UTF-8 assumption also valid here?
Yes. sub_string_ is operating on strings and strings are defined to be UTF-8 encoded.
Generally I agree that strings in YCP are UTF-8 encoded and functions should respect this.

But simply fixing the functions might require converting from UTF-8 to wstring and back in every function, and that sounds very costly. E.g. the size function in YCP converts the string to wstring. When I noticed that and saw how many times size(string) == 0 is used, I added an isempty function to YCP.

It could be that using wstring internally in YCPString is the better solution.

Regards,
Arvin
--
Arvin Schnell, <aschnell@suse.de>
Senior Software Engineer, Research & Development
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 16746 (AG Nürnberg)
Maxfeldstraße 5 90409 Nürnberg Germany
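The cost argument above can be illustrated with a small sketch. It is written in Python rather than C++ or YCP, and `char_count` and `is_empty` are hypothetical names chosen for this illustration; they are not the yast2-core functions. The point is only that counting characters needs a pass over the whole string, while an emptiness check does not.

```python
def char_count(raw: bytes) -> int:
    """Count characters: requires decoding all of the UTF-8 data."""
    return len(raw.decode("utf-8"))


def is_empty(raw: bytes) -> bool:
    """Emptiness needs no decoding: zero bytes means zero characters."""
    return len(raw) == 0


help_text = "sch\u00f6ner W\u00fcrfel".encode("utf-8")   # "schöner Würfel"

# Both checks agree, but the second never inspects the contents.
empty_via_count = char_count(help_text) == 0
empty_via_bytes = is_empty(help_text)
```

This is the same trade-off as replacing `size(string) == 0` with a dedicated emptiness test: the answer is identical, only the work done to obtain it differs.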
On Tue, 3 Apr 2012 11:58:16 +0200 Arvin Schnell <aschnell@suse.de> wrote:
On Tue, Apr 03, 2012 at 11:33:09AM +0200, Klaus Kaempf wrote:
* Ladislav Slezak <lslezak@suse.cz> [Apr 03. 2012 11:10]:
I used substring() to get one character. So the problematic call is actually:
substring("áa", 1, 1);
which returns "\0xF1" instead of "a" as I expected.
The documentation does not tell whether the substring() argument units are in bytes or characters. http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/substring-rest.html
So any opinions on changing this call? Is the UTF-8 assumption also valid here?
Yes. sub_string_ is operating on strings and strings are defined to be UTF-8 encoded.
Generally I agree that strings in YCP are UTF-8 encoded and functions should respect this.
But simply fixing the functions might require converting from UTF-8 to wstring and back in every function and that sounds very costly. E.g. the size functions in YCP converts the string to wstring. When I noticed that and saw how many time size(string) == 0 is used I added an isempty function in YCP.
Could be that using wstring internally in YCPString is the better solution.
I absolutely agree. If every string in YCP is a UTF-8 string, then not using wstring doesn't make much sense to me. Of course we need to check what depends on it, but I think it should mainly be the various bindings. Other parts of the code should not care what the internal representation is.

Josef
On Tue, Apr 03, 2012 at 11:07:57AM +0200, Ladislav Slezak wrote:
Dne 3.4.2012 10:15, Klaus Kaempf napsal(a):
Ooops, I just realized that the problem is actually in substring() function, [] operator works only with lists, maps or terms in YCP. (I'm using so many languages...).
I used substring() to get one character. So the problematic call is actually:
substring("áa", 1, 1);
which returns "\0xF1" instead of "a" as I expected.
The documentation does not tell whether the substring() argument units are in bytes or characters. http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/substring-rest.html
YCP has the function lsubstring: http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/lsubstring-rest.html

For many functions, though, there is no UTF-8 aware version, e.g. search, tolower, deletechars, findfirstof, ...

Regards,
Arvin
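The difference between a byte-oriented and a character-oriented substring can be sketched as follows. This is Python, not YCP, and `byte_substring` and `char_substring` are hypothetical helpers that only mimic the behaviour the thread attributes to substring() and lsubstring(); they are not the real builtins.

```python
def byte_substring(raw: bytes, offset: int, length: int) -> bytes:
    """Offsets and lengths counted in bytes (the behaviour reported for substring())."""
    return raw[offset:offset + length]


def char_substring(raw: bytes, offset: int, length: int) -> bytes:
    """Offsets and lengths counted in characters (the behaviour described for lsubstring())."""
    text = raw.decode("utf-8")
    return text[offset:offset + length].encode("utf-8")


raw = "\u00e1a".encode("utf-8")          # "áa"

by_bytes = byte_substring(raw, 1, 1)     # a lone continuation byte
by_chars = char_substring(raw, 1, 1)     # the character 'a'
```

In UTF-8, "á" (U+00E1) is the byte pair C3 A1, so the byte-wise call in this sketch returns the single byte A1, which is not a valid string on its own.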
Dne 3.4.2012 15:39, Arvin Schnell napsal(a):
YCP has the function lsubstring:
http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/lsubstring-rest.html
Yeah, yes, thanks for pointing it out!

But shouldn't the default behavior be the opposite, i.e. the default substring() should be UTF-8 aware, with an extra function for byte operations?

My quick grep in SVN trunk found just one (!!) usage of the lsubstring() function in all YCP code. That's suspicious; I guess there could be other misused substring() calls...

Well, changing the default behavior is very likely not an option as we have the UTF-8 aware call. So should we do a substring() usage audit? Check all usages? What do you think?

(Actually the problem is not so critical, the "bug" has been there for ages, but IMHO we should do something about it...)

--
Best Regards
Ladislav Slezák
Hello, On Apr 3 18:19 Ladislav Slezak wrote (excerpt):
Dne 3.4.2012 15:39, Arvin Schnell napsal(a):
YCP has the function lsubstring: http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/lsubstring-rest.html ... But shouldn't be the default behavior opposite?, i.e. the default substring() should be UTF-8 aware and have an extra function for byte operation?
See what Klaus Kaempf already wrote in this thread: ------------------------------------------------------------------------- In YCP, all strings are supposed to be UTF-8 encoded. This is how YaST was designed from the beginning. -------------------------------------------------------------------------
... Is the UTF-8 assumption also valid here? Yes. sub_string_ is operating on strings and strings are defined to be UTF-8 encoded.
This is also what is documented at http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/id_ycp_data_string.ht...
-------------------------------------------------------------------------
String constants consist of UNICODE characters encoded in UTF8.
-------------------------------------------------------------------------

Therefore substring() must work on UTF-8 strings, and an extra function for byte operations on strings might optionally exist.

But what is a practical use case for byte operations on strings? A byte operation on a UTF-8 encoded string means that a multibyte character gets split into several one-byte values, each of which is meaningless on its own, so I cannot imagine a real use case for byte operations on UTF-8 encoded strings.

To extract ASCII characters from a UTF-8 encoded string there is toascii("UTF-8 encoded string")

If byte operations on strings were a useful functionality: perhaps the [] operator, which is currently not available for strings, could be used to implement them?

By the way: I wonder what tolist("UTF-8 encoded string") or (list)"UTF-8 encoded string" would result in. Would it be a list of byte values, where a multibyte character gets split into several one-byte list elements, or a list of integer values, where a multibyte character becomes a single integer list element, or anything else? Perhaps (list<byteblock>)"a UTF-8 encoded string" results in the string split into byteblocks, where each character becomes a single byteblock list element (i.e. a multibyte character becomes a single byteblock list element)?
My quick grep in SVN trunk found just one (!!) usage of lsubstring() function in all YCP code. That's suspicious, I guess there could be other misused substring() calls...
Because strings in YCP are UTF-8 encoded, there cannot be a misuse of substring("UTF-8 encoded string"), and the only "misuse", i.e. "misunderstanding of what strings are in YCP", is from my point of view that lsubstring() exists at all.

Kind Regards
Johannes Meixner
On Tue, Apr 03, 2012 at 10:06:18AM +0200, Ladislav Slezak wrote:
Dne 3.4.2012 09:37, Johannes Meixner napsal(a):
Could a YCP expert show the correct way (example code) how to remove an UTF8 sub-string from an UTF8 string? I.e. how to remove "Bar" from "FooBarBaz" in an UTF8-safe way?
More generally: Is there documentation how to work on UTF8 strings with YCP in general?
Um, it seems that the documentation does not mention the [] behavior for string (http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/bracket.html nor http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/id_ycp_data_string.ht...).
I fixed that particular bug by using regexpsub() instead of iterating over the string and using [] operator. So I guess the regexp*() functions are UTF-8 safe.
I'm not sure what the correct solution is. Maybe the correct way is to fix the [] operator after all... That's why I have opened this discussion, because maybe someone is relaying on the current "buggy" behavior... (I don't expect that but I'd like to avoid regressions if possible.)
Unfortunately this can be the case since most other string functions also do not respect UTF-8. E.g. splitting a string at a space by using search and substring works correctly with substring since search is also byte-oriented.

Program:

  string s = "schöner Würfel";
  integer i = search(s, " ");

  y2milestone("substring '%1' '%2'", substring(s, 0, i), substring(s, i + 1));
  y2milestone("lsubstring '%1' '%2'", lsubstring(s, 0, i), lsubstring(s, i + 1));

Output:

  test1.ycp:6 substring 'schöner' 'Würfel'
  test1.ycp:7 lsubstring 'schöner ' 'ürfel'

So, if substring is fixed, all the other functions must be fixed as well.

Regards,
Arvin
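The effect shown in the program above can be reproduced with a short Python sketch (Python is only a stand-in for YCP here). The byte offset found by a byte-oriented search is consistent with byte-oriented slicing but not with character-oriented slicing; all variable names are chosen for this illustration.

```python
s = "sch\u00f6ner W\u00fcrfel"        # "schöner Würfel"
raw = s.encode("utf-8")

# Byte-oriented search, like the behaviour described for search().
i = raw.find(b" ")

# Byte offset + byte slicing: consistent units, correct split.
byte_split = (raw[:i].decode("utf-8"), raw[i + 1:].decode("utf-8"))

# Byte offset + character slicing: mixed units, the split drifts by one
# position for every multi-byte character before the offset.
mixed_split = (s[:i], s[i + 1:])

# Character offset + character slicing: consistent units again.
j = s.find(" ")
char_split = (s[:j], s[j + 1:])
```

Either consistent pairing gives the right answer, which is why changing only one function of a byte-oriented family would break code that currently works.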
On Thu, 5 Apr 2012 10:56:34 +0200 Arvin Schnell <aschnell@suse.de> wrote:
On Tue, Apr 03, 2012 at 10:06:18AM +0200, Ladislav Slezak wrote:
Dne 3.4.2012 09:37, Johannes Meixner napsal(a):
Could a YCP expert show the correct way (example code) how to remove an UTF8 sub-string from an UTF8 string? I.e. how to remove "Bar" from "FooBarBaz" in an UTF8-safe way?
More generally: Is there documentation how to work on UTF8 strings with YCP in general?
Um, it seems that the documentation does not mention the [] behavior for string (http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/bracket.html nor http://doc.opensuse.org/projects/YaST/openSUSE11.3/tdg/id_ycp_data_string.ht...).
I fixed that particular bug by using regexpsub() instead of iterating over the string and using [] operator. So I guess the regexp*() functions are UTF-8 safe.
I'm not sure what the correct solution is. Maybe the correct way is to fix the [] operator after all... That's why I have opened this discussion, because maybe someone is relaying on the current "buggy" behavior... (I don't expect that but I'd like to avoid regressions if possible.)
Unfortunately this can be the case since most other string functions also do not respect UTF-8. E.g. splitting a string at a space by using search and substring works correctly with substring since search is also byte-oriented.
Program:

  string s = "schöner Würfel";
  integer i = search(s, " ");

  y2milestone("substring '%1' '%2'", substring(s, 0, i), substring(s, i + 1));
  y2milestone("lsubstring '%1' '%2'", lsubstring(s, 0, i), lsubstring(s, i + 1));

Output:

  test1.ycp:6 substring 'schöner' 'Würfel'
  test1.ycp:7 lsubstring 'schöner ' 'ürfel'
So, if substring is fixed all other functions must also.
I still think that the main problem is that we want Unicode strings in YaST, but we use string in the backend. I think that now is the right time to try to switch the string implementation to wstring and to be ready for such a change. I can create a set of patches, if someone is interested in it, and test it.

Josef
On Thu, Apr 05, 2012 at 11:02:35AM +0200, Josef Reidinger wrote:
I still think that main problem is that we want unicode strings in YaST, but we use in backend string. I think that now it is right time to try to switch string implementation to wstring and be ready to such change. I can create set of patches if someone is interested in it and test it.
I also had a look at this. But changing the main member variable of YCPString from std::string to std::wstring is not trivially possible, since the function value_cstr exposes it and that function is used not only in yast2-core. I also noticed that glibc does not provide regex functions for wchar_t. Here we would have to use boost, adding a dependency on libboost_regex.

So maybe just fixing the few builtins is easier. My idea is to use std::string operations when the YCPStrings are ASCII, std::wstring otherwise. I checked how many YCPStrings are used and how many of these are non-ASCII: starting yast2 storage (in a German locale) constructs about 15000 YCPStrings, of which only 400 are non-ASCII (almost all help texts). So the idea seems sane.

You can find a patch in ~aschnell/Export/yast2-core-utf8.1.diff. Be aware that it changes the ABI of libycp, so you have to recompile a lot of stuff.

Comments?

Arvin
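The ASCII fast-path idea can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the contents of the patch mentioned above: `HybridString`, its attributes, and its methods are hypothetical names, and Python's `bytes`/`str` stand in for std::string/std::wstring.

```python
class HybridString:
    """Hypothetical sketch: byte operations for ASCII data, decoding otherwise."""

    def __init__(self, raw: bytes):
        self.raw = raw
        # Checked once at construction; for ASCII data one byte is one character.
        self.is_ascii = raw.isascii()

    def size(self) -> int:
        if self.is_ascii:
            return len(self.raw)                  # cheap path, no conversion
        return len(self.raw.decode("utf-8"))      # character-aware path

    def substring(self, offset: int, length: int) -> bytes:
        if self.is_ascii:
            return self.raw[offset:offset + length]
        text = self.raw.decode("utf-8")
        return text[offset:offset + length].encode("utf-8")


plain = HybridString(b"FooBarBaz")
accented = HybridString("\u00e1a".encode("utf-8"))   # "áa"
```

With the ratio reported above (about 400 non-ASCII strings out of 15000), most instances in such a design would stay on the cheap path, while the few non-ASCII ones would get character-correct results.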
Hi, On Tue, Apr 10, Arvin Schnell wrote:
On Thu, Apr 05, 2012 at 11:02:35AM +0200, Josef Reidinger wrote:
I still think that main problem is that we want unicode strings in YaST, but we use in backend string. I think that now it is right time to try to switch string implementation to wstring and be ready to such change. I can create set of patches if someone is interested in it and test it.
I also had a look at this. But changing the main member variable of YCPString from std::string to std::wstring is not trivial possible since the function value_cstr exposes it and that function is used not only in yast2-core. I also noticed that glibc does not provide regex functions for wchar_t. Here we would have to use boost adding a dependencies on libboost_regex.
So maybe just fixing the few builtins is easier. My idea is to use std::string operations when the YCPStrings are ASCII, std::wstring otherwise. I checked how many YCPString are used and how many of these are non-ascii: Starting yast2 storage (in german locale) constructs about 15000 YCPStrings of which only 400 are non-ascii (almost all help texts). So the idea seems sane.
You can find a patch in ~aschnell/Export/yast2-core-utf8.1.diff. Be aware that it changes to ABI of libycp, so you have to recompile a lot of stuff.
Comments?
I would support this approach. The diff is not really small, but this approach minimizes the impact of the change on other parts of yast2 while not introducing too big penalties regarding performance and memory requirements for the majority of ASCII strings.

Tschuess,
Thomas Fehr
--
Thomas Fehr, SuSE Linux Products GmbH, Maxfeldstr. 5, 90409 Nuernberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 16746 (AG Nürnberg)
Tel: +49-911-74053-0, Fax: +49-911-74053-482, Email: fehr@suse.de
GPG public key available.
participants (6)
- Arvin Schnell
- Johannes Meixner
- Josef Reidinger
- Klaus Kaempf
- Ladislav Slezak
- Thomas Fehr