[opensuse] Another slightly OT c question, howto handle extended ascii chars? >127
Listmates,
I'm parsing output that has the degree symbol in it in C. The character code
for the symbol is 167, but of course the ASCII character set is limited to
0-127. Believe it or not, using a cut-n-paste into the strtok delimiter set
works, but that just feels like a kludge. Example:
const char delimiters[] = " +°,;:!-";
David C. Rankin - 17:11 29.08.09 wrote:
Listmates,
Hi,
I'm parsing output that has the degree symbol in it in C. The character code for the symbol is 167, but of course the ASCII character set is limited to 0-127.
Well, as the degree symbol is non-ASCII, its code depends on the encoding you are using.
Believe it or not, using a cut-n-paste into the strtok delimiter set works, but that just feels like a kludge. Example:
const char delimiters[] = " +°,;:!-";
token = strtok (NULL, delimiters); will break the string on ° correctly. But looking at how C is handling this kludge causes concern:
for (i=0;i
Yields:
32 + 43 ??? -62 ??? -80 , 44 ; 59 : 58 ! 33 - 45
Hmm, any time I see the little ??? character, that's a bad sign. So, is there any trick to handling the chars that are outside of the normal ASCII range, but that we seem to run into all the time? Is this the area of the thinly defined wchar_t?
You see these ??? because your terminal doesn't know how to display those bytes. I would guess that you are using iso8859-1 somewhere (probably in the sources) and utf-8 somewhere else. You don't have to worry, as C can handle non-ASCII characters very well: char holds 8 bits, so it can represent all 256 values of any single-byte encoding (like iso8859-1). wchar_t is used for wide characters that do not fit in a single byte.
-- Michal Hrusecky
Package Maintainer SUSE LINUX, s.r.o e-mail: mhrusecky@suse.cz
-- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org For additional commands, e-mail: opensuse+help@opensuse.org
On Monday 31 August 2009 05:15:58 Michal Hrusecky wrote:
David C. Rankin - 17:11 29.08.09 wrote:
Listmates,
Hi,
I'm parsing output that has the degree symbol in it in C. The character code for the symbol is 167, but of course the ASCII character set is limited to 0-127.
Well, as the degree symbol is non-ASCII, its code depends on the encoding you are using.
Believe it or not, using a cut-n-paste into the strtok delimiter set works, but that just feels like a kludge. Example:
const char delimiters[] = " +°,;:!-";
token = strtok (NULL, delimiters); will break the string on ° correctly. But looking at how C is handling this kludge causes concern:
for (i=0;i
Yields:
32 + 43 ??? -62 ??? -80 , 44 ; 59 : 58 ! 33 - 45
Hmm, any time I see the little ??? character, that's a bad sign. So, is there any trick to handling the chars that are outside of the normal ASCII range, but that we seem to run into all the time? Is this the area of the thinly defined wchar_t?
You see these ??? because your terminal doesn't know how to display those bytes. I would guess that you are using iso8859-1 somewhere (probably in the sources) and utf-8 somewhere else. You don't have to worry, as C can handle non-ASCII characters very well: char holds 8 bits, so it can represent all 256 values of any single-byte encoding (like iso8859-1). wchar_t is used for wide characters that do not fit in a single byte.
-- Michal Hrusecky
Package Maintainer SUSE LINUX, s.r.o e-mail: mhrusecky@suse.cz
Hmmm. Michal suggests you might be using iso8859-1 and utf-8 in different places. I suspect something like that too, but I also wonder if you transposed digits of the character code you reported above. "man iso_8859-1" shows the DEGREE SIGN code to be 176, not 167. A signed 8-bit value of 176 (hex B0) would be printed as -80.
The fact that your program output included -62 (hex C2) immediately before the -80 leads me to believe that the encoding for the degree symbol in the C source is UTF-8. The byte sequence 0xc2 0xb0 would be the UTF-8 encoding for the degree symbol.
If you want to reduce some of your confusion when dealing with 8-bit character codes in C, you might consider always using "unsigned char" in place of "char". That way greater than or less than comparisons will work in a less confusing manner (though not necessarily exactly the same as lexical order, depending on character coding).
-- Jim
On Monday 31 August 2009 03:30:11 pm Jim Cunning wrote:
Hmmm. Michal suggests you might be using iso8859-1 and utf-8 in different places. I suspect something like that too, but I also wonder if you transposed digits of the character code you reported above. "man iso_8859-1" shows the DEGREE SIGN code to be 176, not 167. A signed 8-bit value of 176 (hex B0) would be printed as -80.
Yep, fat fingers...
The fact that your program output included -62 (hex C2) immediately before the -80 leads me to believe that the encoding for the degree symbol in the C source is UTF-8. The byte sequence 0xc2 0xb0 would be the UTF-8 encoding for the degree symbol.
If you want to reduce some of your confusion when dealing with 8-bit character codes in C, you might consider always using "unsigned char" in place of "char". That way greater than or less than comparisons will work in a less confusing manner (though not necessarily exactly the same as lexical order, depending on character coding).
Thanks Jim:
With the unsigned char, the numbers get bigger ;-)
� 4294967234 � 4294967216
-- David C. Rankin, J.D.,P.E. Rankin Law Firm, PLLC 510 Ochiltree Street Nacogdoches, Texas 75961 Telephone: (936) 715-9333 Facsimile: (936) 715-9339 www.rankinlawfirm.com
On Monday 31 August 2009 17:18:36 David C. Rankin wrote:
On Monday 31 August 2009 03:30:11 pm Jim Cunning wrote: [...]
The fact that your program output included -62 (hex C2) immediately before the -80 leads me to believe that the encoding for the degree symbol in the C source is UTF-8. The byte sequence 0xc2 0xb0 would be the UTF-8 encoding for the degree symbol.
If you want to reduce some of your confusion when dealing with 8-bit character codes in C, you might consider always using "unsigned char" in place of "char". That way greater than or less than comparisons will work in a less confusing manner (though not necessarily exactly the same as lexical order, depending on character coding).
Thanks Jim:
With the unsigned char, the numbers get bigger ;-)
� 4294967234 � 4294967216
I also forgot to mention that arguments passed to printf for %d are promoted to (signed or unsigned) int. I don't remember where it's specified, but C does sign extension when a negative signed value is widened, so b'10110000' becomes b'11111111111111111111111110110000'. Thus, 4294967216 is the 32-bit unsigned equivalent of -80. You can change the output you get by adding " & 0xff" to every integer argument you really only want the lower 8 bits of.
-- Jim
On Saturday 29 August 2009 17:11:23 David C. Rankin wrote:
Listmates,
I'm parsing output that has the degree symbol in it in C. The character code for the symbol is 167, but of course the ASCII character set is limited to 0-127. Believe it or not, using a cut-n-paste into the strtok delimiter set works, but that just feels like a kludge. Example:
Find the Unicode code point. Use a universal character name. (In this case, \u00B0, I think.) Any universal character name that is \u00FF or less can be a char; \u0100 and above have to be in a wchar_t. Using multi-byte characters in the source results in implementation-specific behavior, and I don't know if strtok handles multi-byte characters properly. See sections 6.4.3, 6.4.4.4.1, 6.4.4.4.9, and 6.4.5.1 in ISO/IEC 9899:1999 to begin with.
-- Boyd Stephen Smith Jr. bss@iguanasuicide.net ICQ: 514984 YM/AIM: DaTwinkDaddy http://iguanasuicide.net/
participants (4)
- Boyd Stephen Smith Jr.
- David C. Rankin
- Jim Cunning
- Michal Hrusecky