Re: [m17n] GCC and Unicode

25 Aug 2004

      Fantastic - this answers my question - thanks!

Stephen Holmes
Localisation Manager, Novell

Tel: +353-1-605-8087
Cell: +353-833-5027
Fax: +353-1-605-8200

Novell Inc., the leading provider of information solutions
"This e-mail was sent from a 100% Linux desktop. We're back, we're loud, and we're serious"
Mike FABIAN <mfabian@suse.de> 08/23/04 3:43 pm >>>
Stephen Holmes <sholmes@novell.com> wrote:
This isn't directly related to SUSE - I just happen to be using it,
but I had a question about GCC's handling of an input file encoded
as UTF-8.  If I have something like
char * lstr = "Schöne Grüße"
Will the compiler compile this to UCS-4 internally, or just leave it
as a byte sequence?
Here is an answer from our compiler expert Michael Matz
<matz@suse.de>:

matz> For string literals basically the byte values are transferred literally
matz> from the input file to the object file.  (C99 is more complicated than
matz> this, but with UTF-8 files this is what it boils down to).
matz>
matz> With newer GCC versions (3.4 onward) normal string literals have their own
matz> character set, which can be chosen by the user.  By default this also is
matz> UTF-8, so the behaviour is the same, i.e. UTF-8 strings are copied
matz> literally to the object file.
matz>
matz> This is different with _wide_ character literals, i.e. strings of the form
matz> L"Blabla".  With GCC 3.3 the input bytes are simply widened, i.e. such a
matz> wide character literal contains the same bytes as the UTF-8 encoding,
matz> except that each byte is extended to 32 bits. With GCC 3.5 this is
matz> different.  There the input characters (in UTF-8) are converted into the
matz> wide character set, which is UTF-32 (i.e. UCS-4) by default on machines
matz> where wchar_t is 32 bits wide (i.e. all Linux platforms).
matz>
matz> Hence, with gcc 3.3 there is no recoding involved.  UTF-8 strings will be
matz> placed into the object file.
matz>
matz> With newer gcc there is recoding for _wide_ string literals, from UTF-8 to
matz> UCS-4 (i.e. 32 bit per character).  Without user intervention there is no
matz> recoding for normal string literals, i.e. UTF-8 strings remain such.
matz>
matz> > If I have something like
matz> >
matz> >       char * lstr = "Schöne Grüße"
matz>
matz> So, given this (assuming the input is UTF-8):
matz>   char *str = "Schöne Grüße";
matz>   wchar_t *wstr = L"Schöne Grüße";
matz>
matz> With our compiler, 'str' will be in UTF-8 and 'wstr' will be in something
matz> strange (UTF-8, but with each byte extended to 32 bits).
matz>
matz> With newer compilers and no special options while compiling, 'str' will
matz> be in UTF-8 and 'wstr' will be in UCS-4.
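
To check which of the two behaviours a given compiler has, something like the following rough sketch could be used. It is not part of Michael's answer; the helper names (narrow_length, wide_length, narrow_at, wide_at, check_recoded_layout) are made up for illustration, and it assumes the source file is saved as UTF-8 and compiled without any charset options.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Assumption: this file is saved as UTF-8 and compiled without any
   charset options, so both literals start out as UTF-8 input. */
static const char    *str  = "Schöne Grüße";
static const wchar_t *wstr = L"Schöne Grüße";

/* Bytes in the narrow literal: 12 characters, three of which
   (ö, ü, ß) take two bytes each in UTF-8. */
size_t narrow_length(void) { return strlen(str); }

/* Number of wchar_t elements in the wide literal. */
size_t wide_length(void) { return wcslen(wstr); }

/* Element at index i of each literal, as an unsigned value. */
unsigned long narrow_at(size_t i) { return (unsigned char)str[i]; }
unsigned long wide_at(size_t i)   { return (unsigned long)wstr[i]; }

/* What a compiler that recodes wide literals (the "newer GCC" case
   described above) should produce. */
void check_recoded_layout(void)
{
    assert(narrow_length() == 15);                        /* UTF-8 bytes */
    assert(narrow_at(3) == 0xC3 && narrow_at(4) == 0xB6); /* 'ö' in UTF-8 */
    assert(wide_length() == 12);                          /* one element per character */
    assert(wide_at(3) == 0xF6);                           /* U+00F6, 'ö' */
}
```

Calling check_recoded_layout() from a small main should pass silently on a compiler that recodes wide literals. With the byte-widening behaviour described for GCC 3.3, one would instead expect wide_length() to be 15 and wide_at(3) to be 0xC3, so the assertions on the wide literal would fail. The character sets themselves can be overridden with the GCC options -fexec-charset and -fwide-exec-charset.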

--
Mike FABIAN   <mfabian@suse.de>   http://www.suse.de/~mfabian

