"Stephen Holmes" wrote:
This isn't directly related to SUSE - I just happen to be using it,
but I had a question about GCC's handling of an input file encoded
as UTF-8. If I have something like
char * lstr = "Schöne Grüße";
Will the compiler compile this to UCS-4 internally, or just leave it
as a byte sequence?
Here is an answer from our compiler expert Michael Matz:
matz> For string literals the byte values are basically transferred literally
matz> from the input file to the object file. (C99 is more complicated than
matz> this, but with UTF-8 files this is what it boils down to.)
matz>
matz> With newer GCC versions (3.4 onward) normal string literals have their own
matz> character set, which can be chosen by the user. By default this is also
matz> UTF-8, so the behaviour is the same, i.e. UTF-8 strings are copied
matz> literally to the object file.
matz>
matz> This is different with _wide_ character literals, i.e. strings of the form
matz> L"Blabla". With GCC 3.3 the input bytes are simply widened, i.e. such a
matz> wide character literal contains the same bytes as the UTF-8 encoding,
matz> except that each byte is extended to 32 bits. With GCC 3.4 and later this is
matz> different. There the input characters (in UTF-8) are converted into the
matz> wide character set, which is UTF-32 (i.e. UCS-4) by default on machines
matz> where wchar_t is 32 bits wide (i.e. all Linux platforms).
matz>
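To make the difference between "widening" and "converting" concrete, here is a minimal C sketch of the two behaviours described above. The function names `widen_bytes` and `decode_utf8` are made up for illustration and have nothing to do with GCC's internals, and the decoder assumes well-formed UTF-8 and does no validation. The input is written with hex escapes so that the result does not depend on how this example itself is encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC 3.3 style "widening": every input byte becomes one 32-bit unit.
   Returns the number of units written (at most cap). */
size_t widen_bytes(const char *utf8, uint32_t *out, size_t cap)
{
    size_t n = 0;
    assert(utf8 != NULL && out != NULL);
    for (const unsigned char *p = (const unsigned char *)utf8; *p && n < cap; ++p)
        out[n++] = *p;
    return n;
}

/* Newer style "conversion": decode UTF-8 into UCS-4 code points.
   Assumption: the input is well-formed UTF-8; nothing is validated here. */
size_t decode_utf8(const char *utf8, uint32_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t n = 0;
    assert(utf8 != NULL && out != NULL);
    while (*p && n < cap) {
        uint32_t cp;
        int extra;  /* number of continuation bytes that follow the lead byte */
        if (*p < 0x80)      { cp = *p;        extra = 0; }
        else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
        else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
        else                { cp = *p & 0x07; extra = 3; }
        ++p;
        while (extra-- > 0 && *p)
            cp = (cp << 6) | (*p++ & 0x3F);
        out[n++] = cp;
    }
    return n;
}
```

For the single character "ö" (UTF-8 bytes C3 B6), `widen_bytes` produces the two units 0xC3 and 0xB6, while `decode_utf8` produces the single code point U+00F6.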
matz> Hence, with GCC 3.3 there is no recoding involved. UTF-8 strings will be
matz> placed into the object file.
matz>
matz> With newer GCC there is recoding for _wide_ string literals, from UTF-8 to
matz> UCS-4 (i.e. 32 bits per character). Without user intervention there is no
matz> recoding for normal string literals, i.e. UTF-8 strings remain such.
matz>
matz> > If I have something like
matz> >
matz> > char * lstr = "Schöne Grüße";
matz>
matz> So, given this (assuming the input is UTF-8):
matz> char *str = "Schöne Grüße";
matz> wchar_t *wstr = L"Schöne Grüße";
matz>
matz> With our compiler (GCC 3.3), 'str' will be in UTF-8 and 'wstr' will be in
matz> something strange (the UTF-8 bytes, each extended to 32 bits).
matz>
matz> With newer compilers and no special options while compiling, 'str' will be
matz> in UTF-8 and 'wstr' will be in UCS-4.
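If you want to see what your own compiler does, a small test along these lines should show it. This is only a sketch: the helper name `dump_literals` is made up, and it assumes the source file is saved as UTF-8 and compiled without any character set options.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Assumption: this source file is saved as UTF-8. */
static const char    *str  = "Schöne Grüße";
static const wchar_t *wstr = L"Schöne Grüße";

/* Print the bytes of the narrow literal and the units of the wide literal. */
void dump_literals(void)
{
    assert(str != NULL && wstr != NULL);

    printf("str  (%zu bytes):", strlen(str));
    for (const unsigned char *p = (const unsigned char *)str; *p; ++p)
        printf(" %02X", *p);
    putchar('\n');

    printf("wstr (%zu units):", wcslen(wstr));
    for (const wchar_t *w = wstr; *w; ++w)
        printf(" %04lX", (unsigned long)*w);
    putchar('\n');
}
```

Call `dump_literals()` from `main`. With a compiler that converts wide literals, 'str' should come out as 15 bytes (each of ö, ü and ß takes two) and 'wstr' as 12 units, with 00F6 for the ö. With the widening behaviour of GCC 3.3 you would expect 15 units in 'wstr' instead, with C3 and B6 as separate units. On GCC versions that support it, `-fexec-charset=ISO-8859-1` should change 'str' to 12 bytes.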
--
Mike FABIAN http://www.suse.de/~mfabian
Lack of sleep is the enemy of good work.