"Stephen Holmes" <sholmes@novell.com> wrote:
This isn't directly related to SUSE - I just happen to be using it, but I had a question about GCC's handling of an input file encoded as UTF-8. If I have something like
char * lstr = Schöne Grüße
Will the compiler compile this to UCS-4 internally, or just leave it as a byte sequence?
Here is an answer from our compiler expert Michael Matz <matz@suse.de>:

matz> For string literals basically the byte values are transferred literally
matz> from the input file to the object file. (C99 is more complicated than
matz> this, but with UTF-8 files this is what it boils down to).
matz>
matz> With newer GCC versions (3.4 onward) normal string literals have their own
matz> character set, which can be chosen by the user. By default this also is
matz> UTF-8, so the behaviour is the same, i.e. UTF-8 strings are copied
matz> literally to the object file.
matz>
matz> This is different with _wide_ character literals, i.e. strings of the form
matz> L"Blabla". With GCC 3.3 the input bytes are simply widened, i.e. such a
matz> wide character literal contains the same bytes as the UTF-8 encoding,
matz> except that each byte is extended to 32 bit. With GCC 3.5 this is
matz> different. There the input characters (in UTF-8) are converted into the
matz> wide character set, which is UTF-32 (i.e. UCS-4) by default on machines
matz> where wchar_t is 32 bit wide (i.e. all Linux platforms).
matz>
matz> Hence, with gcc 3.3 there is no recoding involved. UTF-8 strings will be
matz> placed into the object file.
matz>
matz> With newer gcc there is recoding for _wide_ string literals, from UTF-8 to
matz> UCS-4 (i.e. 32 bit per character). Without user intervention there is no
matz> recoding for normal string literals, i.e. UTF-8 strings remain such.
matz>
matz> > If I have something like
matz> >
matz> > char * lstr = Schöne Grüße
matz>
matz> So, given this (assuming the input is UTF-8):
matz> char *str = "Schöne Grüße";
matz> wchar_t *wstr = L"Schöne Grüße";
matz>
matz> with our compiler 'str' will be in UTF-8, 'wstr' will be in something
matz> strange (UTF-8 but each byte extended to 32 bit).
matz>
matz> with newer compilers and no special options while compiling 'str' will be
matz> in UTF-8, and 'wstr' will be in UCS-4.
-- 
Mike FABIAN <mfabian@suse.de> http://www.suse.de/~mfabian
Lack of sleep is the enemy of good work.