----- Original Message -----
From: "Damon Register"

G T Smith wrote:
> Odd, and worrying. I assume you have checked the consistency of the
> result, i.e. whether this happens to the same file in the same place?

I think it is quite repeatable. As far as I know, that one file was the
only one with non-ascii letters in the name. Just for fun I tried making
a test here at work, where we have a Solaris-hosted Samba server and a
drive mapped to it on our PCs. I created a plain text file with an
accented a in the name and copied it to a folder on the mapped drive.
I logged into the Solaris system and did ls -l on that folder. The
accented a was mangled. Stranger yet, on that same Solaris system I ran
the nautilus file manager, and it showed the file with the correct
accent on the a.
That's not strange at all. It's just the natural consequence of using different character sets in different interfaces to display the same string of bytes.

What measures, if any, have you taken to ensure, or at least verify, that everything which touches the file is either using the same character set and encoding, or, failing that, that each part is accurately and fully configured to know what character sets and encodings all the other parts are using, so that they can correctly translate in those cases where they might do so? If you have no idea, then you will regularly see what _looks_ like errors like this, unless you simply avoid using any characters outside of the traditional low-ascii alphanumeric values, where most character sets happen to use the same glyphs for that subset of byte values.

If you speak the word "see" into a tape recorder and then play it back to a blindfolded, English-speaking optometrist, they probably hear the word "see". Play the same tape back to a blindfolded, English-speaking sailor, and they probably hear "sea". A Spanish speaker probably hears "si". An English-speaking software developer probably hears "C". Etc., etc. So it is with computers and character sets, character encoding schemes, and fonts.

There are some mechanisms for putting data in context, so that when one program "speaks" a string of bytes, it also indicates what "language" those bytes are intended to be interpreted with. But those mechanisms are mostly recent developments, not fully implemented, and not fully backwards compatible with older systems which had no such ability.

So, in Windows you create a filename with a lower-case a-acute, on a Windows PC, in the USA, for an English-speaking user. The file-save dialog UI is _probably_ using UTF-16, UTF-8, or codepage 1252. In all of those cases the a-acute happens to be the same code point, 0xE1: a single byte 0xE1 in codepage 1252, the 16-bit code unit 0x00E1 in UTF-16, and U+00E1 in UTF-8 too, though UTF-8 actually stores that code point as the two bytes 0xC3 0xA1.
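You can see this for yourself with a few lines of Python 3 (a minimal sketch; the codec names used here are the standard aliases that ship with CPython):

```python
# One character, U+00E1 (lower-case a-acute), three different byte forms.
ch = "\u00e1"  # 'á'

print(ch.encode("cp1252"))     # single byte 0xE1
print(ch.encode("utf-16-le"))  # 16-bit code unit 0x00E1, little-endian
print(ch.encode("utf-8"))      # two bytes 0xC3 0xA1 - not 0xE1
```

The bytes on disk only mean "a-acute" if everyone reading them agrees on which of these encodings was used to write them.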
So, at least on the Windows PC's local disk, the filename probably has the byte 0xE1 in it. Now you copy that file to the Samba share. Let's assume for the moment that Samba does no magic translation of the filename at save time, so it just copies the 0xE1 without caring what glyph might be associated with that value.

Now you log in at the console, or telnet in from a terminal emulator that is configured to accurately mimic the console. The console may or may not be configured to load a software font over the VGA hardware. In the USA, if the console is loading a font, it's probably Latin-1, aka ISO 8859-1. (I guess this was a bad example character, because in that character set, too, 0xE1 just happens to be lower-case a-acute.) But the character set built in to most VGA hardware is none of the character sets mentioned so far; it's codepage 437. The glyph associated with the byte value 0xE1 in codepage 437 is the German sharp-s, ß (the Greek alpha is the neighboring value, 0xE0). So if the console is configured not to do any character translating or software font loading, and is thus using codepage 437, and if Samba is not doing any translating either, then when you ls that filename at the console, instead of a lower-case a-acute you'll see a ß.

But almost any program in X on the same box is probably using Latin-1, so in the nautilus file manager you'll probably see the lower-case a-acute. The character hasn't changed and isn't "wrong" anywhere. The point is simply this: if you are going to use _any_ characters outside the least-common-denominator set of plain low ascii (aka 7-bit US-ASCII), then you need to understand all about character sets and display contexts. If you don't, or can't, then stick to the plain characters. Most character sets have the same glyphs for that subset of byte values
(0x20 to 0x7E, or decimal 32 to 126), and, less tangibly, the same sorting rules for those characters.

In fact, regardless of how much you might learn about this, and how well and completely you might configure all the software on all the devices in your organization, so that perhaps everything speaks UTF-8 and is consistent everywhere, you should probably _still_ avoid special characters, because you can never know what character set someone else is using who may need to handle files (or even merely their names) created within your organization.

Brian K. White    brian@aljex.com    http://www.myspace.com/KEYofR
+++++[>+++[>+++++>+++++++<<-]<-]>>+.>.+++++.+++++++.-.[>+<---]>++.
filePro  BBx    Linux  SCO  FreeBSD    #callahans  Satriani  Filk!

--
To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org
For additional commands, e-mail: opensuse+help@opensuse.org
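For anyone who wants to reproduce the round trip described above, here is a sketch in Python 3 (standard codecs only): the same stored byte 0xE1 displayed under the character sets discussed, and a check that the plain printable low-ascii range really is safe everywhere.

```python
# The byte Samba faithfully stored, interpreted three ways.
raw = b"\xe1"
for charset in ("cp1252", "latin-1", "cp437"):
    print(charset, "->", raw.decode(charset))
# cp1252 and latin-1 both show the a-acute; cp437 shows the sharp-s.

# The printable low-ascii range decodes identically in all of them.
printable = bytes(range(0x20, 0x7F))  # 0x20..0x7E inclusive
reference = printable.decode("ascii")
for charset in ("cp1252", "latin-1", "cp437", "utf-8"):
    assert printable.decode(charset) == reference
print("0x20-0x7E decodes identically in all tested charsets")
```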