[opensuse] uncompressing zip files and accented characters

Istvan Gabor

12 Jul 2010

19:53

Hello:

I am using openSUSE 11.2 with KDE 3.5.10. I would like to uncompress a zip file which presumably contains several files with Hungarian accented characters in file names. When I use 'unzip' command in a terminal window I get file names like this: K?zvet?t?v?laszt?sos

The ? marks probably replace accented characters.

How could I uncompress the zip file so that the original accented characters would be preserved?

Thanks, Istvan

-- To unsubscribe, e-mail: opensuse+unsubscribe@opensuse.org For additional commands, e-mail: opensuse+help@opensuse.org

Ken Schneider - openSUSE

12 Jul

23:04

On 07/12/2010 03:53 PM, Istvan Gabor pecked at the keyboard and wrote:

...

Hello:

I am using openSUSE 11.2 with KDE 3.5.10. I would like to uncompress a zip file which presumably contains several files with Hungarian accented characters in file names. When I use 'unzip' command in a terminal window I get file names like this: K?zvet?t?v?laszt?sos

The ? marks probably replace accented characters.

How could I uncompress the zip file so that the original accented characters would be preserved?

Thanks, Istvan

Perhaps install the font in question?

-- Ken Schneider SuSe since Version 5.2, June 1998

Istvan Gabor

13 Jul

08:47

On 13 July 2010 at 1:04, Ken Schneider - openSUSE wrote:

...

On 07/12/2010 03:53 PM, Istvan Gabor pecked at the keyboard and wrote:

...
Hello:

I am using openSUSE 11.2 with KDE 3.5.10. I would like to uncompress a zip file which presumably contains several files with Hungarian accented characters in file names. When I use 'unzip' command in a terminal window I get file names like this: K?zvet?t?v?laszt?sos

The ? marks probably replace accented characters.

How could I uncompress the zip file so that the original accented characters would be preserved?

Thanks, Istvan

Perhaps install the font in question?

I do not think that it is a font issue. I rather guess the zip/unzip program uses a different encoding when it uncompresses the file. Is there a way to tell unzip which encoding to use?

Istvan

Илья Черных

08:55

New subject: Re[2]: [opensuse] uncompressing zip files and accented characters

Yes. Though the upstream fiercely rejects the patch arguing that Unicode should be used in archives and not the Windows encoding.

Dave Howorth

09:24

Илья Черных wrote:

...

Yes. Though the upstream fiercely rejects the patch arguing that Unicode should be used in archives and not the Windows encoding.

I guess this was an answer to Istvan Gabor wrote:

...

I do not think that it is a font issue. I rather guess the zip/unzip program uses a different encoding when it uncompresses the file. Is there a way to tell unzip which encoding to use?

but if so, I didn't find an option on the unzip man page. Could you be more explicit about how to get unzip to re-encode filenames, please?

Thanks, Dave

Илья Черных

09:39

New subject: Re[2]: [opensuse] uncompressing zip files and accented characters

...

but if so, I didn't find an option on the unzip man page. Could you be more explicit about how to get unzip to re-encode filenames, please?

As I said, the upstream fiercely rejects any patches that would help handle archives created with a non-UTF encoding, such as those created under Windows.

Dave Howorth

09:46

Илья Черных wrote:

...

...
but if so, I didn't find an option on the unzip man page. Could you be more explicit about how to get unzip to re-encode filenames, please?

As I said, the upstream fiercely rejects any patches that would help handle archives created with a non-UTF encoding, such as those created under Windows.

So when you said 'Yes', you meant 'No'.

Cheers, Dave

PS Please don't send me copies of mails; I read the list.

Jon Clausen

09:04

On Tue, 13 Jul, 2010 at 10:47:07 +0200, Istvan Gabor wrote:

...

On 13 July 2010 at 1:04, Ken Schneider - openSUSE wrote:

...
On 07/12/2010 03:53 PM, Istvan Gabor pecked at the keyboard and wrote:

...
Hello:

I am using openSUSE 11.2 with KDE 3.5.10. I would like to uncompress a zip file which presumably contains several files with Hungarian accented characters in file names. When I use 'unzip' command in a terminal window I get file names like this: K?zvet?t?v?laszt?sos

The ? marks probably replace accented characters.

How could I uncompress the zip file so that the original accented characters would be preserved?

Thanks, Istvan

Perhaps install the font in question?

I do not think that it is a font issue. I rather guess the zip/unzip program uses a different encoding when it uncompresses the file. Is there a way to tell unzip which encoding to use?

Might be that the archive was created in an ISO-<something> environment? I'd suggest having a look at 'convmv'.

/jon -- YMMV

Ken Schneider - openSUSE

12:34

On 07/13/2010 04:47 AM, Istvan Gabor pecked at the keyboard and wrote:

...

On 13 July 2010 at 1:04, Ken Schneider - openSUSE wrote:

...
On 07/12/2010 03:53 PM, Istvan Gabor pecked at the keyboard and wrote:

...
Hello:

I am using openSUSE 11.2 with KDE 3.5.10. I would like to uncompress a zip file which presumably contains several files with Hungarian accented characters in file names. When I use 'unzip' command in a terminal window I get file names like this: K?zvet?t?v?laszt?sos

The ? marks probably replace accented characters.

How could I uncompress the zip file so that the original accented characters would be preserved?

Thanks, Istvan

Perhaps install the font in question?

I do not think that it is a font issue. I rather guess the zip/unzip program uses a different encoding when it uncompresses the file. Is there a way to tell unzip which encoding to use?

Istvan

PLEASE KEEP REPLIES TO THE LIST. IF YOUR INFERIOR EMAIL PROGRAM DOES NOT SUPPORT REPLY-TO-LIST HAVE ENOUGH COURTESY TO REMOVE THE PRIVATE EMAIL ADDRESS BEFORE HITTING SEND. PEOPLE ON THIS LIST DO NOT NEED TWO COPIES OF YOUR REPLY!

-- Ken Schneider SuSe since Version 5.2, June 1998

Dave Howorth

08:50

Istvan Gabor wrote:

...

Hello:

I am using openSUSE 11.2 with KDE 3.5.10. I would like to uncompress a zip file which presumably contains several files with Hungarian accented characters in file names. When I use 'unzip' command in a terminal window I get file names like this: K?zvet?t?v?laszt?sos

The ? marks probably replace accented characters.

How could I uncompress the zip file so that the original accented characters would be preserved?

They already are, I think. Your problem is that your terminal's locale doesn't include them so it can't display them, so it substitutes question marks.

You need to ensure that your terminal is using the same encoding that the filenames are in, or else you need to rename all the files so that the names use the same encoding as your terminal. Probably the best solution for the long term is to make everything utf-8.

If your zip file is a one-off that you need to deal with and then you've finished with it, you could just set up a terminal session using whatever encoding was used when the files were created and work within that environment.

Cheers, Dave
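As an illustration of the display problem described in this message, here is a minimal Python sketch. It assumes, purely for illustration, that the name was stored in code page 852 (a DOS encoding for Central European languages); the thread does not establish which encoding the archive actually used. Decoding those bytes as UTF-8 fails for every accented letter, which gives the same pattern of question marks as in the original post.

```python
# Hypothetical example: the real encoding of the archive is not known.
name = "Közvetítőválasztásos"

# Bytes as a legacy single-byte encoding (assumed here: cp852) would store them.
raw = name.encode("cp852")

# A UTF-8 environment cannot decode the accented bytes, so each one is
# replaced by a placeholder, shown here as a question mark.
shown = raw.decode("utf-8", errors="replace").replace("\ufffd", "?")

print(shown)
```

The bytes themselves are untouched; only the interpretation changes, which is why decoding `raw` with the right encoding gives the original name back.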

Philipp Thomas

09:29

* Dave Howorth (dhoworth@mrc-lmb.cam.ac.uk) [20100713 10:50]:

...

They already are, I think.

No they aren't. The zip format neither specifies an encoding to use nor does it offer a field that identifies the encoding. Thus unzip in its original form can't handle different encodings and you also can't specify the encoding. And as stated, upstream has rejected all patches up till now, stating that utf8 should be used. Right, as if any Win* user would be able to do so.

To remedy the situation a bit I've accepted a patch to openSUSE's unzip that will decode Russian and Czech encodings. As librcc is extensible, maybe it could be extended to also handle Hungarian file names sometime in the future.

Philipp
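For readers who want to inspect such an archive programmatically, here is a minimal sketch using Python's standard zipfile module rather than the unzip tool discussed in this thread. Python's zipfile treats a member name as UTF-8 only when general-purpose flag bit 11 is set and otherwise decodes it as cp437, so the original bytes can be recovered by encoding back to cp437 and then decoded with a guessed encoding. The function name `list_names` and the default guess of cp852 are assumptions for illustration, not something the thread establishes.

```python
import zipfile


def list_names(archive, assumed_encoding="cp852"):
    """Return member names, re-decoding legacy names with assumed_encoding.

    assumed_encoding is a guess supplied by the caller; for names without
    the UTF-8 flag the archive does not say which encoding was used.
    """
    names = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.flag_bits & 0x800:
                # Bit 11 set: the name is stored as UTF-8, keep it as decoded.
                names.append(info.filename)
            else:
                # zipfile decoded the raw bytes as cp437; undo that losslessly.
                raw = info.filename.encode("cp437")
                names.append(raw.decode(assumed_encoding))
    return names
```

If the guess is wrong the call still succeeds for most single-byte encodings and simply returns different wrong letters, so the result has to be checked by someone who can read the language.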

Dave Howorth

09:40

Philipp Thomas wrote:

...

* Dave Howorth (dhoworth@mrc-lmb.cam.ac.uk) [20100713 10:50]:

...
They already are, I think.

No they aren't.

Another cryptic posting! What do you mean by this? My understanding is that unzip does not alter the binary octets in the filenames, so in so far as a filename contains characters at all they are preserved in whatever character set and encoding was used to create them. Is that wrong? What has unzip transformed the filenames into if it hasn't preserved them?

...

The zip format neither specifies an encoding to use nor does it offer a field that identifies the encoding. Thus unzip in its original form can't handle different encodings and you also can't specify the encoding. And as stated, upstream has rejected all patches up till now, stating that utf8 should be used. Right, as if any Win* user would be able to do so.

To remedy the situation a bit I've accepted a patch to openSUSE's unzip that will decode Russian and Czech encodings. As librcc is extensible, maybe it could be extended to also handle Hungarian file names sometime in the future.

Philipp

Philipp Thomas

10:14

* Dave Howorth (dhoworth@mrc-lmb.cam.ac.uk) [20100713 11:40]:

...

Is that wrong? What has unzip transformed the filenames into if it hasn't preserved them?

Ok, once again: Zip will write the names to the archive in whatever encoding the originating machine uses. However it will *not* record the encoding used. So in this case unzip will read the names encoded in say latin-2 (a single byte encoding) and will write them as utf8 (a multi byte encoding) which of course will result in the gibberish the OP posted. Preserving the encoding is useless unless you know the original one and can thus convert them to the encoding used on the machine used for unpacking the archive. The names created by unzip on unpacking can not be converted anymore!

Now a bit clearer?

Philipp

Per Jessen

10:29

Philipp Thomas wrote:

...

* Dave Howorth (dhoworth@mrc-lmb.cam.ac.uk) [20100713 11:40]:

...
Is that wrong? What has unzip transformed the filenames into if it hasn't preserved them?

Ok, once again: Zip will write the names to the archive in whatever encoding the originating machine uses. However it will *not* record the encoding used. So in this case unzip will read the names encoded in say latin-2 (a single byte encoding) and will write them as utf8 (a multi byte encoding) which of course will result in the gibberish the OP posted.

Isn't it rather that unzip simply dumps whatever filenames were zipped, and that the terminal attempts to display those names as if they are utf8? Or does zip really convert from (for instance) latin-2 to utf8 ??

-- Per Jessen, Zürich (23.9°C)

Dave Howorth

10:50

Per Jessen wrote:

...

Philipp Thomas wrote:

...
* Dave Howorth (dhoworth@mrc-lmb.cam.ac.uk) [20100713 11:40]:

...
Is that wrong? What has unzip transformed the filenames into if it hasn't preserved them?

Ok, once again: Zip will write the names to the archive in whatever encoding the originating machine uses. However it will *not* record the encoding used. So in this case unzip will read the names encoded in say latin-2 (a single byte encoding) and will write them as utf8 (a multi byte encoding) which of course will result in the gibberish the OP posted.

Isn't it rather that unzip simply dumps whatever filenames were zipped, and that the terminal attempts to display those names as if they are utf8? Or does zip really convert from (for instance) latin-2 to utf8 ??

Exactly, as far as I know filenames are stored in the filesystem as octets. There's no notion of characters or encodings. Neither does the kernel care what the octet sequence represents. The semantics is added by application layers above that. Talk of encoding in the filenames themselves is muddled thinking, AFAIK.

I believe that unzip simply copies the octet sequence that is the filename. So they can be read 'sensibly' by any application running in an environment that uses the same character set and encoding, if that is all that is required.

OTOH, if the requirement is to use the files with arbitrary applications running in a utf-8 environment (which is probably the default in any recently built system) then the filenames need to be changed such that the sequence of octets represents a utf-8 encoding. As has been suggested, convmv is a way to do that.

Cheers, Dave
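The renaming step described in this message can be sketched in a few lines of Python. This is not how convmv is implemented; it is a simplified stand-in that assumes a POSIX filesystem where names are arbitrary bytes, and the function name `rename_to_utf8` and the caller-supplied `source_encoding` are hypothetical.

```python
import os


def rename_to_utf8(directory, source_encoding):
    """Rename entries in directory whose names are not valid UTF-8.

    source_encoding is the caller's guess at the original encoding.
    Returns a list of (old_bytes, new_bytes) pairs that were renamed.
    """
    renamed = []
    dir_bytes = os.fsencode(directory)
    for old in os.listdir(dir_bytes):  # bytes path in, bytes names out
        try:
            old.decode("utf-8")
            continue  # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        # Reinterpret the octets with the guessed encoding, then store UTF-8.
        new = old.decode(source_encoding).encode("utf-8")
        os.rename(os.path.join(dir_bytes, old), os.path.join(dir_bytes, new))
        renamed.append((old, new))
    return renamed
```

Working with bytes paths throughout keeps the sketch independent of the locale of the process, which matches the point that the filesystem only sees octets.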

Philipp Thomas

12:32

* Dave Howorth (dhoworth@mrc-lmb.cam.ac.uk) [20100713 12:50]:

...

Exactly, as far as I know filenames are stored in the filesystem as octets.

Correct so far.

...

There's no notion of characters or encodings.

That's not correct.

...

Neither does the kernel care what the octet sequence represents.

Wrong! File system drivers like ntfs or vfat explicitly use specific encodings.

...

Talk of encoding in the filenames themselves is muddled thinking.

I tend to disagree.

...

As has been suggested, convmv is a way to do that.

Convmv may be able to convert the file name on disk, but it won't change unzip's display.

Philipp

Dave Howorth

13:45

Philipp Thomas wrote:

...

* Dave Howorth (dhoworth@mrc-lmb.cam.ac.uk) [20100713 12:50]:

...
Exactly, as far as I know filenames are stored in the filesystem as octets.

Correct so far.

...
There's no notion of characters or encodings.

That's not correct.

OK, I hold my hand up. Please just give me a reference to the place where the encodings are defined so I can learn.

...

...
Neither does the kernel care what the octet sequence represents.

Wrong! File system drivers like ntfs or vfat explicitly use specific encodings.

Hmm, so Microsoft break software layering abstractions. Why does that not surprise me? Isn't ntfs-3g a user-level driver though?

...

...
Talk of encoding in the filenames themselves is muddled thinking.

I tend to disagree.

These articles are quite old so I thought at first my beliefs were just out of date:

http://lwn.net/Articles/71472/
http://www.win.tue.nl/~aeb/linux/lk/lk-6.html

But this one is dated 2010-05-23:

http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

"Yet because you can’t know the character encoding of a given filename, in theory you can’t display filenames at all today. Why? Because then you don’t know how to translate the bytes of a filename into displayable characters (!)."

...

...
As has been suggested, convmv is a way to do that.

Convmv may be able to convert the file name on disk, but it won't change unzip's display.

Indeed. Setting the appropriate environment, specifically locale, in which to run unzip is the way to do that.

Cheers, Dave

PS I'm not trying to argue that the filename architecture is the best way to design it, just what it is.

Philipp Thomas

14:04

* Dave Howorth (dhoworth@mrc-lmb.cam.ac.uk) [20100713 15:45]:

...

OK, I hold my hand up. Please just give me a reference to the place where the encodings are defined so I can learn.

Sorry I slightly misunderstood that sentence. On rereading you are right that there is no notion of encoding in a sequence of bytes.

...

Indeed. Setting the appropriate environment, specifically locale, in which to run unzip is the way to do that.

Setting locale alone won't help you. You'd have to know the encoding of the zipped filenames and then start a shell with the correct locale and load the right console character set. When you run unzip in that shell it would probably display the file names correctly.

No, IMNSHO the only way to handle this issue correctly would be to incorporate some kind of automatic encoding detection (like what Firefox uses to determine web page encoding), like we have in the openSUSE version for Russian and Czech names.

Philipp
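To make the idea of automatic detection concrete, here is a deliberately naive Python sketch. It is not what librcc or the openSUSE patch does; it simply tries a short list of candidate encodings and keeps the one whose result contains the most Hungarian accented letters. The candidate list, the scoring rule and the name `guess_decode` are all assumptions made for this example.

```python
HUNGARIAN_LETTERS = set("áéíóöőúüűÁÉÍÓÖŐÚÜŰ")
CANDIDATES = ("cp852", "iso8859_2", "cp1250")  # illustrative shortlist


def guess_decode(raw, candidates=CANDIDATES):
    """Decode raw with the candidate that yields the most Hungarian letters."""
    best_text, best_score = None, -1
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # some byte is not defined in this code page
        score = sum(ch in HUNGARIAN_LETTERS for ch in text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```

A heuristic like this can only rank guesses. Short names or names with few accented letters give it little to work with, and ISO-8859-2 and cp1250 agree on the Hungarian letters, so it cannot tell those two apart.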

Istvan Gabor

15 Jul

12:58

Thank you all for the many responses.

I have tried to convert the unzipped file name using convmv. This worked but some characters were converted incorrectly. Namely the ő character was incorrectly converted to ď.

I also tried to change encoding setting of KDE3 konsole, and also started new shell with command "LANG=hu_HU.iso-8859-2"; none of them helped.

By the way the file I wanted to unzip is located here: http://www.tvnetwork.hu/static/docs/aszf_2010_julius_1-tol.zip

Thank you again, Istvan
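One way a conversion can get most letters right and a few wrong is when the source encoding given to the converter is close to, but not the same as, the one actually used. The Python sketch below shows the effect with two related DOS code pages, cp852 and cp850. This pairing is an assumption chosen for illustration; the thread does not establish which encodings were involved in the ő to ď mix-up.

```python
name = "Közvetítőválasztásos"

# Assume, for illustration only, that the name was stored as cp852 ...
raw = name.encode("cp852")

# ... but is converted as if it were cp850, which shares most byte values.
wrong = raw.decode("cp850")

# Collect the character pairs where the converted name differs from the original.
diffs = [(a, b) for a, b in zip(name, wrong) if a != b]

print(wrong, diffs)
```

In this constructed case only ő comes out wrong while ö, í and á survive, which is the same kind of partial success reported above.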

Per Jessen

13:41

Istvan Gabor wrote:

...

Thank you all for the many responses.

I have tried to convert the unzipped file name using convmv. This worked but some characters were converted incorrectly. Namely the ő character was incorrectly converted to ď.

I also tried to change encoding setting of KDE3 konsole, and also started new shell with command "LANG=hu_HU.iso-8859-2"; none of them helped.

Do you have a locale like that? My machine doesn't.

...

By the way the file I wanted to unzip is located here: http://www.tvnetwork.hu/static/docs/aszf_2010_julius_1-tol.zip

I tried the following on my 11.3 test system:

LANG=hu_HU.utf8 unzip -l aszf_2010_julius_1-tol.zip

See screenshot: http://public.jessen.ch/files/unzip-screenshot.jpg (I can't tell if it looks right or not).

-- Per Jessen, Zürich (25.7°C)

Philipp Thomas

13 Jul

12:21

* Per Jessen (per@opensuse.org) [20100713 12:29]:

...

Isn't it rather that unzip simply dumps whatever filenames were zipped, and that the terminal attempts to display those names as if they are utf8?

Yes, I should have been more precise :( Unzip just dumps the filename as recorded in the zipfile and the terminal attempts to display them as if they were encoded in whatever LC_CTYPE is set to.

...

Or does zip really convert from (for instance) latin-2 to utf8 ??

If it did that there would be no problem :)

Philipp
