[oS-EN] Text conversion problem
Hi,

In an old text file, which "file" says is UTF-8 text, there are some characters which are not. Some were obviously accented letters, like "á", so I just did a search and replace on them. There is one entry where I don't know what it is:

············
rsync -a myfolder/*/*.jpg my-new-folder/
············

(The file got corrupted at some point because one editor thought the file was UTF-8, and another editor thought differently.)

I saved the bad part to a file, but the command "file" insists it is UTF-8:

cer@Telcontar:~> cat p
············
············
cer@Telcontar:~> file p
p: UTF-8 Unicode text
cer@Telcontar:~>

And I fail to force a conversion:

cer@Telcontar:~> iconv -f LATIN6 -t UTF-8 p
÷÷÷÷÷÷÷÷÷÷÷÷
÷÷÷÷÷÷÷÷÷÷÷÷
cer@Telcontar:~>

Perhaps I'm guessing the non-UTF encoding wrong. I don't remember for sure which was the old Latin encoding we used in Spain, either.

My guess is that the string is

················

which is a centered dot, which on my keyboard (Spain) is on the [.] key, pressed together with [AltGr].

Testing beside the bad string: ···

Ideas?

-- 
Cheers / Saludos,
Carlos E. R.
(from Telcontar, using openSUSE Leap 15.4)
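To find out what an unknown character in such a string actually is, it can help to list each distinct character with its code point, Unicode name, and UTF-8 bytes. The following is a minimal Python sketch and not something from the thread; the helper name `describe` and the sample strings are made up for illustration.

```python
import unicodedata


def describe(text):
    """Return (char, code point, Unicode name, UTF-8 bytes in hex) for each
    distinct non-whitespace character, in order of first appearance."""
    seen = []
    for ch in text:
        if ch.isspace() or ch in seen:
            continue
        seen.append(ch)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"),
         ch.encode("utf-8").hex(" "))
        for ch in seen
    ]


if __name__ == "__main__":
    # Hypothetical samples: three middle dots, then three division signs.
    for sample in ("\u00b7\u00b7\u00b7", "\u00f7\u00f7\u00f7"):
        for row in describe(sample):
            print(*row)
```

For a real file, the text could be read first with something like `open("p", encoding="utf-8").read()`, assuming the file is valid UTF-8 as "file" reports.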
On Friday, 21 July 2023 14:09:31 CEST, Carlos E. R. wrote:
In an old text file, which "file" says it is utf-8 text, there are some chars which are not.
Did you try "enca file.txt"? In my experience it is relatively successful. But if the file was overwritten under some wrong encoding, it might be hard. You can also guess from <https://en.wikipedia.org/wiki/Code_page>

-- 
Vojtěch Zeisek
https://trapa.cz/

Komunita openSUSE GNU/Linuxu
Community of the openSUSE GNU/Linux
https://www.opensuse.org/
On 2023-07-21 14:43, Vojtěch Zeisek wrote:
On Friday, 21 July 2023 14:09:31 CEST, Carlos E. R. wrote:
In an old text file, which "file" says it is utf-8 text, there are some chars which are not.
Did You try "enca file.txt"? From my experience it's relatively successful. But if the file was overwritten under some wrong encoding, it might be hard. You can also guess from <https://en.wikipedia.org/wiki/Code_page>
I had to install it.

cer@Telcontar:~> enca p
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.
cer@Telcontar:~>

Doesn't include Spanish... :-(

cer@Telcontar:~> enca --list languages
belarusian: CP1251 IBM866 ISO-8859-5 KOI8-UNI maccyr IBM855 KOI8-U
bulgarian: CP1251 ISO-8859-5 IBM855 maccyr ECMA-113
czech: ISO-8859-2 CP1250 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
estonian: ISO-8859-4 CP1257 IBM775 ISO-8859-13 macce baltic
croatian: CP1250 ISO-8859-2 IBM852 macce CORK
hungarian: ISO-8859-2 CP1250 IBM852 macce CORK
lithuanian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
latvian: CP1257 ISO-8859-4 IBM775 ISO-8859-13 macce baltic
polish: ISO-8859-2 CP1250 IBM852 macce ISO-8859-13 ISO-8859-16 baltic CORK
russian: KOI8-R CP1251 ISO-8859-5 IBM866 maccyr
slovak: CP1250 ISO-8859-2 IBM852 KEYBCS2 macce KOI-8_CS_2 CORK
slovene: ISO-8859-2 CP1250 IBM852 macce CORK
ukrainian: CP1251 IBM855 ISO-8859-5 CP1125 KOI8-U maccyr
chinese: GBK BIG5 HZ
none:
cer@Telcontar:~>

-- 
Cheers / Saludos,
Carlos E. R.
(from 15.4 x86_64 at Telcontar)
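Since enca has no Spanish profile, one manual alternative is to decode a sample of raw bytes with a handful of Western European candidates and compare the results by eye. A hedged Python sketch follows. The candidate list and the function name `try_decodings` are assumptions, not something from the thread, and the sample byte 0xB7 is simply the position the middle dot occupies in ISO-8859-1; it was not read from the file discussed here.

```python
def try_decodings(raw, candidates=("utf-8", "iso8859-1", "iso8859-15",
                                   "cp1252", "cp850")):
    """Decode `raw` with each candidate codec; map the codec name to the
    decoded text, or to None when the bytes are invalid in that encoding."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    sample = b"\xb7\xb7\xb7"  # hypothetical legacy bytes, not from the thread's file
    for name, text in try_decodings(sample).items():
        print(f"{name:12} {text!r}")
```

This only helps on bytes that are still in the legacy encoding. A file that has already been re-saved as UTF-8 decodes cleanly as UTF-8, so this kind of guessing would say nothing about its earlier encoding.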
On 2023-07-21 20:36, Andrei Borzenkov wrote:
On 21.07.2023 15:09, Carlos E. R. wrote:
I saved the bad part to a file, but the command "file" insists it is utf-8
It is UTF-8
cer@Telcontar:~> cat p ············ ············
Yes, it is exactly the content of this file in UTF-8.
Yes, I know that it is UTF-8 now, but originally it wasn't. It was probably

················

Some editor thought the file (text, 154 KB) was not UTF-8, saved it as UTF-8, and corrupted some strings. That is one of them.

I'll just assume it is a string of centered dots "·", and be done.

Right, done now.

-- 
Cheers / Saludos,
Carlos E. R.
(from 15.4 x86_64 at Telcontar)
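The kind of corruption described here, a UTF-8 file opened as a single-byte encoding and then saved as UTF-8 again, can sometimes be reversed by re-encoding the text with the wrongly assumed encoding and decoding the result as UTF-8. The sketch below illustrates this under the assumption that the wrong encoding was ISO-8859-1. The thread never establishes which encoding was involved, and `undo_double_encoding` is a made-up name.

```python
def undo_double_encoding(text, assumed="iso8859-1"):
    """Try to reverse one round of 'UTF-8 bytes read as `assumed`, then
    saved as UTF-8'.

    Returns the repaired text, or None if the text cannot have come from
    that round trip, so the caller can leave it alone.
    """
    try:
        return text.encode(assumed).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


if __name__ == "__main__":
    # Hypothetical damaged strings: an accented "a" and two middle dots,
    # each after one bad round trip through ISO-8859-1.
    for damaged in ("\u00c3\u00a1", "\u00c2\u00b7\u00c2\u00b7"):
        print(repr(damaged), "->", repr(undo_double_encoding(damaged)))
```

Undamaged non-ASCII text generally comes back as None, so it seems safer to apply this to the damaged runs only and to check each result by eye before writing anything back to the file.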
participants (4)
- Andrei Borzenkov
- Carlos E. R.
- Carlos E.R.
- Vojtěch Zeisek