Christian Boltz wrote:
Hello,
On Tuesday, 10 March 2020, 12:00:10 CET, Per Jessen wrote:
Per Jessen wrote:
I think the next step is to try out the scripts that Malcolm sent a link to.
FWIW, I quickly hit the same issue Christian reported on IRC - duplicate key in vb_tag. "menu" compares equal to "menü" unless we use the collation utf8mb4_bin.
Then you should probably use this collation ;-)
I did try it, and it does fix the vb_tag problem.
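To illustrate why the duplicate key goes away with utf8mb4_bin, here is a minimal Python sketch. It does not reproduce MySQL's actual collation rules; `accent_insensitive_key` is a hypothetical stand-in for an accent- and case-insensitive collation, and `binary_key` stands in for a binary one that keeps every distinct string distinct.

```python
import unicodedata


def accent_insensitive_key(s: str) -> str:
    """Rough stand-in for an accent- and case-insensitive collation."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def binary_key(s: str) -> bytes:
    """Stand-in for a binary collation: compare the raw UTF-8 bytes."""
    return s.encode("utf-8")


tags = ["menu", "menü"]
print(len({accent_insensitive_key(t) for t in tags}))  # 1 -> would collide on a unique key
print(len({binary_key(t) for t in tags}))              # 2 -> both rows can coexist
```

The trade-off is that a binary comparison is also case-sensitive, so lookups that relied on case-insensitive matching may behave differently.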
Looking at the database export from Provo, I see:
... für wen Tumbleweed empfohlen wird, möchte ich mir ... ... wieder benützen möchtest und falls er überhaupt ... Ankündigungen und Neues ... habe ich eine Änderung übersehen
All clearly _intended_ as UTF8.
I assume your editor was in utf8 mode? I'll assume "yes".
I was using 'less', but yes, everything is in utf8 mode here.
This looks double-encoded. (Just to be sure, grep the relevant line from the sqldump and pipe this line into "file -".)
/dev/stdin: UTF-8 Unicode text, with very long lines
Do you know the date when this was posted?
Nope, I don't know the database that well.
Elsewhere, I see e.g.:
Kurz: Ändert bitte bis spätestens 31.03.2013
Clearly correct UTF8.
Right. (Also pipe this line into "file -" to be sure.)
/dev/stdin: UTF-8 Unicode text, with very long lines
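Both samples being reported as UTF-8 is what one would expect even if one of them is double-encoded, because double-encoded text is still valid UTF-8. The Python sketch below simulates the mistake under the assumption that UTF-8 bytes were misread as cp1252 (the usual "latin1" mix-up) and then stored as UTF-8 again. The helper names are hypothetical, and nothing in the thread confirms that this is how the forum data was damaged.

```python
def double_encode(text: str) -> str:
    """Simulate the suspected mistake: UTF-8 bytes misread as cp1252."""
    return text.encode("utf-8").decode("cp1252")


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8, which is all a validity check can tell."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


good = "Ankündigungen und Neues"
bad = double_encode(good)

print(bad)                                   # AnkÃ¼ndigungen und Neues
print(is_valid_utf8(good.encode("utf-8")))   # True
print(is_valid_utf8(bad.encode("utf-8")))    # True
```

So a validity check alone cannot separate the two cases; the content has to be inspected.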
What a mess - or can someone explain why it should look like that? Wrong options to mysqldump?
Wrong mysqldump options are unlikely IMHO - I'm not aware of an option that could break the encoding in _half of_ the dump. I have a guess (based on something I hit on my own server), but I'd love to be wrong - otherwise we'll have *lots of* fun to get everything fixed. Sadly what you describe sounds like I might be right.
Yeah, I've been getting increasingly depressed after I discovered this. I have managed to fix other such screw-ups (elsewhere), but only with the regular European accents and umlauts. Not Cyrillic and Chinese.
Options I see:
- import everything in a way that the latest posts look correct (and ignore that posts from some years ago might have broken encoding). That's not perfect, but might be good enough.
That would also be my preferred way out, but I'm not sure it is possible - once the upgrade has completed, the first visible issue is with the "Ankündigungen und Neues" under "Community/Talk": Currently, it displays the double encoded 'ü', but the link works. This is without using an explicit client side character set, which means defaulting to 'utf8mb4'. If I change client side charset to 'latin1', the "Ankündigungen und Neues" is correctly displayed, but the link is wrong. Grepping through the database export, I see 114 lines of "Ankündigungen und Neues", and 34 with 'ü'.
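A rough way to size the problem is to classify text by whether one layer of double encoding can be peeled off cleanly. The Python sketch below is only a heuristic, and it assumes the damage came from UTF-8 bytes being read as cp1252/latin1; `to_single_bytes`, `repair` and `classify` are hypothetical helpers, not part of any existing tool. The latin-1 fallback covers the five byte values cp1252 leaves undefined, which matters for Cyrillic and other non-Latin text.

```python
def to_single_bytes(text: str) -> bytes:
    """Map each character back to the single byte it was (presumably) misread from."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Bytes undefined in cp1252 show up as C1 control characters.
            # This raises UnicodeEncodeError for anything above U+00FF.
            out += ch.encode("latin-1")
    return bytes(out)


def repair(line: str) -> str:
    """Remove one layer of double encoding, or return the line unchanged."""
    try:
        return to_single_bytes(line).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return line


def classify(lines):
    """Count pure-ASCII, apparently clean, and apparently double-encoded lines."""
    counts = {"ascii": 0, "clean": 0, "double": 0}
    for line in lines:
        if line.isascii():
            counts["ascii"] += 1
        elif repair(line) != line:
            counts["double"] += 1
        else:
            counts["clean"] += 1
    return counts


sample = [
    "Ankündigungen und Neues",
    "AnkÃ¼ndigungen und Neues",
    "plain ASCII line",
    "\u00d0\u0090",  # Cyrillic 'А' (U+0410) after the same assumed damage
]
print(classify(sample))   # {'ascii': 1, 'clean': 1, 'double': 2}
print(repair(sample[1]))  # Ankündigungen und Neues
```

Because the dump has very long lines, a line that mixes clean and double-encoded text would be counted as clean here, so a real pass would probably need to work per field or per row rather than per line.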
The goal should be to have the correct encoding in the database (so that the mysql shell client gives you a correctly encoded result).
Right. -- Per Jessen, Zürich (5.9°C) Member, openSUSE Heroes