Hello,

On Tuesday, 10 March 2020, 12:00:10 CET, Per Jessen wrote:
Per Jessen wrote:
I think the next step is to try out the scripts that Malcolm sent a link to.
FWIW, I quickly hit the same issue Christian reported on IRC - duplicate key in vb_tag. "menu" is treated as equal to "menü" unless we use the collation utf8mb4_bin.
Then you should probably use this collation ;-)
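
To illustrate why the import trips over a duplicate key, here is a rough Python sketch. It assumes vb_tag currently uses one of the accent- and case-insensitive _ci collations; fold_like_ci_collation and find_collisions are made-up names, and the folding only approximates what MySQL really does.

```python
import unicodedata


def fold_like_ci_collation(word: str) -> str:
    """Rough stand-in for an accent- and case-insensitive collation (assumption)."""
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks, so that "ü" compares like "u".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def find_collisions(tags, key):
    """Return (first_seen, duplicate) pairs for tags whose keys collide."""
    seen = {}
    collisions = []
    for tag in tags:
        k = key(tag)
        if k in seen:
            collisions.append((seen[k], tag))
        else:
            seen[k] = tag
    return collisions


tags = ["menu", "menü"]
# Insensitive comparison: both tags map to the same unique key.
print(find_collisions(tags, fold_like_ci_collation))
# Byte-wise comparison, which is what a _bin collation amounts to: no clash.
print(find_collisions(tags, lambda t: t.encode("utf-8")))
```

With a byte-wise key the two tags stay distinct, which is why utf8mb4_bin avoids the error.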
Looking at the database export from Provo, I see:
... für wen Tumbleweed empfohlen wird, möchte ich mir ... ... wieder benützen möchtest und falls er überhaupt ... Ankündigungen und Neues ... habe ich eine Änderung übersehen
All clearly _intended_ as UTF8.
I assume your editor was in utf8 mode? I'll assume "yes". This looks double-encoded. (Just to be sure, grep the relevant line from the sqldump and pipe this line into "file -".) Do you know the date when this was posted?
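
For anyone not familiar with double encoding, a small Python sketch of how it happens and how one round of it can be reversed. This assumes the misread charset was Latin-1; a real MySQL "latin1" connection behaves more like cp1252, so an actual fix may need that codec instead. Both function names are made up for this example.

```python
def double_encode(text: str) -> str:
    """Simulate the damage: UTF-8 bytes misread as Latin-1 characters."""
    return text.encode("utf-8").decode("latin-1")


def undo_double_encoding(text: str) -> str:
    """Reverse one round of double encoding; return the text unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either a character outside Latin-1 or bytes that are not valid UTF-8,
        # so the text was most likely not double-encoded.
        return text


broken = double_encode("für wen Tumbleweed empfohlen wird")
print(broken)                                    # mojibake version
print(undo_double_encoding(broken))              # repaired
print(undo_double_encoding("Kurz: Ändert bitte"))  # already correct, left alone
```

Correctly encoded text with umlauts fails the round trip and is returned as-is, so the same function can be run over a mix of old and new posts. It can still misfire on text that legitimately contains sequences like "Ã¼", so treat it as a starting point only.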
Elsewhere, I see e.g.:
Kurz: Ändert bitte bis spätestens 31.03.2013
Clearly correct UTF8.
Right. (Also pipe this line into "file -" to be sure.)
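
If "file" is not at hand, roughly the same check can be sketched in Python. classify_line is a made-up name, and this only approximates what file(1) reports.

```python
def classify_line(raw: bytes) -> str:
    """Coarse guess at the encoding of one line of raw bytes."""
    if raw.isascii():
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


samples = [
    b"Kurz",                     # plain ASCII
    "Ändert".encode("utf-8"),    # correct UTF-8
    "Ändert".encode("latin-1"),  # Latin-1 bytes
]
for raw in samples:
    print(raw, classify_line(raw))
```

Note that double-encoded text is still valid UTF-8, so a check like this only rules out raw Latin-1 bytes in the dump.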
I also see what I believe to be Chinese, but as gobbledegook, not in UTF8.
What a mess - or can someone explain why it should look like that? Wrong options to mysqldump ?
Wrong mysqldump options are unlikely IMHO - I'm not aware of an option that could break the encoding in _half of_ the dump.

I have a guess (based on something I hit on my own server), but I'd love to be wrong - otherwise we'll have *lots of* fun getting everything fixed. Sadly, what you describe sounds like I might be right.

My guess is that a forum update some years ago changed something in the encoding (the client charset and/or the encoding vB uses when talking to the database). If I'm right, this means that the encoding of "old" and "newer" posts is different. You should be able to find the date of the encoding change by looking at a (ideally busy) non-English forum. You should see the same problem on forums.o.o.

Getting this fixed will be, well, interesting[tm]. When I hit a similar issue on my own server, it luckily only affected 20 entries in a small table, and I fixed it manually. Needless to say, this won't be possible for the forums (unless someone is _really_ bored ;-)

Options I see:
- import everything in a way that the latest posts look correct (and ignore that posts from some years ago might have broken encoding). That's not perfect, but might be good enough.
- parse the sqldump with a script that checks for double-encoded utf8 and prints out a fixed dump with correct encoding. Sadly I have no idea if such a script exists :-( - the only thing I found is https://metacpan.org/pod/Encoding::FixLatin (probably can't fix double encoding)
- like the previous option, but do it inside the database (with CONVERT()). That's as tricky as the previous option (and we'll have to identify all tables and columns that need to be fixed), but maybe we can use a WHERE clause based on the date. https://stackoverflow.com/questions/11436594/ might help (untested).

The goal should be to have the correct encoding in the database (so that the mysql shell client gives you a correctly encoded result).
Everything else will cause additional pain with running vB and later exporting to $replacement. (And yes, I know that stating this goal is much easier than reaching it ;-)


Regards,

Christian Boltz

PS: Somewhat related, but I only skimmed that page: http://mysql.rjweb.org/doc.php/charcoll
-- 
> I run opensuse 12.3
This is opensuse-factory@, you shouldn't come here with something
that old. :-)
[> Matwey V. Kornilov and Andreas Schwab in opensuse-factory]