[heroes] forums migration status
Good morning, I suppose I have _some_ progress to report - after a lot of trail&error and help and suggestions, at least the front page can be made to look correct. Unfortunately that is only half the problem. Here is what I did - Yesterday I ran a last ditch import, with no conversions, just the raw data from Provo in Latin1. I figured that had to work when it works in Provo. Maybe we can consider a correct move/conversion to UTF8 later on. Then I ran the upgrade - with the stand-alone database, this can now be done in less than three hours. With practice, it is also easily repeated. The process warned me "that I should probably set the client charset to utf8mb4", so I did. (mysqli property set in php, and also the default in mysql anyway). I started up the forums this morning, corrected a couple of settings and moved it to https://forums-nbg.o.o. First attempt - it was still showing what is clearly UTF8 characters on the front page. Easily spotted in some German text, and the links to external forums in Español and Français. I decided to try removing the client charset and just let it default (to whatever). The pages now look fine, but the links don't work. Switch back to client charset = utf8, and the links work, but the text is wrong. Change the client charset to "latin1" and the page looks fine, the links don't. I have been staring at this for three-four days, I am having trouble seeing the forest for the trees. I think the next step is to try out the scripts that Malcolm sent a link to. -- Per Jessen, Zürich (5.9°C) Member, openSUSE Heroes -- To unsubscribe, e-mail: heroes+unsubscribe@opensuse.org To contact the owner, e-mail: heroes+owner@opensuse.org
Per Jessen wrote:
I think the next step is to try out the scripts that Malcolm sent a link to.
FWIW, I quickly hit the same issue Christian reported on IRC - duplicate key in vb_tag. "menu" is "menü" unless we use collate utf8mb4_bin Looking at the database export from Provo, I see: ... für wen Tumbleweed empfohlen wird, möchte ich mir ... ... wieder benützen möchtest und falls er überhaupt ... Ankündigungen und Neues ... habe ich eine Änderung übersehen All clearly _intended_ as UTF8. Elsewhere, I see e.g.: Kurz: Ändert bitte bis spätestens 31.03.2013 Clearly correct UTF8. I also see what I believe to be Chinese, but as gobbledegook, not in UTF8. What a mess - or can someone explain why it should look like that? Wrong options to mysqldump ? -- Per Jessen, Zürich (6.3°C) Member, openSUSE Heroes -- To unsubscribe, e-mail: heroes+unsubscribe@opensuse.org To contact the owner, e-mail: heroes+owner@opensuse.org
Hello, Am Dienstag, 10. März 2020, 12:00:10 CET schrieb Per Jessen:
Per Jessen wrote:
I think the next step is to try out the scripts that Malcolm sent a link to.
FWIW, I quickly hit the same issue Christian reported on IRC - duplicate key in vb_tag. "menu" is "menü" unless we use collate utf8mb4_bin
Then you should probably use this collation ;-)
Looking at the database export from Provo, I see:
... für wen Tumbleweed empfohlen wird, möchte ich mir ... ... wieder benützen möchtest und falls er überhaupt ... Ankündigungen und Neues ... habe ich eine Änderung übersehen
All clearly _intended_ as UTF8.
I assume your editor was in utf8 mode? I'll assume "yes". This looks double-encoded. (Just to be sure, grep the relevant line from the sqldump and pipe this line into "file -".) Do you know the date when this was posted?
Elsewhere, I see e.g.:
Kurz: Ändert bitte bis spätestens 31.03.2013
Clearly correct UTF8.
Right. (Also pipe this line into "file -" to be sure.)
I also see what I believe to be Chinese, but as gobbledegook, not in UTF8.
What a mess - or can someone explain why it should look like that? Wrong options to mysqldump ?
I run opensuse 12.3 This is opensuse-factory@, you shouldn't come here with something
Wrong mysqldump options are unlikely IMHO - I'm not aware of an option that could break the encoding in _half of_ the dump. I have a guess (based on something I hit on my own server), but I'd love to be wrong - otherwise we'll have *lots of* fun to get everything fixed. Sadly what you describe sounds like I might be right. My guess is that a forum update some years ago changed something in the encoding (the client charset and/or the encoding vB uses when talking to the database). If I'm right, this means that the encoding of "old" and "newer" posts is different. You should be able to find the date of the encoding change by looking at a (ideally busy) non-english forum. You should see the same problem on forums.o.o. Getting this fixed will be, well, interesting[tm]. When I hit a similar issue on my own server, it luckily only affected 20 entries in a small table, and I fixed it manually. Needless to say that this won't be possible for the forums (unless someone is _really_ bored ;-) Options I see: - import everything in a way that the latest posts look correct (and ignore that posts from some years ago might have broken encoding). That's not perfect, but might be good enough. - parse the sqldump with a script that checks for double-encoded utf8 and prints out a fixed dump with correct encoding. Sadly I have no idea if such a script exists :-( - the only thing I found is https://metacpan.org/pod/Encoding::FixLatin (probably can't fix double encoding) - like the previous option, but do it inside the database (with CONVERT()). That's as tricky as the previous option (and we'll have to identify all tables and columns that need to be fixed), but maybe we can use a WHERE clause based on the date. https://stackoverflow.com/questions/11436594/ might help (untested). The goal should be to have the correct encoding in the database (so that the mysql shell client gives you a correctly encoded result). Everything else will cause additional pain with running vB and later exporting to $replacement. (And yes, I know that stating this goal is much easier than reaching it ;-) Regards, Christian Boltz PS: Somewhat related, but I only scrolled over that page: http://mysql.rjweb.org/doc.php/charcoll -- that old. :-) [> Matwey V. Kornilov and Andreas Schwab in opensuse-factory] -- To unsubscribe, e-mail: heroes+unsubscribe@opensuse.org To contact the owner, e-mail: heroes+owner@opensuse.org
Op dinsdag 10 maart 2020 15:26:37 CET schreef Christian Boltz:
Hello,
Am Dienstag, 10. März 2020, 12:00:10 CET schrieb Per Jessen:
Per Jessen wrote:
I think the next step is to try out the scripts that Malcolm sent a link to.
FWIW, I quickly hit the same issue Christian reported on IRC - duplicate key in vb_tag. "menu" is "menü" unless we use collate utf8mb4_bin
Then you should probably use this collation ;-)
Looking at the database export from Provo, I see:
... für wen Tumbleweed empfohlen wird, möchte ich mir ... ... wieder benützen möchtest und falls er überhaupt ... Ankündigungen und Neues ... habe ich eine Änderung übersehen
All clearly _intended_ as UTF8.
I assume your editor was in utf8 mode? I'll assume "yes".
This looks double-encoded. (Just to be sure, grep the relevant line from the sqldump and pipe this line into "file -".)
Do you know the date when this was posted?
Elsewhere, I see e.g.:
Kurz: Ändert bitte bis spätestens 31.03.2013
Clearly correct UTF8.
Right. (Also pipe this line into "file -" to be sure.)
I also see what I believe to be Chinese, but as gobbledegook, not in UTF8.
What a mess - or can someone explain why it should look like that? Wrong options to mysqldump ?
Wrong mysqldump options are unlikely IMHO - I'm not aware of an option that could break the encoding in _half of_ the dump.
I have a guess (based on something I hit on my own server), but I'd love to be wrong - otherwise we'll have *lots of* fun to get everything fixed. Sadly what you describe sounds like I might be right.
My guess is that a forum update some years ago changed something in the encoding (the client charset and/or the encoding vB uses when talking to the database).
If I'm right, this means that the encoding of "old" and "newer" posts is different. You should be able to find the date of the encoding change by looking at a (ideally busy) non-english forum. You should see the same problem on forums.o.o.
Getting this fixed will be, well, interesting[tm].
When I hit a similar issue on my own server, it luckily only affected 20 entries in a small table, and I fixed it manually. Needless to say that this won't be possible for the forums (unless someone is _really_ bored ;-)
Options I see: - import everything in a way that the latest posts look correct (and ignore that posts from some years ago might have broken encoding). That's not perfect, but might be good enough. - parse the sqldump with a script that checks for double-encoded utf8 and prints out a fixed dump with correct encoding. Sadly I have no idea if such a script exists :-( - the only thing I found is https://metacpan.org/pod/Encoding::FixLatin (probably can't fix double encoding) - like the previous option, but do it inside the database (with CONVERT()). That's as tricky as the previous option (and we'll have to identify all tables and columns that need to be fixed), but maybe we can use a WHERE clause based on the date. https://stackoverflow.com/questions/11436594/ might help (untested).
The goal should be to have the correct encoding in the database (so that the mysql shell client gives you a correctly encoded result). Everything else will cause additional pain with running vB and later exporting to $replacement. (And yes, I know that stating this goal is much easier than reaching it ;-)
Regards,
Christian Boltz
PS: Somewhat related, but I only scrolled over that page: http://mysql.rjweb.org/doc.php/charcoll
I run opensuse 12.3
This is opensuse-factory@, you shouldn't come here with something that old. :-) [> Matwey V. Kornilov and Andreas Schwab in opensuse-factory] From memory, but this might be related: we saw these encoding problems ( at least ) when the Russian subforums were started, and again when the Polish ones were. Kim fixed those issues, but no idea how.
-- Gertjan Lettink a.k.a. Knurpht openSUSE Forums Team -- To unsubscribe, e-mail: heroes+unsubscribe@opensuse.org To contact the owner, e-mail: heroes+owner@opensuse.org
Christian Boltz wrote:
Hello,
Am Dienstag, 10. März 2020, 12:00:10 CET schrieb Per Jessen:
Per Jessen wrote:
I think the next step is to try out the scripts that Malcolm sent a link to.
FWIW, I quickly hit the same issue Christian reported on IRC - duplicate key in vb_tag. "menu" is "menü" unless we use collate utf8mb4_bin
Then you should probably use this collation ;-)
I did try it, and it does fix the vb_tag problem.
Looking at the database export from Provo, I see:
... für wen Tumbleweed empfohlen wird, möchte ich mir ... ... wieder benützen möchtest und falls er überhaupt ... Ankündigungen und Neues ... habe ich eine Änderung übersehen
All clearly _intended_ as UTF8.
I assume your editor was in utf8 mode? I'll assume "yes".
I was using 'less', but yes, everything is in utf8 mode here.
This looks double-encoded. (Just to be sure, grep the relevant line from the sqldump and pipe this line into "file -".)
/dev/stdin: UTF-8 Unicode text, with very long lines
Do you know the date when this was posted?
Nope, I don't klnow the database that well.
Elsewhere, I see e.g.:
Kurz: Ändert bitte bis spätestens 31.03.2013
Clearly correct UTF8.
Right. (Also pipe this line into "file -" to be sure.)
/dev/stdin: UTF-8 Unicode text, with very long lines
What a mess - or can someone explain why it should look like that? Wrong options to mysqldump ?
Wrong mysqldump options are unlikely IMHO - I'm not aware of an option that could break the encoding in _half of_ the dump. I have a guess (based on something I hit on my own server), but I'd love to be wrong - otherwise we'll have *lots of* fun to get everything fixed. Sadly what you describe sounds like I might be right.
Yeah, I've been getting increasingly depressed after I discovered this. I have managed to fix other such screw-ups (elsewhere), but only with the regular European accents and umlauts. Not Cyrillic and Chinese.
Options I see: - import everything in a way that the latest posts look correct (and ignore that posts from some years ago might have broken encoding). That's not perfect, but might be good enough.
That would also be my preferred way out, but I'm not sure it is possible - once the upgrade has completed, the first visible issue is with the "Ankündingungen und Neues" under "Community/Talk" : Currently, it displays the double encoded 'ü', but the link works. This is without using an explicit client side character set, which means defaulting to 'utf8mb4'. If I change client side charset to 'latin1', the "Ankündingungen und Neues" is correctly displayed, but the link is wrong. Grep'ing through the database export, I see 114 lines of "Ankündigungen und Neues", and 34 with 'ü'.
The goal should be to have the correct encoding in the database (so that the mysql shell client gives you a correctly encoded result).
Right. -- Per Jessen, Zürich (5.9°C) Member, openSUSE Heroes -- To unsubscribe, e-mail: heroes+unsubscribe@opensuse.org To contact the owner, e-mail: heroes+owner@opensuse.org
On 10/03/2020 17.21, Per Jessen wrote: ...
Yeah, I've been getting increasingly depressed after I discovered this. I have managed to fix other such screw-ups (elsewhere), but only with the regular European accents and umlauts. Not Cyrillic and Chinese.
Chinese can only be utf, I understand -- Cheers / Saludos, Carlos E. R. (from oS Leap 15.0 x86_64 (Minas Tirith))
Hello, On Tue, 10 Mar 2020, Carlos E. R. wrote:
On 10/03/2020 17.21, Per Jessen wrote:
...
Yeah, I've been getting increasingly depressed after I discovered this. I have managed to fix other such screw-ups (elsewhere), but only with the regular European accents and umlauts. Not Cyrillic and Chinese.
Chinese can only be utf, I understand
Hmm? At least GB18030 (plus predecessors) and Big5 (plus variants) are in wide use. (I don't know if the forums support those of course, but as both encodings are POSIX compatible I don't see why they wouldn't). Ciao, Michael. -- To unsubscribe, e-mail: heroes+unsubscribe@opensuse.org To contact the owner, e-mail: heroes+owner@opensuse.org
Michael Matz wrote:
Hello,
On Tue, 10 Mar 2020, Carlos E. R. wrote:
On 10/03/2020 17.21, Per Jessen wrote:
...
Yeah, I've been getting increasingly depressed after I discovered this. I have managed to fix other such screw-ups (elsewhere), but only with the regular European accents and umlauts. Not Cyrillic and Chinese.
Chinese can only be utf, I understand
Hmm? At least GB18030 (plus predecessors) and Big5 (plus variants) are in wide use. (I don't know if the forums support those of course, but as both encodings are POSIX compatible I don't see why they wouldn't).
The Chinese characters are double encoded UTF8 (probably), but I think I have managed to find the postings in http://forums.o.o, and there they are clearly Chinese. -- Per Jessen, Zürich (6.6°C) Member, openSUSE Heroes -- To unsubscribe, e-mail: heroes+unsubscribe@opensuse.org To contact the owner, e-mail: heroes+owner@opensuse.org
participants (5)
-
Carlos E. R.
-
Christian Boltz
-
Knurpht-openSUSE
-
Michael Matz
-
Per Jessen